
ScaleVLN

Scaling Data Generation in Vision-and-Language Navigation


Pre-exploration

[26] Counterfactual vision-and-language navigation via adversarial path sampler. In European Conference on Computer Vision, pages 71–86. Springer, 2020.

[72] Learning to navigate unseen environments: Back translation with environmental dropout

[90] Vision-language navigation with self-supervised auxiliary reasoning tasks.

VLN in continuous environments (R2R-CE)

[5] 1st place solutions for RxR-Habitat vision-and-language navigation competition (CVPR 2022)

[30] Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation

[40] Simple and effective synthesis of indoor 3D scenes. arXiv preprint arXiv:2204.02960, 2022.

Scaling Data for Learning VLN

ICRA 2023 VLMaps

Visual Language Maps for Robot Navigation

Motivation

Without grounding language onto a spatial representation, these systems may struggle to:

  • recognize correspondences that associate independent observations of the same object
  • localize spatial goals, e.g., “in between the sofa and the TV”
  • build persistent representations that can be shared across different embodiments, e.g., mobile robots, drones

Two main selling points:

  • Perception and understanding of language and space: localize spatial goals beyond object-centric ones, e.g., “in between the TV and sofa”, “to the right of the chair”, or “kitchen area”, using code-writing LLMs, expanding beyond the capabilities of CoW or LM-Nav.

  • Multi-robot sharing, with different obstacle maps for different robots: the map can be shared among multiple robots with different embodiments to generate new obstacle maps on-the-fly (using a list of obstacle categories).

    Generate new obstacle maps for new embodiments given natural language descriptions of landmark categories that they can or cannot traverse, e.g., “tables” are obstacles for a large mobile robot but traversable for a drone (see the sketch below).
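A minimal Python sketch (my own illustration, not the VLMaps code) of how an embodiment-specific obstacle map could be derived from a shared semantic top-down map, assuming the map stores one category index per grid cell; the function name and interface are hypothetical.

```python
import numpy as np

def obstacle_map_from_categories(semantic_map: np.ndarray,
                                 category_names: list[str],
                                 obstacle_categories: set[str]) -> np.ndarray:
    """Return a boolean grid where True marks cells this embodiment cannot traverse.

    semantic_map: (H, W) array of category indices into `category_names`.
    obstacle_categories: categories that block this particular robot,
        e.g. {"table", "sofa"} for a large mobile base but only {"wall"} for a drone.
    """
    obstacle_ids = {i for i, name in enumerate(category_names)
                    if name in obstacle_categories}
    return np.isin(semantic_map, list(obstacle_ids))

# Hypothetical usage: the same semantic map yields different obstacle maps
# for different embodiments, simply by changing the obstacle-category list.
categories = ["floor", "wall", "table", "sofa"]
semantic_map = np.random.randint(0, len(categories), size=(64, 64))
ground_robot_obstacles = obstacle_map_from_categories(
    semantic_map, categories, {"wall", "table", "sofa"})
drone_obstacles = obstacle_map_from_categories(
    semantic_map, categories, {"wall"})
```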

Both LM-Nav [13] and CoW [12] are limited to navigating to object landmarks and are less capable of understanding finer-grained queries, such as “to the left of the chair” and “in between the TV and the sofa”.



ConceptFusion

https://concept-fusion.github.io/


Motivation

Building a 3D map of the environment is at the core of many robotics tasks, but most mapping approaches are limited to:

  • a closed-set setting, handling only categories predefined at training time
  • queries by class name or text prompt only (a single modality)

ConceptFusion's three contributions:

  • open-set multimodal 3D mapping that constructs map representations queryable by text, image, audio, and click queries in a zero-shot manner

  • A novel mechanism to compute pixel-aligned (local) features from foundation models that can only generate image-level (global) feature vectors
  • A new RGB-D dataset, UnCoCo, to evaluate open-set multimodal 3D mapping. UnCoCo comprises 78 common household/office objects tagged with more than 500K queries across modalities
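A rough sketch of the idea behind the second contribution, under my own assumptions (class-agnostic mask proposals and an `encode_image` callable standing in for a foundation model; the fusion weighting in ConceptFusion itself is more elaborate):

```python
import numpy as np

def pixel_aligned_features(image: np.ndarray,
                           masks: list[np.ndarray],
                           encode_image,            # assumed: img -> (D,) global embedding
                           blend: float = 0.5) -> np.ndarray:
    """Blend image-level and region-level embeddings into a per-pixel feature map."""
    h, w = image.shape[:2]
    global_feat = encode_image(image)                        # (D,) whole-image feature
    feat_map = np.zeros((h, w, global_feat.shape[0]))
    for mask in masks:                                       # mask: (H, W) bool region
        if not mask.any():
            continue
        ys, xs = np.nonzero(mask)
        crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        local_feat = encode_image(crop)                      # region-level feature
        fused = blend * global_feat + (1 - blend) * local_feat
        feat_map[mask] = fused / np.linalg.norm(fused)       # unit norm for cosine queries
    return feat_map
```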

CLIP-Fields

CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory

project

  • CLIP-Fields proposes an implicit scene representation that can be used for segmentation, instance identification, semantic search over space, and view localization
  • It uses CLIP, Detic, and Sentence-BERT as weak supervision to build semantic neural fields, without any direct human-annotated labels (a sketch follows below)
  • It achieves better few-shot performance
  • Using CLIP-Fields as scene memory enables real-world semantic navigation on a robot
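A minimal sketch of what such a semantic neural field could look like, assuming an MLP over raw 3D coordinates and per-pixel CLIP/Detic features as the weak supervision target; the real CLIP-Fields model uses positional encodings and separate heads, so treat this only as the general idea.

```python
import torch
import torch.nn as nn

class SemanticField(nn.Module):
    """Implicit field: 3D point -> unit-norm semantic embedding."""
    def __init__(self, embed_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        feat = self.mlp(xyz)                              # (N, embed_dim)
        return feat / feat.norm(dim=-1, keepdim=True)

# Hypothetical training step: pull each 3D point's embedding toward the
# CLIP/Detic feature of the pixel it was back-projected from (the weak supervision).
field = SemanticField()
optimizer = torch.optim.Adam(field.parameters(), lr=1e-3)
points = torch.randn(1024, 3)                             # stand-in for depth back-projections
targets = torch.randn(1024, 512)                          # stand-in for per-pixel weak labels
targets = targets / targets.norm(dim=-1, keepdim=True)

optimizer.zero_grad()
loss = 1 - (field(points) * targets).sum(dim=-1).mean()   # cosine loss
loss.backward()
optimizer.step()
```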



3D-LLM

3D-LLM: Injecting the 3D World into Large Language Models

MLLMs such as Flamingo and BLIP-2 can understand and reason about 2D images, but they are not grounded in the 3D physical world and therefore cannot exploit the rich information available in 3D, such as spatial relations, physics, and interaction.

By taking a 3D representation of the scene as input, a 3D-LLM gains two advantages:

  • Long-term memory about the whole scene can be stored in a holistic 3D representation, rather than in fragmented partial-view observations
  • 3D properties such as affordances and spatial relations can be reasoned about from the 3D representation, well beyond what language-only or 2D-image-based LLMs can do

Contributions of this paper:

  • 3D data paired with language descriptions is hard to obtain, so the paper proposes a data-generation pipeline that produces large-scale 3D data paired with language, yielding about 300K 3D-language pairs covering multiple tasks
  • How to extract 3D features that can be aligned with language features? (see the sketch after this list)
    • Construct 3D features from the features of rendered multi-view 2D images
    • By converting 3D to 2D, off-the-shelf VLMs such as BLIP-2 can be used directly
  • How to perceive 3D spatial relations?
    • A 3D localization mechanism that bridges the gap between language and spatial locations
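The 2D-to-3D feature lifting mentioned above could look roughly like the following sketch; the projection interface and names are my assumptions, not the paper's code.

```python
import numpy as np

def aggregate_multiview_features(points: np.ndarray,      # (P, 3) scene point cloud
                                 feature_maps: list,      # per view: (H, W, D) 2D features
                                 projections: list) -> np.ndarray:
    """Average per-pixel 2D features over all views in which each 3D point is visible.

    `projections[i](points)` is a hypothetical camera model returning integer pixel
    coordinates (us, vs), each of shape (P,), and a boolean visibility mask of shape (P,).
    """
    P = points.shape[0]
    D = feature_maps[0].shape[-1]
    accum = np.zeros((P, D))
    counts = np.zeros(P)
    for feat, project in zip(feature_maps, projections):
        (us, vs), visible = project(points)
        accum[visible] += feat[vs[visible], us[visible]]   # gather this view's feature per point
        counts[visible] += 1
    counts = np.maximum(counts, 1)                         # unseen points stay at zero
    return accum / counts[:, None]                         # (P, D) per-point 3D features
```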



Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts

Motivation

VLMs are usually adapted zero-shot to downstream datasets by designing dataset-specific prompts.

When applying CLIP to a specific domain, a straightforward prompt-engineering approach is:

to classify bird images, one could construct a prompt ‘a photo of {classname}, a type of bird’.

However, such a prompt is actually not optimal:

  • It cannot inject useful expert knowledge and fails to capture the domain expertise of the target domain
  • High variance: small changes to the prompt can cause large changes in performance
  • The only discriminative information is the class name, and the class name alone is not enough for zero-shot recognition of some classes; for example, the green heron has “green” in its name but visually looks closer to a black bittern than to a green woodpecker

Therefore, visually descriptive textual (VDT) information needs to be introduced:

we define VDT as a set of sentences that describe the visual features of the class under consideration including shape, size, color, environment, patterns, composition, etc.

For CUB, the dataset itself provides domain expertise; other datasets do not, so an LLM is brought in to construct richer prompts. The authors use GPT-4 to generate visually descriptive text about each class, with the prompt explicitly emphasizing visual cues such as shape, color, structure, and composition.

For the high-variance problem, the authors ensemble prompts over the VDT sentences, which reduces CLIP’s performance sensitivity to small changes in the prompt.
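A minimal sketch of this VDT prompt ensembling, assuming the open-source OpenAI CLIP package and illustrative (not GPT-4-generated) descriptions:

```python
import torch
import clip  # the open-source OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative VDT sentences; in the paper these come from GPT-4 with a prompt
# emphasizing shape, color, structure, and composition.
vdt = {
    "green heron": [
        "a small stocky heron with a glossy greenish-black cap",
        "a wading bird with a chestnut neck and short yellow legs",
    ],
    "green woodpecker": [
        "a woodpecker with a green back and a bright red crown",
        "a bird with a pale belly and a dark mask around the eyes",
    ],
}

with torch.no_grad():
    prototypes = []
    for classname, sentences in vdt.items():
        tokens = clip.tokenize(sentences).to(device)
        feats = model.encode_text(tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        prototypes.append(feats.mean(dim=0))               # ensemble over VDT sentences
    prototypes = torch.stack(prototypes)
    prototypes = prototypes / prototypes.norm(dim=-1, keepdim=True)

# An image is then classified by cosine similarity between its CLIP image
# embedding and these ensembled class prototypes.
```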



PROTO-CLIP

PROTO-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

Motivation

In the robot-perception setting:

  • Object-model-based methods build a 3D model of each object and use it for recognition, but large collections of 3D models are hard to obtain in the real world
  • Object-category-based methods recognize a limited set of categories, and it is hard to collect large numbers of images for every category
  • Internet data such as ImageNet and Visual Genome has a domain gap with robot manipulation

Learning to recognize a new object from only a few example images is therefore key to scaling up the set of objects a robot can recognize.


Based on few-shot training images, PROTO-CLIP adapts both the image and text encoders, aligning image prototypes and text prototypes during adaptation to improve few-shot classification performance.
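A rough sketch of the prototype idea under my own assumptions (mean support-image embeddings as image prototypes, class-name embeddings as text prototypes, and a cosine alignment loss; the actual PROTO-CLIP adapters and losses differ):

```python
import torch
import torch.nn.functional as F

def prototypes_from_support(support_feats: torch.Tensor,   # (C, K, D): K support images per class
                            text_feats: torch.Tensor):     # (C, D): class-name text embeddings
    """Build image/text prototypes and a cosine alignment loss between them."""
    img_proto = F.normalize(support_feats.mean(dim=1), dim=-1)    # (C, D)
    txt_proto = F.normalize(text_feats, dim=-1)                   # (C, D)
    align_loss = 1 - (img_proto * txt_proto).sum(dim=-1).mean()   # pull the two together
    return img_proto, txt_proto, align_loss

def classify(query_feats: torch.Tensor, img_proto, txt_proto, alpha: float = 0.5):
    """Score queries against a blend of image and text prototypes."""
    q = F.normalize(query_feats, dim=-1)                          # (N, D)
    logits = alpha * q @ img_proto.T + (1 - alpha) * q @ txt_proto.T
    return logits.argmax(dim=-1)
```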


Navigating to Objects in the Real World


Task challenges:

  • The robot needs spatial scene understanding to distinguish free space from obstacles
  • It needs semantic scene understanding to detect objects
  • It also needs to learn semantic exploration priors. For example, if a person wanted to find a toilet in this scene, most of us would head for the hallway, because it is most likely to lead to the toilet. Teaching an agent this kind of common sense, or semantic prior, is challenging.
  • While exploring the scene to find the target object, the robot also needs long-term episodic memory to keep track of explored and unexplored areas (a sketch follows below).
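A minimal sketch (my own illustration, not the paper's implementation) of the episodic memory mentioned in the last point: a top-down grid that records which cells have been observed, so unexplored regions can be targeted next.

```python
import numpy as np

class ExploredMap:
    """Top-down grid marking which cells the robot has already observed."""
    def __init__(self, size: int = 200, resolution: float = 0.05):
        self.resolution = resolution                       # meters per cell
        self.explored = np.zeros((size, size), dtype=bool)

    def update(self, robot_xy: tuple, sensor_range: float = 2.0):
        """Mark every cell within sensor range of the robot's position as explored."""
        h, w = self.explored.shape
        ys, xs = np.mgrid[0:h, 0:w]
        cx, cy = robot_xy[0] / self.resolution, robot_xy[1] / self.resolution
        dist = np.hypot(ys - cy, xs - cx) * self.resolution
        self.explored |= dist <= sensor_range

    def unexplored_fraction(self) -> float:
        return 1.0 - float(self.explored.mean())
```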

This paper focuses on object-goal navigation in the real world and compares three families of methods: classical, end-to-end, and modular learning.


Read more »