
ScaleVLN

Scaling Data Generation in Vision-and-Language Navigation


Pre-exploration

[26] Counterfactual vision-and-language navigation via adversarial path sampler. In European Conference on Computer Vision, pages 71–86. Springer, 2020.

[72] Learning to navigate unseen environments: Back translation with environmental dropout

[90] Vision-language navigation with self-supervised auxiliary reasoning tasks.

VLN in continuous environments (R2R-CE)

[5] 1st place solutions for RxR-Habitat vision-and-language navigation competition (CVPR 2022)

[30] Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation

[40] Simple and effective synthesis of indoor 3D scenes. arXiv preprint arXiv:2204.02960, 2022.

Scaling Data for Learning VLN

ICRA 2023 VLMaps

Visual Language Maps for Robot Navigation

Motivation

Without grounding language onto a spatial representation, these systems may struggle to:

  • recognize correspondences that associate independent observations of the same object
  • localize spatial goals, e.g., “in between the sofa and the TV”
  • build persistent representations that can be shared across different embodiments, e.g., mobile robots, drones

Two main selling points:

  • Perception and understanding of language and space: localize spatial goals beyond object-centric ones, e.g., “in between the TV and sofa”, “to the right of the chair”, or “kitchen area”, using code-writing LLMs, expanding beyond the capabilities of CoW or LM-Nav.

  • Multi-robot sharing, with different obstacle maps for different robots: the map can be shared among multiple robots with different embodiments to generate new obstacle maps on-the-fly (using a list of obstacle categories).

    Generate new obstacle maps for new embodiments given natural language descriptions of landmark categories that they can or cannot traverse, e.g., “tables” are obstacles for a large mobile robot but traversable for a drone (see the sketch below).
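A minimal Python sketch (my own illustration, not the VLMaps code) of how an embodiment-specific obstacle map could be derived from a shared semantic top-down map, assuming the map stores one category index per grid cell; the function name and interface are hypothetical.

```python
import numpy as np

def obstacle_map_from_categories(semantic_map: np.ndarray,
                                 category_names: list[str],
                                 obstacle_categories: set[str]) -> np.ndarray:
    """Return a boolean grid where True marks cells this embodiment cannot traverse.

    semantic_map: (H, W) array of category indices into `category_names`.
    obstacle_categories: categories that block this particular robot,
        e.g. {"table", "sofa"} for a large mobile base but only {"wall"} for a drone.
    """
    obstacle_ids = {i for i, name in enumerate(category_names)
                    if name in obstacle_categories}
    return np.isin(semantic_map, list(obstacle_ids))

# Hypothetical usage: the same semantic map yields different obstacle maps
# for different embodiments, simply by changing the obstacle-category list.
categories = ["floor", "wall", "table", "sofa"]
semantic_map = np.random.randint(0, len(categories), size=(64, 64))
ground_robot_obstacles = obstacle_map_from_categories(
    semantic_map, categories, {"wall", "table", "sofa"})
drone_obstacles = obstacle_map_from_categories(
    semantic_map, categories, {"wall"})
```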

Both LM-Nav [13] and CoW [12] are limited to navigating to object landmarks and are less capable of understanding finer-grained queries, such as “to the left of the chair” and “in between the TV and the sofa”.



ConceptFusion

https://concept-fusion.github.io/


Motivation

Building a 3D map of the environment is at the core of many robotics tasks, but most mapping approaches are limited to:

  • a closed-set setting, handling only categories predefined at training time
  • queries by class name or text prompt only (a single modality)

ConceptFusion's three contributions:

  • open-set multimodal 3D mapping that constructs map representations queryable by text, image, audio, and click queries in a zero-shot manner

  • A novel mechanism to compute pixel-aligned (local) features from foundation models that can only generate image-level (global) feature vectors
  • A new RGB-D dataset, UnCoCo, to evaluate open-set multimodal 3D mapping. UnCoCo comprises 78 common household/office objects tagged with more than 500K queries across modalities
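A rough sketch of the idea behind the second contribution, under my own assumptions (class-agnostic mask proposals and an `encode_image` callable standing in for a foundation model; the fusion weighting in ConceptFusion itself is more elaborate):

```python
import numpy as np

def pixel_aligned_features(image: np.ndarray,
                           masks: list[np.ndarray],
                           encode_image,            # assumed: img -> (D,) global embedding
                           blend: float = 0.5) -> np.ndarray:
    """Blend image-level and region-level embeddings into a per-pixel feature map."""
    h, w = image.shape[:2]
    global_feat = encode_image(image)                        # (D,) whole-image feature
    feat_map = np.zeros((h, w, global_feat.shape[0]))
    for mask in masks:                                       # mask: (H, W) bool region
        if not mask.any():
            continue
        ys, xs = np.nonzero(mask)
        crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        local_feat = encode_image(crop)                      # region-level feature
        fused = blend * global_feat + (1 - blend) * local_feat
        feat_map[mask] = fused / np.linalg.norm(fused)       # unit norm for cosine queries
    return feat_map
```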

CLIP-Fields

CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory

project

  • CLIP-Fields proposes an implicit scene representation that can be used for segmentation, instance identification, semantic search over space, and view localization
  • It uses CLIP, Detic, and Sentence-BERT as weak supervision to build semantic neural fields, without any direct human-annotated labels (a sketch follows below)
  • It achieves better few-shot performance
  • Using CLIP-Fields as scene memory enables real-world semantic navigation on a robot
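A minimal sketch of what such a semantic neural field could look like, assuming an MLP over raw 3D coordinates and per-pixel CLIP/Detic features as the weak supervision target; the real CLIP-Fields model uses positional encodings and separate heads, so treat this only as the general idea.

```python
import torch
import torch.nn as nn

class SemanticField(nn.Module):
    """Implicit field: 3D point -> unit-norm semantic embedding."""
    def __init__(self, embed_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        feat = self.mlp(xyz)                              # (N, embed_dim)
        return feat / feat.norm(dim=-1, keepdim=True)

# Hypothetical training step: pull each 3D point's embedding toward the
# CLIP/Detic feature of the pixel it was back-projected from (the weak supervision).
field = SemanticField()
optimizer = torch.optim.Adam(field.parameters(), lr=1e-3)
points = torch.randn(1024, 3)                             # stand-in for depth back-projections
targets = torch.randn(1024, 512)                          # stand-in for per-pixel weak labels
targets = targets / targets.norm(dim=-1, keepdim=True)

optimizer.zero_grad()
loss = 1 - (field(points) * targets).sum(dim=-1).mean()   # cosine loss
loss.backward()
optimizer.step()
```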



3D-LLM

3D-LLM: Injecting the 3D World into Large Language Models

MLLMs such as Flamingo and BLIP-2 can understand and reason about 2D images, but they are not grounded in the 3D physical world and therefore cannot exploit the rich information available in 3D, such as spatial relations, physics, and interaction.

By taking a 3D representation of the scene as input, a 3D-LLM gains two advantages:

  • Long-term memory about the whole scene can be stored in a holistic 3D representation, rather than in fragmented partial-view observations
  • 3D properties such as affordances and spatial relations can be reasoned about from the 3D representation, well beyond what language-only or 2D-image-based LLMs can do

Contributions of this paper:

  • 3D data paired with language descriptions is hard to obtain, so the paper proposes a data-generation pipeline that produces large-scale 3D data paired with language, yielding about 300K 3D-language pairs covering multiple tasks
  • How to extract 3D features that can be aligned with language features? (see the sketch after this list)
    • Construct 3D features from the features of rendered multi-view 2D images
    • By converting 3D to 2D, off-the-shelf VLMs such as BLIP-2 can be used directly
  • How to perceive 3D spatial relations?
    • A 3D localization mechanism that bridges the gap between language and spatial locations
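The 2D-to-3D feature lifting mentioned above could look roughly like the following sketch; the projection interface and names are my assumptions, not the paper's code.

```python
import numpy as np

def aggregate_multiview_features(points: np.ndarray,      # (P, 3) scene point cloud
                                 feature_maps: list,      # per view: (H, W, D) 2D features
                                 projections: list) -> np.ndarray:
    """Average per-pixel 2D features over all views in which each 3D point is visible.

    `projections[i](points)` is a hypothetical camera model returning integer pixel
    coordinates (us, vs), each of shape (P,), and a boolean visibility mask of shape (P,).
    """
    P = points.shape[0]
    D = feature_maps[0].shape[-1]
    accum = np.zeros((P, D))
    counts = np.zeros(P)
    for feat, project in zip(feature_maps, projections):
        (us, vs), visible = project(points)
        accum[visible] += feat[vs[visible], us[visible]]   # gather this view's feature per point
        counts[visible] += 1
    counts = np.maximum(counts, 1)                         # unseen points stay at zero
    return accum / counts[:, None]                         # (P, D) per-point 3D features
```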



Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts

Motivation

VLMs are usually adapted zero-shot to downstream datasets by designing dataset-specific prompts.

When applying CLIP to a specific domain, a straightforward prompt-engineering approach is:

to classify bird images, one could construct a prompt ‘a photo of {classname}, a type of bird’.

However, such a prompt is actually not optimal:

  • It cannot inject useful expert knowledge and fails to capture the domain expertise of the target domain
  • High variance: small changes to the prompt can cause large changes in performance
  • The only discriminative information is the class name, and the class name alone is not enough for zero-shot recognition of some classes; for example, the green heron has “green” in its name but visually looks closer to a black bittern than to a green woodpecker

Therefore, visually descriptive textual (VDT) information needs to be introduced:

we define VDT as a set of sentences that describe the visual features of the class under consideration including shape, size, color, environment, patterns, composition, etc.

For CUB, the dataset itself provides domain expertise; other datasets do not, so an LLM is brought in to construct richer prompts. The authors use GPT-4 to generate visually descriptive text about each class, with the prompt explicitly emphasizing visual cues such as shape, color, structure, and composition.

For the high-variance problem, the authors ensemble prompts over the VDT sentences, which reduces CLIP’s performance sensitivity to small changes in the prompt.
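A minimal sketch of this VDT prompt ensembling, assuming the open-source OpenAI CLIP package and illustrative (not GPT-4-generated) descriptions:

```python
import torch
import clip  # the open-source OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative VDT sentences; in the paper these come from GPT-4 with a prompt
# emphasizing shape, color, structure, and composition.
vdt = {
    "green heron": [
        "a small stocky heron with a glossy greenish-black cap",
        "a wading bird with a chestnut neck and short yellow legs",
    ],
    "green woodpecker": [
        "a woodpecker with a green back and a bright red crown",
        "a bird with a pale belly and a dark mask around the eyes",
    ],
}

with torch.no_grad():
    prototypes = []
    for classname, sentences in vdt.items():
        tokens = clip.tokenize(sentences).to(device)
        feats = model.encode_text(tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        prototypes.append(feats.mean(dim=0))               # ensemble over VDT sentences
    prototypes = torch.stack(prototypes)
    prototypes = prototypes / prototypes.norm(dim=-1, keepdim=True)

# An image is then classified by cosine similarity between its CLIP image
# embedding and these ensembled class prototypes.
```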



PROTO-CLIP

PROTO-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

Motivation

In the robot-perception setting:

  • Object-model-based methods build a 3D model of each object and use it for recognition, but large collections of 3D models are hard to obtain in the real world
  • Object-category-based methods recognize a limited set of categories, and it is hard to collect large numbers of images for every category
  • Internet data such as ImageNet and Visual Genome has a domain gap with robot manipulation

Learning to recognize a new object from only a few example images is therefore key to scaling up the set of objects a robot can recognize.


Based on few-shot training images, PROTO-CLIP adapts both the image and text encoders, aligning image prototypes and text prototypes during adaptation to improve few-shot classification performance.
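A rough sketch of the prototype idea under my own assumptions (mean support-image embeddings as image prototypes, class-name embeddings as text prototypes, and a cosine alignment loss; the actual PROTO-CLIP adapters and losses differ):

```python
import torch
import torch.nn.functional as F

def prototypes_from_support(support_feats: torch.Tensor,   # (C, K, D): K support images per class
                            text_feats: torch.Tensor):     # (C, D): class-name text embeddings
    """Build image/text prototypes and a cosine alignment loss between them."""
    img_proto = F.normalize(support_feats.mean(dim=1), dim=-1)    # (C, D)
    txt_proto = F.normalize(text_feats, dim=-1)                   # (C, D)
    align_loss = 1 - (img_proto * txt_proto).sum(dim=-1).mean()   # pull the two together
    return img_proto, txt_proto, align_loss

def classify(query_feats: torch.Tensor, img_proto, txt_proto, alpha: float = 0.5):
    """Score queries against a blend of image and text prototypes."""
    q = F.normalize(query_feats, dim=-1)                          # (N, D)
    logits = alpha * q @ img_proto.T + (1 - alpha) * q @ txt_proto.T
    return logits.argmax(dim=-1)
```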


Navigating to Objects in the Real World


Task challenges:

  • The robot needs spatial scene understanding to distinguish free space from obstacles
  • It needs semantic scene understanding to detect objects
  • It also needs to learn semantic exploration priors. For example, if a person wanted to find a toilet in this scene, most of us would head for the hallway, because it is most likely to lead to the toilet. Teaching an agent this kind of common sense, or semantic prior, is challenging.
  • While exploring the scene to find the target object, the robot also needs long-term episodic memory to keep track of explored and unexplored areas (a sketch follows below).
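A minimal sketch (my own illustration, not the paper's implementation) of the episodic memory mentioned in the last point: a top-down grid that records which cells have been observed, so unexplored regions can be targeted next.

```python
import numpy as np

class ExploredMap:
    """Top-down grid marking which cells the robot has already observed."""
    def __init__(self, size: int = 200, resolution: float = 0.05):
        self.resolution = resolution                       # meters per cell
        self.explored = np.zeros((size, size), dtype=bool)

    def update(self, robot_xy: tuple, sensor_range: float = 2.0):
        """Mark every cell within sensor range of the robot's position as explored."""
        h, w = self.explored.shape
        ys, xs = np.mgrid[0:h, 0:w]
        cx, cy = robot_xy[0] / self.resolution, robot_xy[1] / self.resolution
        dist = np.hypot(ys - cy, xs - cx) * self.resolution
        self.explored |= dist <= sensor_range

    def unexplored_fraction(self) -> float:
        return 1.0 - float(self.explored.mean())
```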

This paper focuses on object-goal navigation in the real world and compares three families of methods: classical, end-to-end, and modular learning.


Read more »