Learning to Explore using Active Neural SLAM

Posted on 2023-07-11 Edited on 2023-07-18 In paper

Learning to Explore using Active Neural SLAM

Motivation

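The core problem of navigating in an unknown environment is exploration: how to efficiently visit as much of the environment as possible. Maximizing coverage gives the agent a chance to find a goal in an unknown environment, or lets it efficiently pre-map the environment within a limited time budget.

To explore an unknown environment actively and efficiently, the agent needs to know: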

where it has been before (i.e. Mapping)
where it is now (i.e. Pose Estimation)
where it needs to go (i.e. Planning)
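
As a rough illustration of how these three pieces fit together, the sketch below runs a classical nearest-unvisited-cell exploration on a toy grid. It is not the learned Active Neural SLAM pipeline: the grid layout, the function names, and the assumption that the pose is known exactly are all made up for this example.

```python
from collections import deque

# Toy 2D grid (hypothetical layout): '#' is an obstacle, '.' is free space.
GRID = [
    "........",
    ".##.....",
    ".#..##..",
    "....#...",
]
MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def is_free(cell):
    r, c = cell
    return 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) and GRID[r][c] == "."


def plan_to_nearest_unvisited(pose, visited):
    """Planning: breadth-first search from the pose to the closest unvisited free cell."""
    parent = {pose: None}
    queue = deque([pose])
    while queue:
        cell = queue.popleft()
        if cell not in visited:
            path = []
            while cell != pose:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]  # cells to walk through, ending at the target
        for dr, dc in MOVES:
            nxt = (cell[0] + dr, cell[1] + dc)
            if is_free(nxt) and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return []  # every reachable cell has been visited


def explore(start, step_budget):
    """Return (coverage, steps_used) after exploring within a step budget."""
    pose = start        # pose estimation: assumed exact in this toy
    visited = {start}   # mapping: the set of cells seen so far
    steps = 0
    while steps < step_budget:
        path = plan_to_nearest_unvisited(pose, visited)
        if not path:
            break
        for cell in path:
            if steps >= step_budget:
                break
            pose = cell
            visited.add(cell)
            steps += 1
    total_free = sum(row.count(".") for row in GRID)
    return len(visited) / total_free, steps


coverage, steps = explore((0, 0), step_budget=1000)
print(f"coverage={coverage:.2f} after {steps} steps")
```

Shrinking `step_budget` shows the trade-off from the motivation: a tighter time budget gives lower coverage.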

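How can we train agents to explore autonomously?

One approach is end-to-end deep reinforcement learning. However, learning mapping, pose estimation, and planning implicitly in an end-to-end fashion is expensive and sample-inefficient, so end-to-end reinforcement learning does not scale to large environments.

The proposed method uses learning to obtain:

Structural regularities of indoor environments
Robustness to state-estimation errors
Flexibility with respect to input modalities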

Meta Omnium: A Benchmark for General-Purpose Learning-to-Learn

Posted on 2023-07-11 Edited on 2023-07-13 In paper

Meta Omnium

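Meta Omnium is a dataset spanning multiple vision tasks, including recognition, keypoint localization, semantic segmentation, and regression. It provides a unified framework for evaluating meta-learners consistently across a wide range of vision applications.

Existing benchmarks only test a meta-learner's ability to learn within tasks such as classification or dense prediction. Meta Omnium uniquely tests a meta-learner's ability to learn across multiple task types.
It covers multiple visual domains, from natural images to medical and industrial images.
It supports thorough evaluation of both in-distribution and out-of-distribution generalization.
It defines a clear hyperparameter tuning and model selection protocol to facilitate fair comparison between meta-learning algorithms.
It has a moderate computational cost, making it accessible for research at modestly resourced universities as well as large institutions.

Traditional within-task meta-learning benchmarks rely more on common representation learning than on learning-to-learn. Meta Omnium is a better test of learning-to-learn because its constituent tasks require more diverse representations.

The baselines use minimal task-specific decoders, so the benchmark evaluates learning-to-learn ability rather than built-in priors.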

ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments

Posted on 2023-07-10 Edited on 2023-07-11 In paper

ETPNav

ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments

github

the capability to abstract environments and generate long-range navigation plans
the ability of obstacle-avoiding control in continuous environments

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

Posted on 2023-07-10 In paper

EmbodiedGPT

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

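Vision-language pre-training based on an embodied chain of thought

Reference

Most current large models are trained on datasets for tasks such as human conversation, visual captioning, and visual question answering, which leaves a large domain gap to robotics. Their ability to output accurate plans and executable actions still has much room for improvement. This paper proposes:

EgoCOT: a large-scale embodied planning dataset
Efficient training of the LLM on EgoCOT via prefix tuning
A mechanism for extracting task-relevant features from the planning queries generated by the LLM, forming a closed loop between high-level planning and low-level control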

GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation

Posted on 2023-07-10 Edited on 2023-07-12 In paper

GeoVLN

GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation

Motivation

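Limitations of previous methods:

They rely only on RGB images, which provide very limited 2D visual cues and lack geometric information.
They process each candidate view independently, without considering local spatial context, which leads to inaccurate decisions.
Natural language carries high-level semantic features, and different phrases within an instruction can focus on different aspects of the visual information, such as texture or geometry. Building cross-modal representations with a vanilla attention mechanism leads to suboptimal performance.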

KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation

Posted on 2023-07-10 In paper

KERM

KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation

github

Motivation

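Knowledge provides crucial information that complements visual content, such as abstract information about objects and relationships. Such information is essential for aligning the visual objects in the view images. For example, for Navigable Candidates 2 and 3 in Figure 1, the agent can use the knowledge <light hanging over kitchen island> to decide that it should go to candidate 2.
Knowledge improves the agent's generalization. Instructions and navigable candidates are learned in a limited set of seen environments, so using knowledge helps transfer to unseen environments.
Knowledge increases the capacity of VLN models. As rich conceptual information is injected into a VLN model, correlations among many concepts are learned. The learned correlations benefit vision-language alignment, especially for tasks with high-level instructions.

KERM fuses region-centric knowledge with the navigation views. For each navigable candidate it retrieves facts (knowledge described in language) that complement the visual content, in order to obtain better action prediction.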

Embodied Task Planning with Large Language Models

Posted on 2023-07-09 Edited on 2023-08-30 In paper

TaPA

Embodied Task Planning with Large Language Models

Embodied task planning based on LLMs

Motivation

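Because data samples are limited and downstream applications are diverse, directly training the same agent across different deployment environments is impractical. LLMs can provide agents with rich semantic knowledge for plan generation in complex tasks, but LLMs cannot perceive their surroundings and lack real-world information, so they often produce action sequences that cannot be executed.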

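The "non-executable actions" this paper focuses on are mainly actions from the LLM that mention objects that do not exist. For example, given the human instruction "Give me some wine", GPT-3.5 produces the action step "pouring wine from the bottle to the glass", but the actual scene may contain no "glass", only a "mug". The instruction that can actually be executed is "pouring wine from the bottle to the mug".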
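
The failure case above can be checked mechanically. The sketch below is only an illustration, not TaPA's method: the function name `missing_objects` and the small object vocabulary are hypothetical, and matching is done on whole lowercase words.

```python
import re


def missing_objects(plan_steps, scene_objects, known_objects):
    """Return known object names that a plan mentions but the scene lacks."""
    scene = {name.lower() for name in scene_objects}
    missing = []
    for step in plan_steps:
        words = set(re.findall(r"[a-z]+", step.lower()))
        for name in known_objects:
            name = name.lower()
            if name in words and name not in scene and name not in missing:
                missing.append(name)
    return missing


known = ["wine", "bottle", "glass", "mug"]  # hypothetical object vocabulary
scene = ["wine", "bottle", "mug"]           # objects actually present in the scene

bad_plan = ["pouring wine from the bottle to the glass"]
good_plan = ["pouring wine from the bottle to the mug"]

print(missing_objects(bad_plan, scene, known))   # ['glass']
print(missing_objects(good_plan, scene, known))  # []
```

A plan with a non-empty result mentions at least one object the agent cannot find, so it is not executable in that scene.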

Construct a multimodal dataset containing triplets of indoor scenes, instructions and action plans
- The designed prompts plus the list of objects present in the scene are fed to GPT-3.5 to obtain instructions and the corresponding action plans
The generated data is leveraged for grounded plan tuning of pre-trained LLMs
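
A minimal sketch of the data-generation step described above, under stated assumptions: the prompt wording, the function names, and the triplet field names are invented here and are not the prompts or schema used in the paper. The call to GPT-3.5 itself is left out.

```python
def build_generation_prompt(scene_objects, num_pairs=3):
    """Combine a (hypothetical) designed prompt with the scene's object list."""
    object_list = ", ".join(sorted(scene_objects))
    return (
        "You are given the objects present in an indoor scene.\n"
        f"Objects: [{object_list}]\n"
        f"Write {num_pairs} instructions that can be completed using only "
        "these objects, and give a step-by-step action plan for each."
    )


def make_triplet(scene_objects, instruction, action_plan):
    """Package one (scene, instruction, action plan) training example."""
    return {
        "scene_objects": sorted(scene_objects),
        "instruction": instruction,
        "action_plan": list(action_plan),
    }


prompt = build_generation_prompt({"wine", "bottle", "mug"})
print(prompt)

# The instruction and plan would come from the LLM's response to `prompt`;
# they are written by hand here.
triplet = make_triplet(
    {"wine", "bottle", "mug"},
    "Give me some wine",
    ["pouring wine from the bottle to the mug"],
)
print(triplet)
```

Listing only the objects that exist in the scene is what keeps the generated plans grounded.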

Related Work

Receiving information from the surrounding environment

[37] A persistent spatial semantic representation for high-level natural language instruction execution
[38] Piglet: Language grounding through neuro-symbolic interaction in a 3d world
[39] Grounding language to autonomously-acquired skills via goal generation

prompt engineering

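Simple examples of task instructions and corresponding actions are designed to prompt LLMs into producing plausible task plans, and the executable subset of plans is selected by constructing a mapping based on semantic similarity.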

[40] Language models as zero-shot planners: Extracting actionable knowledge for embodied agents
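
The semantic-similarity mapping can be pictured as matching each free-form step to its closest admissible action. The sketch below is a stand-in only: it uses bag-of-words cosine similarity instead of learned sentence embeddings, and the action list, the threshold, and the function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word occurrences in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_to_admissible(step, admissible_actions, threshold=0.5):
    """Return (closest admissible action, score), or (None, score) below the threshold."""
    query = bag_of_words(step)
    scored = [(cosine(query, bag_of_words(action)), action) for action in admissible_actions]
    score, best = max(scored, key=lambda pair: pair[0])
    return (best, score) if score >= threshold else (None, score)


actions = ["pour wine into mug", "open fridge", "pick up bottle"]  # hypothetical
print(map_to_admissible("pour the wine into the mug", actions))
print(map_to_admissible("dance on the table", actions))
```

Steps that fall below the threshold have no close admissible action and are dropped, which is one simple way to keep only an executable subset.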

Generating decisions for complex tasks based on action feedback

[41] Pre-trained language models for interactive decision-making

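SayCan and LLM-Planner both provide visual information to LLMs by extracting latent features of the scene or object names.

SayCan can only complete tasks in kitchen scenes.

LLM-Planner is implemented in the ALFRED simulator, where most tasks are very simple, such as putting and placing.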

[14] Do as i can, not as i say: Grounding language in robotic affordances
[15] Llm-planner: Few-shot grounded planning for embodied agents with large language models

Methodology

Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

Posted on 2023-07-08 Edited on 2023-07-18 In paper

VLN survey

What is an intelligent Embodied AI?
- Understand information from Vision, Text, Audio, etc.
- Make the correct action in the environment.
What can we do to build it from Machine Learning perspective?
- Benchmark:
  - Environment as real as possible.
  - Communication like human.
- Methods:
  - Connecting advancement within single modality.
  - Specific solution.
  - Training sources.

Two major challenges: data scarcity and poor model generalization

Definition

Key elements: Environment, Agent, Oracle (Human)
Environment
- Photorealistic
- 3D
- Navigable
Interaction between Agent and Environment
- Environment renders new observation after each action.
- Agent navigates in the environment, with possible object manipulation.
Communication between Oracle and Agent
- Use Natural Language
- Task Instruction
- Dialog ability

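The agent and the oracle communicate in natural language: the agent can request guidance, and the oracle can respond. The agent navigates and interacts with the environment to complete the task based on the instructions it receives and the environment it observes. Meanwhile, the oracle observes the environment and the agent's state, and can help the agent by interacting with the environment.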
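
The interaction described above can be written as a small loop. The sketch below is a toy under stated assumptions: a 1-D corridor stands in for a photorealistic 3D environment, and the class names, the hint format, and the fixed "move left/right" vocabulary are all hypothetical.

```python
class Oracle:
    """Knows the goal; gives the task instruction and answers help requests."""

    def __init__(self, goal):
        self.goal = goal

    def instruct(self):
        return f"go to position {self.goal}"

    def respond(self, position):
        return "move right" if position < self.goal else "move left"


class Environment:
    """A 1-D corridor that renders a new observation after each action."""

    def __init__(self, length, start):
        self.length = length
        self.position = start

    def observe(self):
        return {"position": self.position}

    def step(self, action):
        delta = {"right": 1, "left": -1}[action]
        self.position = max(0, min(self.length - 1, self.position + delta))
        return self.observe()


class Agent:
    """Parses the instruction and turns oracle hints into actions."""

    def __init__(self, instruction):
        self.goal = int(instruction.rsplit(" ", 1)[1])

    def is_done(self, observation):
        return observation["position"] == self.goal

    def act(self, hint):
        return hint.split()[-1]  # "move right" -> "right"


def run_episode(length=6, start=0, goal=4, max_steps=20):
    oracle = Oracle(goal)
    env = Environment(length, start)
    agent = Agent(oracle.instruct())                    # task instruction
    observation = env.observe()
    trajectory = [observation["position"]]
    for _ in range(max_steps):
        if agent.is_done(observation):
            break
        hint = oracle.respond(observation["position"])  # agent requests guidance
        observation = env.step(agent.act(hint))         # new observation is rendered
        trajectory.append(observation["position"])
    return trajectory


print(run_episode())  # [0, 1, 2, 3, 4]
```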

A Comprehensive Survey on Segment Anything Model for Vision and Beyond

Posted on 2023-07-07 Edited on 2023-07-06 In paper

OvarNet: Towards Open-vocabulary Object Attribute Recognition

Posted on 2023-07-06 Edited on 2023-07-16 In paper

OvarNet

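Open-vocabulary Object Attribute Recognition (open-vocabulary object detection and attribute recognition)

A single model simultaneously localizes, classifies, and predicts attributes for objects of any category in an image.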

Motivation

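Predicting object attributes is a valuable complement to visual scene understanding, and training the two tasks jointly works better than treating them separately.

Three challenges:

Existing foundation models such as CLIP are trained on image-caption pairs, so the learned representations are biased toward object categories rather than object attributes. This causes a misalignment when CLIP is used directly for attribute prediction.
There is no ideal training set annotated with all three of bounding boxes, semantic categories, and object attributes. Only the COCO Attributes dataset provides such annotations, but it is limited by a small vocabulary (196 attributes, 29 categories).
Unifying these three tasks in a single framework under the open-vocabulary setting has not yet been well explored.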
