HomeRobot: Open-Vocabulary Mobile Manipulation

Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment, and placing it in a commanded location.

OVMM涉及到了感知、语言理解、导航和操作

an agent navigates household environments to grasp novel objects and place them on target receptacles

“Move (object) from the (start_receptacle) to the (goal_receptacle).”

机器人初始化在未知环境的home environment中，给定object，起始地和目的地，机器人实现对object的搬运：

finding the start_receptacle with the object
picking up the object
finding the goal_receptacle
placing the object on the goal_receptacle.

Open Vocabulary Navigation

[9] Open-vocabulary queryable scene representations for real world planning. arXiv, 2022.

[15] Conceptfusion: Open-set multimodal 3d mapping. arXiv, 2023.

[11] Clip-fields: Weakly supervised semantic fields for robotic memory. arXiv, 2022.

[12] Usa-net: Unified semantic and affordance representations for robot memory. arXiv, 2023.

[16] Beyond the nav-graph: Vision and language navigation in continuous environments. In European Conference on Computer
Vision (ECCV), 2020.

fully open-vocabulary scene representation

[11] Clip-fields: Weakly supervised semantic fields for robotic memory. arXiv, 2022.

[12] Usa-net: Unified semantic and affordance representations for robot memory. arXiv, 2023.

[15] Conceptfusion: Open-set multimodal 3d mapping. arXiv, 2023.

object-goal navigation [1, 2], skill learning [24], continual learning [61], and image instance navigation [25]

[24] Spatial-language attention policies for efficient robot learning. arXiv, 2023.

[25] Navigating to objects specified by images. arXiv, 2023

[61] Evaluating continual learning on a home robot, 2023.

[1] Objectnav revisited: On evaluation of embodied agents navigating to objects. arXiv, 2020.

Simulation Data

部分object在训练阶段不可见，所有放置物体的容器在训练阶段都是可见的，但是会在测试时放在不同的地方

Habitat Synthetic Scenes Dataset (HSSD)：
- [19] Habitat Synthetic Scenes Dataset: An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation. arXiv, 2023
- 200+Scenes，18K objects
- 本文使用HSSD的子集，由60个场景组成，38个用于训练，12用于验证和10用于测试
总共注释了129个类别的2535个对象，标注了HSSD数据集中出现的21个不同类别的receptacle,并对receptacle进行了进一步标注
Episode Generation Details
- 生成一系列candidate viewpoints，每个viewpoints于一个特定的start_receptacle或goal_receptacle相关联，也就是代表了机器人可以看见receptacle的一个邻近位置（小于1.5米）
  
  灰色代表可导航区域，大红点和黑框分别是对象的中心和边界框。周围的点是candidate viewpoints：红点被拒绝是因为它们不可导航，而蓝点被拒绝是因为它们离对象太远。绿色是the final set of viewpoints。

objects categories

Baseline Agent Implementation

OVMMAgent

将完成OVMM任务需要的技能分解为：

FindObj/FindRec: 对物体和容器进行定位
Gaze: 移动到足够接近对象以抓住它，并调整头部方向以获得对象的良好视图。凝视的目的是提高抓取的成功率。
Grasp: 捡起物体，只提供了high-level的action
Place: Move to a location in the environment and place the object on top of the goal_receptacle.

heuristic baseline

use a well known motion planning technique [2] and simple rules to execute grasping and manipulation actions

[2] Navigating to objects in the real world. arXiv, 2022.

reinforcement learning baseline

learn exploration and manipulation skills using an off-the-self policy learning algorithm, DDPPO

While RGB is available in our simulation, our baseline policies do not directly utilize it; instead, they rely on predicted segmentation from Detic at test time

Action Space Implementation

In simulation, this is implemented as a check against the navmesh - we use the navmesh to determine if the robot will go into collision with any objects if moved towards the new location, and move it to the closest valid location instead.

HomeRobot Implementation Details

three different repositories within the open-source HomeRobot library:

home_robot: Shared components such as Environment interfaces, controllers, detection and segmentation modules.
home_robot_sim: Simulation stack with Environments based on Habitat.
home_robot_hw: Hardware stack with server processes that runs on the robot, client API that runs on the GPU workstation, and Environments built using the client API.

Pose Infomation

We get the global robot pose from Hector SLAM [99] on the Hello Robot Stretch [22], which is used when creating 2d semantic maps for our model-based navigation policies.

Heuristic Grasping Policy

启发式

Heuristic Placement Policy

Semantic Mapping Module和20年文章一样

使用启发式的探索策略

Navigation Planner使用Fast Marching Method

由于较远时可能观察不到要找的小物体，因此在active neural mapping的planning policy基础上添加了两条规则：

开放类别意味着更难实现抓取和避免碰撞
Fast Marching Method输出轨迹的是离散的，且没有考虑朝向
无法处理运动障碍物
将robot视作了一个圆柱体

Metrics

Simulation Success Metrics

Results

失败分析

不同阶段所用的时间步数

Limitation

仿真中没有物理模拟抓取
We consider full natural language queries out-of-scope
we do not implement many motion planners in HomeRobot, or task-and-motion-planning with replanning, as would be ideal

Yixuan Pan's Blog

HomeRobot: Open-Vocabulary Mobile Manipulation

HomeRobot: Open-Vocabulary Mobile Manipulation

Simulation Data