ConceptFusion

https://concept-fusion.github.io/

Motivation

建立环境的3D map是许多机器人任务的核心，大部分方法建图时都受限于：

Closed-set setting，仅能处理训练时预定好的类别
只能由class name或text prompt进行查询（单模态）

ConceptFusion的三个贡献：
open-set multimodal 3D mapping that constructs map representations queryable by text, image, audio, and click queries in a zero-shot manner
A novel mechanism to compute pixel-aligned (local) features from foundation models that can only generate image-level (global) feature vectors
A new RGB-D dataset, UnCoCo, to evaluate open-set multimodal 3D mapping. UnCoCo comprises 78 common household/office objects tagged with more than 500K queries across modalities

CLIP将图像整体和文本对齐

LSeg, OpenSeg, OVSeg等工作则是通过训练或微调来实现更精细的区域-文本对齐，可能导致不常见类的遗忘

MaskCLIP为了保留CLIP的知识，提出了zero-shot的方式，但在对象边界和长尾概念上表现不佳

VLMaps[44] Visual language maps for robot navigation. ICRA 2023

LM-Nav[45] Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action. In CoRR, 2022

CoWs[46] Clip on wheels: Zero-shot object navigation as object localization and exploration. 2022

NLMap-Saycan[47] Open-vocabulary queryable scene representations for real world planning. 2022

OpenScene[48] Open-scene: 3d scene understanding with open vocabularies. 2022

[49] Language-driven open-vocabulary 3d scene understanding.

[50] Feature-realistic neural fusion for real-time, open set scene understanding

[51] Semantic abstraction: Open-world 3D scene understanding from 2D vision-language models. In Proceedings of the 2022 Conference on Robot Learning, 2022.

Methodology

Fusing pixel-aligned foundation features to 3D

Map representation

使用点云来表示3D map，每个点的属性包括：

vertex position
normal vector
confidence count
3D color vector
concept vector

Frame preprocessing

在获得一帧RGB-D输入后，首先计算vertex-normal maps并估计相机位姿，然后为每个像素点都计算semantic context embedding。

Yixuan Pan's Blog

ConceptFusion: Open-set Multimodal 3D Mapping

ConceptFusion

Motivation

Methodology

Fusing pixel-aligned foundation features to 3D

Map representation

Frame preprocessing

Feature fusion

ConceptFusion

Motivation

Related Work

Methodology

Fusing pixel-aligned foundation features to 3D

Map representation

Frame preprocessing

Feature fusion