3D-LLM
3D-LLM: Injecting the 3D World into Large Language Models
MLLMs such as Flamingo and BLIP-2 can understand and reason over 2D images, but they are not grounded in the 3D physical world, so they cannot exploit the rich information available in 3D, such as spatial relationships, physics, and interactions.
Taking a 3D representation of the scene as input gives the 3D-LLM two advantages:
- Long-term memory about the entire scene can be stored in a holistic 3D representation, rather than in fragmented partial-view observations
- 3D properties (such as affordances and spatial relationships) can be reasoned about directly from the 3D representation, well beyond what language-only or 2D-image-based LLMs can do
Contributions of this paper:
- 3D data paired with language descriptions is hard to obtain, so the paper proposes a data-generation pipeline that produces large-scale paired 3D-language data, yielding ~300K 3D-language examples covering multiple tasks
- How to extract 3D features that align with language features?
  - Construct 3D features from the features of rendered 2D multi-view images
  - Converting 3D to 2D this way lets off-the-shelf VLMs such as BLIP-2 be reused
- How to perceive 3D spatial relationships?
  - A 3D localization mechanism that bridges the gap between language and spatial locations
Related Work
- Perceiver [2]
- Q-Former [31]
- 3D and language [5, 49, 8, 20, 1, 15, 24, 49, 3, 21, 19]
Methodology
LLMs are used to collect the 3D-language data.
A text-only GPT generates the data with three types of prompts:
- boxes-demonstration-instruction based prompting:
  - The bounding boxes (AABB, axis-aligned) of the rooms and objects in the 3D scene are given as input to provide the scene's semantics and spatial layout; varied instructions plus 0-3 few-shot demonstration examples generate diverse kinds of data (see the sketch after this item)
  - How are the bboxes represented and the axes defined? Check the real prompt templates (appendix)
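A minimal sketch of how such a prompt might be assembled; the bbox format, instruction wording, and function names are assumptions, not the paper's actual templates (those are in the appendix):

```python
# Hypothetical sketch of boxes-demonstration-instruction based prompting.
# The bbox layout and instruction text are illustrative assumptions.

def format_bbox(name, box):
    # box: (xmin, ymin, zmin, xmax, ymax, zmax) in an axis-aligned frame
    return f"{name}: [{', '.join(f'{v:.2f}' for v in box)}]"

def build_prompt(objects, instruction, demonstrations=()):
    """objects: list of (name, AABB); demonstrations: 0-3 few-shot examples."""
    lines = ["Scene objects with axis-aligned bounding boxes:"]
    lines += [format_bbox(name, box) for name, box in objects]
    lines += [f"Example: {d}" for d in demonstrations]
    lines.append(f"Instruction: {instruction}")
    return "\n".join(lines)

print(build_prompt(
    objects=[("sofa", (0.1, 0.0, 0.2, 1.4, 0.8, 0.9)),
             ("table", (1.6, 0.0, 0.3, 2.2, 0.5, 1.0))],
    instruction="Describe the spatial layout of this room.",
))
```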
- ChatCaptioner based prompting:
  - Images of different views are fed to BLIP-2; ChatGPT asks questions and BLIP-2 answers them (a sketch of the loop follows this item)
  - [52] ChatGPT asks, BLIP-2 answers: Automatic questioning towards enriched visual descriptions, 2023.
  - How do "far right" and "near left" in the examples correspond to the multi-view images?
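A rough sketch of the ChatGPT-asks / BLIP-2-answers loop of [52]; `ask_question` and `answer_about_view` are hypothetical stand-ins for the actual model calls, and how multi-view answers are aggregated is an assumption:

```python
def chat_captioner(views, ask_question, answer_about_view, n_rounds=5):
    """ChatCaptioner-style loop [52] (sketch): ChatGPT proposes questions,
    BLIP-2 answers them from rendered views, and the dialog accumulates
    into a rich description. views: images rendered from different poses."""
    dialog = []
    for _ in range(n_rounds):
        question = ask_question(dialog)            # ChatGPT asks
        answers = [answer_about_view(v, question)  # BLIP-2 answers per view
                   for v in views]
        dialog.append((question, answers))
    return dialog
```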
- Revision based prompting:
  - Used to convert between different types of 3D data, e.g., given a caption, generate QA pairs based on it
Main datasets used:
- Objaverse: 800K 3D objects; ChatCaptioner generates 3D-related descriptions for them
- ScanNet: 1K 3D indoor scenes; it provides semantics and bounding boxes of the objects in the scenes
- Habitat-Matterport (HM3D): HM3DSem [47] further adds semantic annotations and bounding boxes for more than 200 HM3D scenes
3D-LLM
3D Feature Extractor
Build 3D features that are aligned with language features.
First, extract pixel-aligned dense features from the rendered images (following [26] ConceptFusion: Open-set multimodal 3D mapping).
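ConceptFusion's actual pipeline is more involved; as a rough stand-in, a common way to get pixel-aligned dense features is to upsample a 2D backbone's feature map back to pixel resolution (the Conv2d below is only a placeholder for the real extractor):

```python
import torch
import torch.nn.functional as F

# Placeholder backbone; ConceptFusion builds its dense features from
# CLIP-style models, not a single conv layer.
backbone = torch.nn.Conv2d(3, 1408, kernel_size=16, stride=16)

def pixel_aligned_features(image):
    """image: (3, H, W) rendered view -> (1408, H, W) pixel-aligned features."""
    feat = backbone(image.unsqueeze(0))            # (1, 1408, H/16, W/16)
    feat = F.interpolate(feat, size=image.shape[1:],
                         mode="bilinear", align_corners=False)
    return feat.squeeze(0)
```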
Then build 3D features from the rendered-image features with one of three methods:
Direct Reconstruction: directly reconstruct the point cloud from the RGB-D images (rendered from the 3D data) using the ground-truth camera matrices; the pixel features are mapped directly onto the 3D points.
This method is suitable for rendered RGB-D data with perfect camera poses and intrinsics.
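A minimal sketch of the back-projection, assuming an OpenCV-style pinhole intrinsics matrix; this is generic RGB-D lifting, not the paper's exact code:

```python
import torch

def backproject(depth, K, c2w, feats):
    """Lift one RGB-D rendering to a point cloud and attach pixel-aligned
    features to the points. depth: (H, W); K: (3, 3) pinhole intrinsics;
    c2w: (4, 4) camera-to-world matrix; feats: (D, H, W) from the same view."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.flatten()
    x = (u.flatten() - K[0, 2]) * z / K[0, 0]
    y = (v.flatten() - K[1, 2]) * z / K[1, 1]
    pts_cam = torch.stack([x, y, z, torch.ones_like(z)])  # (4, H*W) homogeneous
    pts_world = (c2w @ pts_cam)[:3].T                     # (H*W, 3)
    return pts_world, feats.flatten(1).T                  # points, (H*W, D)
```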
Feature Fusion: use gradslam to fuse the 2D features into 3D maps that combine depth, color, and features.
- [26] ConceptFusion: Open-set multimodal 3D mapping
- [28] gradslam: Dense SLAM meets automatic differentiation. arXiv, 2020.
This method is suitable for 3D data with noisy depth-map renderings, or noisy camera poses and intrinsics.
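gradslam's real API differs; as a simplified stand-in, the sketch below just averages the per-point features of all views that land in the same voxel:

```python
import torch

def fuse_views(views, voxel_size=0.05):
    """Simplified stand-in for gradslam-style fusion [28]: average features
    of all back-projected points falling into the same voxel. The real
    pipeline also fuses depth and color with differentiable SLAM.
    views: list of (points (N, 3), feats (N, D)) per frame."""
    table = {}
    for pts, feats in views:
        keys = torch.floor(pts / voxel_size).long().tolist()
        for key, f in zip(map(tuple, keys), feats):
            acc = table.setdefault(key, [torch.zeros_like(f), 0])
            acc[0] += f
            acc[1] += 1
    coords = torch.tensor(list(table.keys()), dtype=torch.float) * voxel_size
    fused = torch.stack([s / n for s, n in table.values()])
    return coords, fused  # (M, 3) voxel coords, (M, D) fused features
```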
Neural Field: build the 3D representation as a neural voxel field in which every voxel has a density, color, and feature; an MSE loss then aligns the 3D features along each ray with the 2D features of the corresponding pixel.
- [20] 3D concept learning and reasoning from multi-view images, 2023.
This method is for 3D data that has RGB renderings but no depth, and noisy camera poses and intrinsics.
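A minimal sketch of the ray-level alignment, using standard NeRF-style volume rendering weights; the voxel-field interpolation and exact loss details are omitted:

```python
import torch
import torch.nn.functional as F

def render_ray_feature(densities, feats, deltas):
    """Composite per-sample features along one ray with NeRF-style weights.
    densities: (S,); feats: (S, D); deltas: (S,) distances between samples."""
    alpha = 1.0 - torch.exp(-densities * deltas)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha]), dim=0)[:-1]
    weights = alpha * trans                       # contribution of each sample
    return (weights[:, None] * feats).sum(dim=0)  # (D,) rendered ray feature

# Alignment: MSE between the rendered 3D feature of a ray and the
# pixel-aligned 2D feature of the pixel that ray passes through, e.g.
# loss = F.mse_loss(render_ray_feature(d, f, dt), pixel_feature)
```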
With any of these methods, we obtain a 3D scene representation of shape $N \times D_v$ ($N$ points, each with a $D_v$-dimensional feature).
Training 3D-LLMs
Pretrained 2D VLMs serve as the backbone for initialization instead of training from scratch.
The Perceiver architecture bridges the modality gap (the Perceiver accepts inputs of any size): the 3D point features are treated like flattened image features and fed into the Perceiver, which outputs features of a fixed size.
- Perceiver: General perception with iterative attention. In International Conference on Machine Learning, 2021.
We use pretrained 2D VLMs as our backbones and input the aligned 3D features to train 3D-LLMs with our collected 3D-language dataset.
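A minimal sketch of the Perceiver-style bridge: learned latents cross-attend to the flattened point features, so a variable-length $N \times D_v$ input maps to a fixed number of tokens (sizes here are illustrative; the actual models reuse BLIP-2's Q-Former or Flamingo's perceiver resampler):

```python
import torch
import torch.nn as nn

class LatentResampler(nn.Module):
    """Perceiver-style bridge (sketch): learned latents cross-attend to the
    flattened (N, D_v) point features, giving a fixed-size output for any N."""
    def __init__(self, d_model=1408, n_latents=32, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, point_feats):               # (B, N, D_v), N may vary
        q = self.latents.unsqueeze(0).expand(point_feats.shape[0], -1, -1)
        out, _ = self.attn(q, point_feats, point_feats)
        return out                                # (B, n_latents, D_v)

tokens = LatentResampler()(torch.randn(2, 5000, 1408))  # -> (2, 32, 1408)
```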
3D Localization Mechanism
Sin/cos position embeddings of the x, y, and z dimensions are concatenated to the 3D features, with each axis's embedding being $D_v/3$-dimensional.
The region to be grounded is represented as a sequence of discrete tokens for its bounding box in AABB form.
The continuous corner coordinates of the bounding box are uniformly discretized into voxel integers that serve as the location tokens $\langle x_{\min}, y_{\min}, z_{\min}, x_{\max}, y_{\max}, z_{\max} \rangle$; a sketch of both parts follows below.
After adding these location tokens, the weights for them in the input and output embeddings of the language model are unfrozen.
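A minimal sketch of the localization mechanism; the frequency schedule of the embedding and the voxel-grid resolution (`n_voxels`) are assumptions:

```python
import torch

def sincos_embed(coords, dim):
    """sin/cos embedding of one axis; dim = D_v / 3 per axis (must be even).
    The frequency schedule here is an assumption."""
    freqs = torch.exp(torch.arange(0, dim, 2).float() * (-4.0 / dim))
    angles = coords[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=1)          # (N, dim)

def add_position_embedding(points, feats, d_pe):
    """Concatenate per-axis embeddings to the 3D features.
    points: (N, 3); feats: (N, D_v) -> (N, D_v + 3 * d_pe)."""
    pe = torch.cat([sincos_embed(points[:, i], d_pe) for i in range(3)], dim=1)
    return torch.cat([feats, pe], dim=1)

def bbox_to_location_tokens(box, scene_min, scene_max, n_voxels=256):
    """Uniformly discretize the continuous AABB corners into voxel integers,
    yielding the six location tokens <xmin, ymin, zmin, xmax, ymax, zmax>."""
    t = (torch.tensor(box).view(2, 3) - scene_min) / (scene_max - scene_min)
    return (t * (n_voxels - 1)).round().long().flatten().tolist()
```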
Experiment
Training runs on 8×A100 GPUs.
Inference runs on 8×8 V100 GPUs.
The LLM is frozen except for the newly added location tokens; for BLIP-2, the Q-Former part is fine-tuned. The 3D features are 1408-dimensional.
Training mixes the held-in datasets of all tasks and uses the standard language-modeling loss.
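The standard language-modeling loss over the mixed held-in data is presumably the usual shifted cross-entropy; a minimal sketch (the -100 masking convention for prompt/padding tokens is an assumption):

```python
import torch.nn.functional as F

def lm_loss(logits, labels):
    """Standard LM loss: predict token t+1 from tokens <= t.
    logits: (B, T, V); labels: (B, T), with -100 on positions to ignore."""
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           labels[:, 1:].reshape(-1), ignore_index=-100)
```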
Object Navigation
Navigation is modeled as a dialog. At each time step, 3D features are built online from the partially observed scene; these features, the agent's current position, and the history of visited positions are fed to the 3D-LLM, which predicts a 3D waypoint (where the agent should go next). DDPPO then outputs the low-level actions, and the 3D-LLM predicts 'stop' once the target is reached.
Experiments are conducted on HM3D and Habitat.
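A sketch of the dialog-style navigation loop described above; `env`, `build_3d_features`, `predict_waypoint` (the 3D-LLM), and `ddppo_step` are hypothetical stand-ins for the environment and model APIs:

```python
def navigate(env, goal, build_3d_features, predict_waypoint, ddppo_step,
             max_steps=500):
    """At each step: build 3D features online from the partially observed
    scene, let the 3D-LLM predict the next waypoint (or 'stop'), then let
    the low-level DDPPO policy walk to that waypoint."""
    history, obs = [], env.reset()
    for _ in range(max_steps):
        feats = build_3d_features(obs)
        waypoint = predict_waypoint(feats, env.agent_position, history, goal)
        if waypoint == "stop":                 # 3D-LLM signals arrival
            break
        history.append(env.agent_position)
        obs = ddppo_step(env, waypoint)        # low-level actions to waypoint
    return history
```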