
3D-LLM: Injecting the 3D World into Large Language Models

MLLMs such as Flamingo and BLIP-2 can understand and reason over 2D images, but they are not grounded in the 3D physical world, so they cannot exploit the rich information available in 3D, such as spatial relations, physics, and interaction.

Taking a 3D representation of the scene as input gives 3D-LLM two advantages:

  • Long-term memory about the entire scene can be stored in the holistic 3D representation, instead of in episodic partial-view observations.
  • 3D properties (such as affordances and spatial relations) can be reasoned about from the 3D representation, far beyond what language-only or 2D-image-based LLMs can do.

Contributions of the paper:

  • 3D data paired with language descriptions is hard to obtain, so the paper proposes a data-generation pipeline that produces large-scale 3D-language data, yielding about 300k 3D-language pairs covering multiple tasks.
  • How to extract 3D features that can be aligned with language features?
    • Construct 3D features from the features of rendered 2D multi-view images.
    • By mapping 3D to 2D in this way, off-the-shelf VLMs such as BLIP-2 can be used directly.
  • How to perceive 3D spatial positions and relations?
    • A 3D localization mechanism that bridges the gap between language and spatial locations.

[figure: image-20230725122839668]

Related work pointers:

  • Perceiver [2]
  • Q-Former [31]
  • 3D and language [5, 49, 8, 20, 1, 15, 24, 49, 3, 21, 19]

Methodology

The 3D-language data is collected with the help of LLMs.

A text-only GPT is prompted in three ways to generate the data:

  • Boxes-demonstration-instruction based prompting:
    • Input the bounding boxes (axis-aligned, AABB) of the objects in the 3D scene to provide the semantics and spatial layout of the scene; use different instructions to generate diverse data, with 0-3 few-shot demonstration examples (a minimal prompt-assembly sketch follows this list).
    • How exactly are the bboxes represented, and how are the axes defined? Need to check the actual prompt template (appendix).
  • ChatCaptioner based prompting:
    • Feed images from different views into BLIP-2; ChatGPT asks questions and BLIP-2 answers.
    • [52] ChatGPT asks, BLIP-2 answers: Automatic questioning towards enriched visual descriptions, 2023.
    • How do "far right" / "near left" in the example correspond to the multi-view images?
  • Revision based prompting:
    • Used to convert between different types of 3D data, e.g., given a caption, generate QA pairs based on it.
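
A minimal, hypothetical sketch of how a boxes-demonstration-instruction prompt could be assembled from the objects' AABBs; the real template is in the paper's appendix, and the serialization format, field names, and instruction text below are illustrative assumptions:

```python
# Hypothetical prompt assembly for boxes-demonstration-instruction prompting.
# The exact template is in the paper's appendix; everything below is illustrative.

def format_scene(boxes):
    """Serialize axis-aligned bounding boxes (AABB) into a text block.

    boxes: list of (label, (xmin, ymin, zmin, xmax, ymax, zmax)) tuples.
    """
    lines = []
    for label, (xmin, ymin, zmin, xmax, ymax, zmax) in boxes:
        lines.append(f"{label}: [{xmin:.2f}, {ymin:.2f}, {zmin:.2f}, "
                     f"{xmax:.2f}, {ymax:.2f}, {zmax:.2f}]")
    return "\n".join(lines)


def build_prompt(boxes, instruction, demonstrations=()):
    """Assemble the prompt: 0-3 few-shot demonstrations, the scene described
    by its object AABBs, and the task instruction."""
    parts = list(demonstrations)
    parts.append("Scene objects (axis-aligned boxes):\n" + format_scene(boxes))
    parts.append("Instruction: " + instruction)
    return "\n\n".join(parts)


if __name__ == "__main__":
    scene = [("sofa", (0.1, 0.0, 0.5, 1.9, 0.8, 1.4)),
             ("table", (2.0, 0.0, 0.6, 2.8, 0.5, 1.3))]
    print(build_prompt(scene, "Generate three QA pairs about spatial relations."))
```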

[figure: image-20230725125010914]

Main datasets used:

  • Objaverse: ChatCaptioner is used to generate 3D-related descriptions for 800K Objaverse 3D objects.
  • ScanNet: 1k 3D indoor scenes; it provides semantics and bounding boxes of the objects in the scenes.
  • Habitat-Matterport (HM3D): HM3DSem [47] further adds semantic annotations and bounding boxes for more than 200 HM3D scenes.

3D-LLM

[figure: image-20230725133242843]

3D Feature Extractor

Build 3D features that are aligned with language features.

  • First, extract pixel-aligned dense features from the rendered images (following [26] ConceptFusion: Open-set multimodal 3D mapping).

  • Then use one of three methods to lift the rendered-image features into 3D features:

    • Direct Reconstruction: reconstruct the point cloud directly from the RGB-D images rendered from the 3D data using the ground-truth camera matrices, and map the pixel features straight onto the 3D points (a minimal unprojection sketch follows this list).

      This method is suitable for rendered RGB-D data with perfect camera poses and intrinsics.

    • Feature Fusion: use gradslam to fuse the 2D features into 3D maps that combine depth, color, and features.

      • [26] ConceptFusion: Open-set multimodal 3D mapping
      • [28] gradslam: Dense SLAM meets automatic differentiation. arXiv, 2020.

      This method is suitable for 3D data with noisy depth-map renderings, or noisy camera poses and intrinsics.

    • Neural Field: build the 3D representation with a neural voxel field in which each voxel has a density, a color, and a feature; the 3D features along each ray are aligned with the 2D pixel features using an MSE loss.

      • [20] 3D concept learning and reasoning from multi-view images, 2023.

      This method is for 3D data with RGB renderings but no depth data, and noisy camera poses and intrinsics.

  • With any of these methods, we obtain an $N \times D_v$ 3D scene representation ($N$ points, each with a $D_v$-dimensional feature).
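
A minimal sketch of the Direct Reconstruction option, assuming perfect camera intrinsics and extrinsics as stated above: every valid depth pixel is back-projected to a world-space point and keeps its pixel-aligned 2D feature. Shapes and names are illustrative, not the paper's code; points from all rendered views would then be merged into the $N \times D_v$ representation.

```python
# Sketch of Direct Reconstruction: unproject RGB-D pixels to 3D and attach
# the pixel-aligned features. Assumes a pinhole camera with a known pose.
import numpy as np

def unproject_view(depth, feat2d, K, cam2world):
    """depth: (H, W) in metres; feat2d: (H, W, Dv) pixel-aligned features;
    K: (3, 3) intrinsics; cam2world: (4, 4) extrinsics.
    Returns points (M, 3) in world coordinates and features (M, Dv)."""
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    z = depth.reshape(-1)
    valid = z > 0                                    # skip pixels with no depth
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]      # pinhole back-projection
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    pts_world = (cam2world @ pts_cam.T).T[:, :3]     # camera frame -> world frame
    feats = feat2d.reshape(-1, feat2d.shape[-1])[valid]
    return pts_world, feats
```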

Training 3D-LLMs

Pretrained 2D VLMs are used as backbones for initialization, rather than training from scratch.

The Perceiver architecture is used to bridge the modalities (Perceiver can handle inputs of arbitrary size): the 3D point-cloud features are treated as a flattened image and fed into the Perceiver, which outputs features of a fixed size (a minimal resampler sketch is given below).

  • Perceiver: General perception with iterative attention. In International Conference on Machine Learning, 2021.

We use pretrained 2D VLMs as our backbones and input the aligned 3D features to train 3D-LLMs with our collected 3D-language dataset.
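
A minimal sketch of a latent cross-attention resampler in the spirit of the Perceiver / Q-Former step described above: a fixed set of learned query tokens attends to the variable-length set of point features. The dimensions, layer count, and module names are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentResampler(nn.Module):
    """Map a variable-length set of 3D point features (B, N, dim_in) to a
    fixed number of latent tokens (B, num_latents, dim)."""

    def __init__(self, dim_in=1408, dim=768, num_latents=32, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.proj = nn.Linear(dim_in, dim)           # project point features to model dim
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, feats3d):
        kv = self.proj(feats3d)                      # (B, N, dim), N varies per scene
        q = self.latents.unsqueeze(0).expand(feats3d.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)                # latents cross-attend to the points
        return out + self.ff(out)                    # fixed-size output

# tokens = LatentResampler()(torch.randn(2, 5000, 1408))  # -> (2, 32, 768)
```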

3D Localization Mechanism

Sin/cos positional embeddings of the x, y, z coordinates are concatenated to the 3D features; the embedding of each axis has dimension $D_v/3$.
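
A minimal sketch of this step, assuming a standard transformer-style frequency schedule (the notes do not specify one): each of the x, y, z axes gets a $D_v/3$-dimensional sin/cos embedding, and the three are concatenated.

```python
import numpy as np

def xyz_posenc(xyz, dv):
    """xyz: (N, 3) point coordinates; dv: total embedding dim (divisible by 6).
    Returns (N, dv) embeddings, dv/3 dims per axis."""
    d_axis = dv // 3
    freqs = 1.0 / (10000.0 ** (np.arange(0, d_axis, 2) / d_axis))  # (d_axis/2,)
    parts = []
    for a in range(3):                               # one block per axis
        ang = xyz[:, a:a + 1] * freqs                # (N, d_axis/2)
        parts.append(np.concatenate([np.sin(ang), np.cos(ang)], axis=1))
    return np.concatenate(parts, axis=1)             # (N, dv)

# pe = xyz_posenc(points, 384)   # then concatenate pe to the point features
```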

The region to be grounded can be represented as a sequence of discrete tokens describing its bounding box in AABB form.

The continuous corner coordinates of the bounding box are uniformly discretized to voxel integers and used as location tokens $\langle x_{\min}, y_{\min}, z_{\min}, x_{\max}, y_{\max}, z_{\max} \rangle$ (see the discretization sketch below).

After adding these additional location tokens, we unfreeze the weights for these tokens in the input and output embeddings of language models. (?)
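
An illustrative sketch of the discretization: each corner coordinate is uniformly quantized into one of a fixed number of voxel bins. The bin count, the per-scene normalization, and the function name are assumptions.

```python
def box_to_location_tokens(box, scene_min, scene_max, bins=256):
    """box: (xmin, ymin, zmin, xmax, ymax, zmax) in continuous world coords.
    Returns six integer location tokens, one per corner coordinate."""
    tokens = []
    for i, value in enumerate(box):
        lo, hi = scene_min[i % 3], scene_max[i % 3]            # axis-wise scene extent
        t = int(round((value - lo) / (hi - lo) * (bins - 1)))  # uniform quantization
        tokens.append(min(max(t, 0), bins - 1))                # clamp to the valid range
    return tokens

# box_to_location_tokens((0.1, 0.0, 0.5, 1.9, 0.8, 1.4),
#                        scene_min=(0, 0, 0), scene_max=(5, 3, 5))
```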

Experiment

Training runs on 8 × A100 GPUs; inference on 8 × 8 V100 GPUs.

The LLM is kept frozen except for the newly added location tokens; for BLIP-2, the Q-Former part is fine-tuned. The 3D features are 1408-dimensional.

During training, the held-in datasets of all tasks are mixed, and the standard language-modeling loss is used.

[figure: pipeline]

Object Navigation

The navigation process is modeled as a dialogue. At every time step, 3D features of the partially observed scene are built online; the 3D features, the agent's current position, and the history of visited positions are fed into the 3D-LLM, which predicts a 3D waypoint (the location the agent should move to next). DDPPO then outputs the low-level actions, and when the goal location is reached the 3D-LLM predicts 'stop'.
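
A schematic of this loop; `env` and the three callables are hypothetical placeholders standing in for the online feature builder, the 3D-LLM waypoint prediction, and the DDPPO low-level policy.

```python
def navigate(env, goal_text, update_features, predict_waypoint, low_level_step,
             max_steps=500):
    """Navigation-as-dialogue loop. All arguments are placeholder interfaces:
    `update_features(feats, obs)` fuses the newly observed partial view into
    the online 3D features, `predict_waypoint(...)` is the 3D-LLM call, and
    `low_level_step(env, waypoint)` is the DDPPO policy walking to the waypoint."""
    history = []                                     # positions visited so far
    feats = None
    obs = env.reset()
    for _ in range(max_steps):
        feats = update_features(feats, obs)          # online features from the partial view
        pos = env.agent_position()                   # assumed env interface
        waypoint = predict_waypoint(feats, pos, history, goal_text)
        if waypoint == "stop":                       # 3D-LLM signals the goal is reached
            return True
        obs = low_level_step(env, waypoint)          # low-level actions toward the waypoint
        history.append(pos)
    return False
```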

Experiments are conducted on HM3D and Habitat.

[figure: image-20230725170515014]