Semantic Abstraction

Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models

基于open-set vocabulary and out-of-domain visual input对3D环境进行感知和推理是机器人在3D世界中进行操作的关键技术，Semantic Abstraction (SemAbs), a framework that equips 2D Vision-Language Models (VLMs) with new 3D spatial capabilities, while maintaining their zero-shot robustness.

赋予2D VLMs 3D空间能力的同时保留2D VLMs的zero-shot robustness

completing partially observed objects
localizing hidden objects from language descriptions

novel vocabulary (i.e., object attributes, synonyms of object nouns)
visual properties (e.g. lighting, textures)
domains (e.g. sim v.s. real)

Motivation

Methodology

Abstraction via Relevancy

输入RGB-D $\mathcal{I} \in \mathbb{R}^{H\times W}$ + object class text label $\mathcal{T}$ (e.g. “biege armchair”)，输出 the 3D occupancy $\mathcal{O}$ for objects of class $\mathcal{T}$
two submodules
- The semantic-aware wrapper (Fig 3c, green background)
  - 计算relevancy map $\in \mathbb{R}^{H\times W}$，每个像素的值表示该像素对$\mathcal{T}$的VLM分类得分的贡献,relevancy map可被视为文本标签的粗略定位
    
    semantic-abstraction/generate_relevancy.py
  - 将relevancy map投影到3D，获得Relevancy Point Cloud $\mathcal{R}^{proj}=\{r_i\}_{i=1}^{H\times W}$（每个点$r_i \in \mathbb{R}^4$, a 3D location with a scalar relevancy value）
- The semantic-abstracted 3D module (Fig 3c, grey background)
  - treats the relevancy point cloud as the localization of a partially observed object and completes it into that object’s 3D occupancy
  - 将$\mathcal{R}^{proj}$体素化，得到3D voxel grid $\mathcal{R}^{vox}\in \mathbb{R}^{D\times 128\times 128\times 128}$
    
    疑问：scatter过程如何获得$D$维特征？原本是3D location + scalar relevancy value
  - 然后将3D Unet作为encoder，对$\mathcal{R}^{vox}$进行特征提取，获得3D特征：
    
    $f_{\text {encode }}\left(\mathcal{R}^{\text {vox }}\right) \mapsto Z \in \mathbb{R}^{D \times 128 \times 128 \times 128}$
  - 可以从$Z$中采样得到任意一个点云$q$位置的特征，然后使用一个MLP进行解码，获得occupancy probability
    
    $f_{\text {decode }}\left(\phi_{q}^{Z}\right) \mapsto o(q) \in[0,1]$
  - $f_{\text {encode }}$和$f_{\text {decode }}$使用3D dataset训练

A Multi-Scale Relevancy Extractor

基于gradcam

空间位置关系dataset（针对单张图像）

behind, left of, right of, in front, on top of, and inside