
Let the original coordinate frame be A and the rotated frame be B. The rotation matrix R is defined as

$R_{B}^{A}=\left[\begin{array}{lll}
\widehat{\mathbf{X}}_{\mathrm{B}} \cdot \widehat{\mathbf{X}}_{\mathrm{A}} & \widehat{\mathbf{Y}}_{\mathrm{B}} \cdot \widehat{\mathbf{X}}_{\mathrm{A}} & \widehat{\mathbf{Z}}_{\mathrm{B}} \cdot \widehat{\mathbf{X}}_{\mathrm{A}} \\
\widehat{\mathbf{X}}_{\mathrm{B}} \cdot \widehat{\mathbf{Y}}_{\mathrm{A}} & \widehat{\mathbf{Y}}_{\mathrm{B}} \cdot \widehat{\mathbf{Y}}_{\mathrm{A}} & \widehat{\mathbf{Z}}_{\mathrm{B}} \cdot \widehat{\mathbf{Y}}_{\mathrm{A}} \\
\widehat{\mathbf{X}}_{\mathrm{B}} \cdot \widehat{\mathbf{Z}}_{\mathrm{A}} & \widehat{\mathbf{Y}}_{\mathrm{B}} \cdot \widehat{\mathbf{Z}}_{\mathrm{A}} & \widehat{\mathbf{Z}}_{\mathrm{B}} \cdot \widehat{\mathbf{Z}}_{\mathrm{A}}
\end{array}\right]$

The nine entries of the rotation matrix are the dot products between the unit basis vectors of the three axes of the original frame A and those of the rotated frame B. Since they are unit vectors, each dot product is exactly the cosine of the angle between the two axes.

Taking the first row as an example, the three cosines are those between B's X, Y, and Z axes and A's X axis, i.e., the projections of those three axes onto A's X axis. The magnitude of each cosine indicates how much that axis of B "contributes" to A's X axis; in other words, the row describes how A's X axis is "composed" of B's axes.

In 3D vision, if R represents the camera's orientation in the world coordinate frame, each row of R corresponds to the direction of one of the camera's axes expressed in world coordinates: typically, the first row gives the X-axis direction, the second row the Y-axis direction, and the third row the Z-axis direction.

import numpy as np

# definition (x1..z1 and r11..r33 are placeholders for concrete values)
p1 = np.array([x1, y1, z1])              # point expressed in the original frame
R = np.array([[r11, r12, r13],
              [r21, r22, r23],
              [r31, r32, r33]])          # rotation matrix

# transformation
p2 = R @ p1
# p2 = [r11*x1 + r12*y1 + r13*z1,
#       r21*x1 + r22*y1 + r23*z1,
#       r31*x1 + r32*y1 + r33*z1]

  • For a rotation matrix, each row describes how the corresponding coordinate of p2 is composed of the three coordinates of p1.
  • Each column describes how the corresponding coordinate of p1 is composed of the three coordinates of p2.
  • The inverse of p2 = R@p1 is p1 = R.T@p2, since a rotation matrix is orthogonal and its inverse equals its transpose (see the sketch below).
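
As a quick sanity check of the transpose-as-inverse property, here is a minimal sketch (my own example, not from the original post) that builds a rotation about the Z axis and verifies that R.T undoes R:

import numpy as np

theta = np.deg2rad(30.0)                                # example rotation angle
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])    # rotation about the Z axis

p1 = np.array([1.0, 2.0, 3.0])                          # arbitrary test point
p2 = R @ p1                                             # forward transform
assert np.allclose(R.T @ R, np.eye(3))                  # R is orthogonal
assert np.allclose(R.T @ p2, p1)                        # R.T recovers p1
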
Read more »

Pixel2Mesh

Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images

  • Volumetric and point-cloud representations lack detailed information about object surfaces.
  • Mesh models are more widely applicable, lightweight, and rich in detail.

[Figure: Pixel2Mesh pipeline]

  • Takes a single RGB image as input (Input Image) and initializes an ellipsoid as the initial 3D shape (Ellipsoid Mesh).
  • A fully convolutional network extracts image features (upper half of the figure).
  • Perceptual Feature Pooling injects these image features to guide the update of node states in the graph convolutional network.
  • A graph convolutional network represents the 3D mesh and repeatedly deforms it, increasing the number of vertices to produce the final output in a coarse-to-fine manner (lower half of the figure).
    • Graph unpooling upsamples the number of graph nodes to obtain a finer mesh.
    • The vertices and edges of the 3D mesh are defined as the nodes and edges of the graph (see the sketch below).
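
To make the mesh-as-graph view concrete, here is a minimal sketch (mine, not the paper's code) that turns a triangle mesh, given as vertex positions and faces, into the node features and edge list a graph convolution would operate on:

import numpy as np

def mesh_to_graph(vertices, faces):
    """vertices: (N, 3) coordinates; faces: (M, 3) vertex indices per triangle."""
    edges = set()
    for a, b, c in faces:
        # each triangle contributes three undirected edges
        for i, j in ((a, b), (b, c), (c, a)):
            edges.add((min(i, j), max(i, j)))
    nodes = np.asarray(vertices, dtype=np.float32)        # node features = 3D coordinates
    edge_list = np.array(sorted(edges), dtype=np.int64)   # (E, 2) edge index pairs
    return nodes, edge_list

# toy example: a single triangle
nodes, edge_list = mesh_to_graph(vertices=[[0, 0, 0], [1, 0, 0], [0, 1, 0]],
                                 faces=[[0, 1, 2]])
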
Read more »

Open Vocabulary 3D Scene Understanding

CVPR2022 PointCLIP

PointCLIP: Point Cloud Understanding by CLIP

The first work to apply CLIP to 3D point clouds, enabling zero-shot and few-shot learning for point cloud classification.

Methodology

[Figure: PointCLIP framework]

Extracting point cloud features

The 3D point cloud is projected onto several predefined planes (6 orthogonal views) to obtain 2D images. Concretely, projecting onto the bottom view maps a point with coordinates $(x,y,z)$ to $([x/z],[y/z])$ on the image plane, so the projected image is perspectively scaled (nearer points appear larger, farther points smaller). Instead of using a convolution to turn 1 channel into 3, the depth $z$ is used directly as the pixel value and replicated to every channel, yielding a 3-channel image.
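
A rough sketch of this projection step (my own illustration of the description above; the image size and scaling factor are assumptions, not values from the paper):

import numpy as np

def project_bottom_view(points, img_size=128, scale=50.0):
    """Project an (N, 3) point cloud to a 3-channel depth image."""
    img = np.zeros((img_size, img_size), dtype=np.float32)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    z = np.clip(z, 1e-3, None)                                  # avoid division by zero
    u = np.round(x / z * scale).astype(int) + img_size // 2     # [x/z]
    v = np.round(y / z * scale).astype(int) + img_size // 2     # [y/z]
    valid = (u >= 0) & (u < img_size) & (v >= 0) & (v < img_size)
    img[v[valid], u[valid]] = z[valid]                          # depth as pixel value
    return np.stack([img, img, img], axis=0)                    # replicate to 3 channels

depth_img = project_bottom_view(np.random.rand(1024, 3) + 0.5)  # (3, 128, 128)
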

Read more »

Semantic Abstraction

Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models

Perceiving and reasoning about 3D environments from an open-set vocabulary and out-of-domain visual input is a key capability for robots operating in the 3D world. The paper proposes Semantic Abstraction (SemAbs), a framework that equips 2D Vision-Language Models (VLMs) with new 3D spatial capabilities while maintaining their zero-shot robustness.

SemAbs gives 2D VLMs 3D spatial capabilities while preserving their zero-shot robustness, demonstrated on two open-world 3D scene understanding tasks:

  • completing partially observed objects
  • localizing hidden objects from language descriptions

[Figure: SemAbs task overview]

SemAbs retains the 2D VLM's zero-shot robustness to:

  • novel vocabulary (i.e., object attributes, synonyms of object nouns)
  • visual properties (e.g. lighting, textures)
  • domains (e.g. sim v.s. real)

[Figure: SemAbs framework]

Read more »

BuboGPT

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

Previous MLLMs give only coarse-grained descriptions and understanding of images, without digging into the fine-grained relationship between visual objects and the other given modalities. BuboGPT introduces visual grounding capability into MLLMs.

[Figure: BuboGPT overview]

Methodology

Visual Grounding Pipeline

  • tagging module (Recognize Anything Model (RAM))
  • grounding module (Grounding DINO + Segment Anything Model (SAM))
  • entity-matching module (implemented by prompting GPT; a rough sketch of the full pipeline follows the figure below)

[Figure: BuboGPT visual grounding pipeline]
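
A rough sketch of how these three modules could be chained; every callable below (run_ram, run_grounded_sam, match_entities) is a hypothetical placeholder for the corresponding off-the-shelf model, not BuboGPT's actual code:

def visual_grounding(image, caption, run_ram, run_grounded_sam, match_entities):
    """Tag -> ground -> match, following the pipeline description above."""
    tags = run_ram(image)                     # tagging module (RAM): list of entity tags
    masks = run_grounded_sam(image, tags)     # grounding module (Grounding DINO + SAM):
                                              #   {tag: segmentation mask}
    pairs = match_entities(caption, tags)     # entity-matching module (GPT prompt):
                                              #   [(text span, matched tag), ...]
    return [(span, tag, masks.get(tag)) for span, tag in pairs]
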

The prompts used in the entity-matching module:

# few-shot examples
self.examples = [
    (
        "<List>['dog', 'sheepdog', 'grass', 'chase sheepdog', 'field', 'field park', 'grassy', 'corgi', 'brown dog', 'brown', 'park']</List>"
        "<Text>A brown dog running in the grassy field</Text>",
        'brown dog - brown dog\n'
        'grassy field - field'
    ),
    (
        "<List>['man', 'ride', 'bicycle', 'red', 'passenger train', 'track']</List>"
        "<Text>A man riding a bicycle next to a red passenger train on the tracks.</Text>",
        "man - man\n"
        "bicycle - bicycle\n"
        "red passenger train - passenger train\n"
        "tracks - track"
    ),
]

# prompt for GPT
self.system_prompt = "You are a helpful assistant. Now I will give you a list of entities and give you a " \
                     "paragraph or sentence. " \
                     "you need to first extract the entity given in the text and then" \
                     "find the corresponding entity having similar or identical meanings in the given list. " \
                     "Find all the pairs." \
                     "Are you clear? let us think step by step. " \
                     "The extracted entities must come from the given text and the corresponding entity must " \
                     "come from the given list. " \
                     "If multiple entities can be linked to the same span of text or vice versa, " \
                     "just keep one and do not merge them." \
                     "Here is an example: <List>['dog', 'sheepdog', 'grass', 'chase sheepdog', 'field', " \
                     "'field park', 'grassy', 'corgi', 'brown dog', 'brown', 'park']</List> " \
                     "<Text>A brown dog running in the grassy field</Text>" \
                     "The answer is: brown dog — brown dog \n grassy field — field"

Multi-Modal LLM Training

  • Each modality is aligned with the LLM through its own Q-Former:
    • visual encoder - BLIP2
    • audio encoder - ImageBind
    • LLM - Vicuna
  • use a linear projection layer to connect the modality Q-Former with the LLM (a minimal sketch of this connection follows the prompt figure below)

  • two-stage training scheme

    • The modality encoders and the Vicuna model will be fixed throughout the training procedure
    • Stage 1: Single-modal Pre-training

      • Similar to MiniGPT-4, the purpose of stage 1 is to align the output of the linear projection layer with the word embedding space of the LLM
      • The modality Q-Former and the linear projection layer are trained on a large amount of modality-text paired data
      • For visual perception, we only train the projection layer for image captioning with the Q-Former from BLIP2 fixed
      • For audio understanding, we jointly train the Q-Former and the projection layer for audio captioning
    • Stage 2: Multi-Modal Instruct Tuning

      • To make the model handle arbitrary combinations of input modalities, the authors design a general prompt template (shown in the figure below)

[Figure: general prompt template for multi-modal instruct tuning]
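
A minimal PyTorch sketch of the connection scheme described above (frozen modality Q-Former, trainable linear projection into the LLM's embedding space). The dimensions and class name are illustrative assumptions, not BuboGPT's actual configuration:

import torch.nn as nn

class ModalityConnector(nn.Module):
    """Frozen Q-Former output -> trainable linear projection -> LLM embedding space."""

    def __init__(self, q_former, q_former_dim=768, llm_dim=4096):
        super().__init__()
        self.q_former = q_former
        for p in self.q_former.parameters():
            p.requires_grad = False                   # Q-Former stays frozen (visual branch, stage 1)
        self.proj = nn.Linear(q_former_dim, llm_dim)  # only this layer is trained

    def forward(self, modality_features):
        queries = self.q_former(modality_features)    # (B, num_queries, q_former_dim)
        return self.proj(queries)                     # (B, num_queries, llm_dim), fed to the LLM
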

Instruction-Tuning Datasets

  • Image-Text Dataset

We employ two previously published datasets for visual instruct tuning. The first one is released by MiniGPT-4, which contains 3,439 high-quality text-image pairs. The second one, provided by LLaVA [6], is curated from 158K samples based on the COCO dataset, including three types of instructions, i.e., conversations (58K), detailed description (23K), and complex reasoning (77K).

Review of Large Vision Models and Visual Prompt Engineering

Applying large vision models to specific tasks requires an effective way to guide the model's learning and inference. This is where visual prompt engineering comes into play: it involves designing and optimizing visual prompts to steer large models toward the desired outputs.

CLIP on Wheels (CoW)

CoWs on PASTURE: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation

For robots to be generally useful, they must be able to find arbitrary objects described by people (i.e., be language-driven) even without expensive navigation training on in-domain data (i.e., perform zero-shot inference).

language-driven zero-shot object navigation (L-ZSON)

The PASTURE benchmark considers finding uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects.

[Figure: PASTURE benchmark overview]

  • A collection of baseline algorithms, CLIP on Wheels (CoW), which adapt open-vocabulary models to the task of L-ZSON.
  • A new benchmark, PASTURE, to evaluate CoW and future methods on L-ZSON. We consider the ability to find:
    • uncommon objects (e.g., “tie-dye surfboard”)
    • objects by their spatial and appearance attributes in the presence of distractor objects (e.g., “green apple” vs. “red apple”)
    • objects that cannot be visually observed (e.g., “mug under the bed”).

Zero-shot object navigation (ZSON)

agents are evaluated on object categories that they are not explicitly trained on

Both algorithms necessitate navigation training for millions of steps and train separate models for each simulation domain.

[44] ZSON: Zero-shot object-goal navigation using multimodal goal embeddings

[37] Simple but effective: CLIP embeddings for embodied AI

CLIP on Wheels (CoW) Baselines

[Figure: CoW baselines overview]

Depth-based Mapping

A map with 0.125 m resolution (a small sketch follows the figure below).

[Figure: depth-based mapping]
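
As an illustration of what such a map update might look like in code (my own sketch; the 0.125 m cell size comes from the text above, everything else is an assumption):

import numpy as np

CELL = 0.125  # meters per grid cell

def update_occupancy(grid, points_xy, origin_xy):
    """Mark grid cells containing back-projected depth points as occupied.

    grid: (H, W) array, 0 = free/unknown, 1 = occupied
    points_xy: (N, 2) obstacle points in world coordinates (meters)
    origin_xy: (2,) world coordinate of grid cell (0, 0)
    """
    idx = np.floor((points_xy - origin_xy) / CELL).astype(int)
    valid = (idx[:, 0] >= 0) & (idx[:, 0] < grid.shape[1]) & \
            (idx[:, 1] >= 0) & (idx[:, 1] < grid.shape[0])
    grid[idx[valid, 1], idx[valid, 0]] = 1      # row = y index, col = x index
    return grid

grid = update_occupancy(np.zeros((160, 160), dtype=int),
                        points_xy=np.array([[1.0, 2.0], [3.3, 0.7]]),
                        origin_xy=np.array([0.0, 0.0]))
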

Exploration

  • Frontier based exploration (FBE)
    • move to the frontier between free and unknown space to discover new regions. Once the navigator reaches a frontier, it moves greedily to the next closest frontier. A minimal frontier-detection sketch follows this list.
  • Learnable exploration
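
A minimal sketch of the frontier idea (free cells adjacent to unknown cells) on a 2D occupancy grid; the three-valued grid encoding is my own assumption, not CoW's implementation:

import numpy as np

FREE, UNKNOWN, OCCUPIED = 0, 1, 2

def frontier_cells(grid):
    """Return (row, col) indices of free cells that touch at least one unknown cell."""
    frontiers = []
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != FREE:
                continue
            neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if any(0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == UNKNOWN
                   for nr, nc in neighbors):
                frontiers.append((r, c))
    return frontiers  # FBE then navigates to the closest frontier and repeats
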

Object Localization

  • CLIP with k referring expressions
  • CLIP with k image patches
  • CLIP with gradient relevance
  • MDETR segmentation
  • OWL-ViT detection
  • Post-processing
  • Target driven planning

The PASTURE Benchmark

PASTURE builds on ROBOTHOR validation scenes

PASTURE evaluates seven core L-ZSON capabilities

  • Uncommon objects.

    Traditional benchmarks (e.g., ROBOTHOR and HABITAT MP3D) mainly evaluate on common object categories, but objects in household environments are highly diverse, so the authors add 12 new objects to each room.

    We use names shown in Fig. 4 as instance labels, which are minimal descriptions to identify each object. Some identifiers refer to text in images (e.g., “whiteboard saying CVPR”) or to appearance attributes (e.g., “wooden toy airplane”). Other objects are less common in North America, like “maté”, which is a popular Argentinian drink.

    [Figure 4: uncommon object instances added in PASTURE]

  • Appearance descriptions.

    Tests the ability to perceive visual attributes.

    “{size}, {color}, {material} {object}”. For example: “small, red apple”, “orange basketball”, “small, black, metallic alarm clock”.

  • Spatial descriptions.

    Tests the ability to perceive spatial information.

    “{object} on top of {x}, near {y}, {z}, …”. For example, “house plant on a dresser near a spray bottle”. To determine [on top of] relations, we use THOR metadata and to determine [nearness] we use a distance threshold between pairs of objects.

  • Appearance descriptions with distractors.

    Cases where distractor objects are present.

    For example, for the task of finding a “red apple”, we have both a red apple and a green apple in the room.

  • Spatial descriptions with distractors.

    Again with distractor objects present.

  • Hidden object descriptions.

    An ideal object navigator should find objects, even when they are hidden.

    “{object} under/in {x}”. For example, “basketball in the dresser drawers” or “vase under the sofa”. We sample large objects (e.g., beds, sofas, dressers) in each scene to determine [under/in] relations. Additionally we remove visible instances of {object} from the room.

  • Hidden object descriptions with distractors.

    Consider finding a “mug under the bed”. A distractor mug will also appear in the scene making the task more challenging.

Experiments

CVPR2021 Topological Planning with Transformers for Vision-and-Language Navigation

The agent is allowed to explore the environment using exploration trajectories (Fig. 2) before performing the VLN task. This mimics a setting closer to the real world, in which agents must build up their own understanding of indoor environments instead of being handed a predefined map.

[Figures: exploration trajectories (Fig. 2) and method overview]

Methodology

Overview

  • The agent starts with an exploration phase in which it builds a topological map of the environment; it then uses this topological understanding, together with the instruction text and the current observation, to perform the VLN task.

  • To perform the VLN task, the authors draw inspiration from the robotics community and adopt a modular approach. Specifically, the method is split into planning and control.

    • In the planning stage, before taking any action, the agent first generates an interpretable global navigation plan from the environment and the navigation instruction.
    • The robot then tries to follow the navigation plan using a repeating cycle of localization and control until it reaches the destination.
    • A hierarchical controller first predicts waypoint subgoals,
    • and then converts the waypoints into low-level actions such as FORWARD(0.25m) or ROTATE(15°) (a toy conversion sketch follows this list).
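
To illustrate that last step, here is a toy sketch (mine, not the paper's controller) that greedily discretizes a relative waypoint into ROTATE(15°) and FORWARD(0.25m) actions:

import math

FORWARD_STEP = 0.25   # meters per FORWARD action
ROTATE_STEP = 15.0    # degrees per ROTATE action

def waypoint_to_actions(dx, dy, heading_deg=0.0):
    """Rotate toward the waypoint, then drive forward, using discrete action steps."""
    target_deg = math.degrees(math.atan2(dy, dx))
    turn = (target_deg - heading_deg + 180.0) % 360.0 - 180.0   # wrap to [-180, 180)
    sign = "+" if turn >= 0 else "-"
    actions = [f"ROTATE({sign}15°)"] * round(abs(turn) / ROTATE_STEP)
    actions += ["FORWARD(0.25m)"] * round(math.hypot(dx, dy) / FORWARD_STEP)
    return actions

print(waypoint_to_actions(dx=1.0, dy=0.5))
# ['ROTATE(+15°)', 'ROTATE(+15°)', 'FORWARD(0.25m)', 'FORWARD(0.25m)', 'FORWARD(0.25m)', 'FORWARD(0.25m)']
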
Read more »

USA-Net

USA-Net: Unified Semantic and Affordance Representations for Robot Memory

The first end-to-end differentiable planner that optimizes for both semantics and affordance in a single implicit map.

To carry out an instruction like “take this can to the kitchen”, a robot needs two kinds of information:

  • semantic information: it needs to know where the kitchen is
  • affordance information: it needs to plan a trajectory that avoids collisions with other objects in the scene

Motivation

  • Semantic information can then be used to find the location of objects for planning, but this alone does not consider what kind of trajectory should be taken to reach the goal without colliding with obstacles.

  • in the case of a learned representation, we instead see a softer distribution of potential goals, where there may be tradeoff between trajectory planning and goal selection

    Inferring the goal location from a learned representation usually does not give an exact position but rather a set of potential goal locations; the best semantic location may be blocked by obstacles or otherwise unreachable, so reachability and goal selection must be traded off (a toy sketch of such a combined objective appears at the end of this excerpt).

  • Coarse-grained discrete motion planners, such as occupancy grids that use large cell sizes to conserve memory, may fail to capture fine-grained semantic information such as small details of the scene.

    To execute the command “get a can of soda from the fridge”, the robot must be able to navigate around coarse-grained obstacles in the environment while also interacting with the relatively fine-grained soda can.

How can we use RGB and depth data to capture a representation which can be used for both semantic goal navigation and motion planning?
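
One way to make the tradeoff above concrete is a single score that mixes a candidate goal's semantic similarity with the cost of actually reaching it. The sketch below is a schematic of that idea under my own simplifications (discrete candidate goals, precomputed path costs), not USA-Net's differentiable planner:

import numpy as np

def pick_goal(semantic_scores, path_costs, alpha=1.0):
    """Trade off goal quality against reachability.

    semantic_scores: (N,) similarity of each candidate location to the query
    path_costs: (N,) estimated cost of reaching each candidate (np.inf if unreachable)
    alpha: weight on reachability relative to semantics
    """
    utility = semantic_scores - alpha * path_costs
    return int(np.argmax(utility))

best = pick_goal(semantic_scores=np.array([0.9, 0.7, 0.6]),
                 path_costs=np.array([np.inf, 2.0, 0.5]))
# the top semantic match is unreachable, so a slightly worse but reachable goal (index 2) wins
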

Read more »

BEVBert

BEVBert: Multimodal Map Pre-training for Language-guided Navigation

Motivation

Most existing pre-training methods employ discrete panoramas to learn visual-textual associations. This requires the model to implicitly correlate incomplete, duplicate observations within the panoramas, which may impair an agent’s spatial understanding.

As shown in the figure below, incomplete observations within a single view and duplicate observations across views can confuse the agent; discrete panoramas require implicit spatial modeling and may hamper the learning of generic language-environment correspondence.

It is difficult to infer “the second bedroom opposite to the bookcase” because there are duplicate images of “bedroom” and “bookcase” across different views, so it is hard to tell whether they show the same object or multiple instances.

A potential solution is to project these observations into a unified map that explicitly aggregates incomplete observations and removes duplicates. Combining such maps with pre-training had not been explored before; this paper is the first to do so.

[Figure: incomplete and duplicate observations across panoramic views]

Representative metric-map work:

[16] Object goal navigation using goal-oriented semantic exploration

Drawback: storage and computation inefficiencies.

The metric map uses dense grid features to precisely describe the environment but has inefficiencies of scale. As a result, using a large map to capture the long-horizon navigation dependency can cause prohibitive computation, especially for the computation-intensive pre-training.

Representative topological-map work:

[17] Neural topological slam for visual navigation

Advantages:

  • the topo map can efficiently capture dependency by keeping track of visited locations as a graph structure
  • It also allows the agent to make efficient long-term goal plans, such as backtracking to a previous location

Drawback: the condensed feature representation lacks local fine-grained information.

  • However, each node in the graph is typically represented by condensed feature vectors, which lack fine-grained information for local spatial reasoning

Instead of using a large global metric map, this paper proposes a hybrid approach that combines a metric map with a topo map:

  • local metric map for short-term spatial reasoning

  • conducting overall long-term action plans on a global topo map (a small data-structure sketch follows)
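
To make the hybrid representation concrete, here is a small data-structure sketch (my own simplification; the feature dimension and grid size are assumptions, not BEVBert's):

import numpy as np
from dataclasses import dataclass, field

@dataclass
class TopoNode:
    position: np.ndarray      # (x, y) of a visited location
    feature: np.ndarray       # condensed feature vector summarizing that location

@dataclass
class HybridMap:
    # global topo map: visited locations as a graph, for long-term planning and backtracking
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (node_i, node_j) index pairs
    # local metric map: dense grid features around the agent, for short-term spatial reasoning
    local_grid: np.ndarray = field(default_factory=lambda: np.zeros((21, 21, 768)))

    def add_node(self, position, feature, connect_to=None):
        self.nodes.append(TopoNode(position, feature))
        if connect_to is not None:
            self.edges.append((connect_to, len(self.nodes) - 1))
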

Read more »