BuboGPT
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
Previous MLLMs only give coarse-grained descriptions and understanding of images, without exploring the fine-grained relations between visual objects and the other given modalities. BuboGPT introduces visual grounding capability into MLLMs.
Methodology
Visual Grounding Pipeline
- tagging module (Recognize Anything Model (RAM))
- grounding module (Grounding DINO + Segment Anything Model (SAM))
- entity-matching module (implemented with GPT via prompting)
Prompts used in the entity-matching module include few-shot examples.
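The three modules form a simple pipeline: RAM proposes semantic tags, Grounding DINO and SAM localize them as boxes and masks, and a GPT prompt links entities in the LLM's answer to the localized regions. Below is a minimal Python sketch of that flow; the callable signatures and the `GroundedEntity` fields are illustrative assumptions, not BuboGPT's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class GroundedEntity:
    mention: str          # entity span appearing in the LLM's text response
    tag: str              # semantic tag proposed by the tagging module (RAM)
    box: List[float]      # bounding box from the grounding module (Grounding DINO)
    mask: object          # segmentation mask refined by SAM


def visual_grounding(
    image,
    llm_response: str,
    tagging_module: Callable,          # image -> list of tags (RAM)
    grounding_module: Callable,        # (image, tags) -> {tag: (box, mask)} (Grounding DINO + SAM)
    entity_matching_module: Callable,  # (response, tags) -> {mention: tag} (GPT, few-shot prompted)
) -> List[GroundedEntity]:
    # 1. Tagging: propose relevant semantic tags for the image.
    tags = tagging_module(image)

    # 2. Grounding: localize each tag as a box and refine it into a mask.
    regions: Dict[str, Tuple[List[float], object]] = grounding_module(image, tags)

    # 3. Entity matching: ask GPT (prompted with few-shot examples) to link
    #    entities mentioned in the response to the grounded tags.
    matches: Dict[str, str] = entity_matching_module(llm_response, list(regions))

    return [
        GroundedEntity(mention=m, tag=t, box=regions[t][0], mask=regions[t][1])
        for m, t in matches.items()
        if t in regions
    ]
```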
Multi-Modal LLM Training
- aligns each modality with the LLM via a dedicated Q-Former
- visual encoder - BLIP2
- audio encoder - ImageBind
- LLM - Vicuna
use a linear projection layer to connect the modality Q-Former with the LLM
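A minimal PyTorch sketch of one such branch is given below: features from a frozen modality encoder pass through the modality Q-Former and a trainable linear projection into the LLM's embedding space. The class name and the dimensions (768 for the Q-Former output, 4096 for Vicuna's hidden size) are illustrative assumptions, not the actual BuboGPT implementation.

```python
import torch
import torch.nn as nn


class ModalityBranch(nn.Module):
    """One modality branch: Q-Former queries projected into the LLM embedding space.

    Hypothetical sketch; stands in for the BLIP2 visual Q-Former or the
    ImageBind-based audio branch. Dimensions are illustrative."""

    def __init__(self, qformer: nn.Module, qformer_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.qformer = qformer                       # emits a fixed number of query tokens
        self.proj = nn.Linear(qformer_dim, llm_dim)  # the linear projection layer

    def forward(self, encoder_features: torch.Tensor) -> torch.Tensor:
        queries = self.qformer(encoder_features)     # (batch, num_queries, qformer_dim)
        return self.proj(queries)                    # (batch, num_queries, llm_dim)
```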
two-stage training scheme
- The modality encoders and the Vicuna model are kept frozen throughout the training procedure
Stage 1: Single-modal Pre-training
- Similar to MiniGPT-4, the first stage aligns the output of the linear projection layer with the LLM's word-embedding space
- The modality Q-Former and the linear projection layer are trained on a large amount of modality-text paired data
- For visual perception, we only train the projection layer for image captioning with the Q-Former from BLIP2 fixed
- For audio understanding, we jointly train the Q-Former and the projection layer for audio captioning
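To make the parameter freezing concrete, here is a hedged sketch of how the stage-1 trainable parameters could be selected, reusing the hypothetical attribute names `vision_branch`, `audio_branch`, `qformer`, and `proj` from the sketch above (they are not BuboGPT's actual attribute names).

```python
import torch.nn as nn


def configure_stage1(model: nn.Module) -> None:
    # Freeze everything first; the modality encoders and Vicuna never train.
    for p in model.parameters():
        p.requires_grad = False
    # Visual branch: train only the projection layer (the BLIP2 Q-Former stays fixed).
    for p in model.vision_branch.proj.parameters():
        p.requires_grad = True
    # Audio branch: train both the Q-Former and the projection layer.
    for p in model.audio_branch.qformer.parameters():
        p.requires_grad = True
    for p in model.audio_branch.proj.parameters():
        p.requires_grad = True
```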
Stage 2: Multi-Modal Instruct Tuning
- To adapt the model to arbitrary combinations of input modalities, the authors design a general prompt
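The sketch below illustrates one way such a general prompt could be assembled, following the MiniGPT-4-style "###Human: ... ###Assistant:" convention; the exact special tokens used by BuboGPT are an assumption here.

```python
def build_prompt(instruction: str, has_image: bool = False, has_audio: bool = False) -> str:
    """Assemble an instruction prompt that accepts any combination of modalities.

    Hypothetical tag names; <ModalityHere> marks where the projected query
    embeddings of that modality are spliced into the LLM input."""
    parts = ["###Human:"]
    if has_image:
        parts.append("<Vision><ModalityHere></Vision>")
    if has_audio:
        parts.append("<Audio><ModalityHere></Audio>")
    parts.append(instruction)
    parts.append("###Assistant:")
    return " ".join(parts)


# e.g. joint image + audio input:
# build_prompt("Describe the image and the sound.", has_image=True, has_audio=True)
```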
Instruction-Tuning Datasets
- Image-Text Dataset
We employ two previously published datasets for visual instruction tuning. The first is released by MiniGPT-4 and contains 3,439 high-quality image-text pairs. The second, provided by LLaVA [6], consists of 158K samples curated from the COCO dataset, covering three types of instructions: conversations (58K), detailed description (23K) and complex reasoning (77K).
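For reference, LLaVA-style instruction samples are stored as multi-turn conversations; the sketch below shows that structure with placeholder values (the field names follow the public LLaVA release, and the content is illustrative only, not taken from the dataset).

```python
# Illustrative structure of one LLaVA-style instruction-tuning sample
# (placeholder id, filename, and dialogue).
sample = {
    "id": "000000123456",
    "image": "000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the person in the photo doing?"},
        {"from": "gpt", "value": "The person is riding a bicycle along a city street."},
    ],
}
```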