
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

Previous MLLMs only produce coarse-grained descriptions and understanding of an image, without exploring the fine-grained relations between visual objects and the other given modalities. BuboGPT introduces visual grounding capability into MLLMs.


Methodology

Visual Grounding Pipeline

  • tagging module (Recognize Anything Model (RAM))
  • grounding module (Grounding DINO + Segment Anything Model (SAM))
  • entity-matching module (implemented with GPT via prompting; a minimal sketch of the full pipeline follows this list)
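
Putting the three modules together, here is a minimal sketch of the pipeline. The wrapper functions run_ram, run_grounding_dino, run_sam and match_entities_with_gpt are hypothetical stand-ins for the respective models, not the actual BuboGPT API:

def ground_llm_response(image, llm_response):
    # 1) Tagging module: RAM proposes candidate tags for the image.
    tags = run_ram(image)                        # hypothetical wrapper, e.g. ['dog', 'grass', 'field']

    # 2) Grounding module: Grounding DINO localizes each tag as boxes,
    #    then SAM refines every box into a segmentation mask.
    boxes = run_grounding_dino(image, tags)      # hypothetical wrapper: {tag: [box, ...]}
    masks = {tag: [run_sam(image, box) for box in box_list]   # hypothetical wrapper
             for tag, box_list in boxes.items()}

    # 3) Entity-matching module: GPT links entities mentioned in the LLM's text
    #    response to the tagged entities (see the prompts below).
    pairs = match_entities_with_gpt(entity_list=tags, text=llm_response)  # [(text span, tag), ...]

    # Attach the grounded masks to each matched span of the response.
    return {span: masks.get(tag, []) for span, tag in pairs}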


Prompts used in the entity-matching module:

# few-shot examples
self.examples = [
    (
        "<List>['dog', 'sheepdog', 'grass', 'chase sheepdog', 'field', 'field park', 'grassy', 'corgi', 'brown dog', 'brown', 'park']</List>"
        "<Text>A brown dog running in the grassy field</Text>",
        'brown dog - brown dog\n'
        'grassy field - field'
    ),
    (
        "<List>['man', 'ride', 'bicycle', 'red', 'passenger train', 'track']</List>"
        "<Text>A man riding a bicycle next to a red passenger train on the tracks.</Text>",
        "man - man\n"
        "bicycle - bicycle\n"
        "red passenger train - passenger train\n"
        "tracks - track"
    )]
# prompt for GPT
self.system_prompt = "You are a helpful assistant. Now I will give you a list of entities and give you a " \
                     "paragraph or sentence. " \
                     "You need to first extract the entity given in the text and then " \
                     "find the corresponding entity having similar or identical meanings in the given list. " \
                     "Find all the pairs. " \
                     "Are you clear? Let us think step by step. " \
                     "The extracted entities must come from the given text and the corresponding entity must " \
                     "come from the given list. " \
                     "If multiple entities can be linked to the same span of text or vice versa, " \
                     "just keep one and do not merge them. " \
                     "Here is an example: <List>['dog', 'sheepdog', 'grass', 'chase sheepdog', 'field', " \
                     "'field park', 'grassy', 'corgi', 'brown dog', 'brown', 'park']</List> " \
                     "<Text>A brown dog running in the grassy field</Text> " \
                     "The answer is: brown dog - brown dog \n grassy field - field"

Multi-Modal LLM Training

  • aligns each modality with the LLM through a dedicated Q-Former (see the sketch after this list)
    • visual encoder - BLIP2
    • audio encoder - ImageBind
    • LLM - Vicuna
  • use a linear projection layer to connect the modality Q-Former with the LLM

  • two-stage training scheme

    • The modality encoders and the Vicuna model are kept fixed throughout the training procedure
    • Stage 1: Single-modal Pre-training

      • Similar to MiniGPT-4, the first stage aligns the output of the linear projection layer with the word embedding space of the LLM
      • The modality Q-Former and the linear projection layer are trained on a large amount of modality-text paired data
      • For visual perception, we only train the projection layer for image captioning with the Q-Former from BLIP2 fixed
      • For audio understanding, we jointly train the Q-Former and the projection layer for audio captioning
    • Stage 2: Multi-Modal Instruct Tuning

      • To make the model handle arbitrary combinations of input modalities, the authors design a general prompt template
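
A minimal PyTorch-style sketch of the alignment path and the stage-1 freezing scheme described above; the class, dimensions and method names are illustrative assumptions, not the actual BuboGPT modules:

import torch.nn as nn

class ModalityBranch(nn.Module):
    # frozen modality encoder -> modality Q-Former -> linear projection -> (frozen) Vicuna
    def __init__(self, encoder, qformer, qformer_dim=768, llm_dim=4096):
        super().__init__()
        self.encoder = encoder                       # BLIP2 visual encoder or ImageBind audio encoder
        self.qformer = qformer                       # modality Q-Former
        self.proj = nn.Linear(qformer_dim, llm_dim)  # maps query tokens into Vicuna's embedding space
        # The modality encoder (like Vicuna itself) stays fixed throughout training.
        for p in self.encoder.parameters():
            p.requires_grad = False

    def configure_stage1(self, modality):
        # Stage 1: for images only the projection layer is trained (BLIP2 Q-Former kept fixed);
        # for audio the Q-Former and the projection layer are trained jointly.
        train_qformer = (modality == "audio")
        for p in self.qformer.parameters():
            p.requires_grad = train_qformer

    def forward(self, x):
        feats = self.encoder(x)        # modality features
        queries = self.qformer(feats)  # fixed number of query tokens
        return self.proj(queries)      # tokens aligned with the LLM word embedding space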


Instruction-Tuning Datasets

  • Image-Text Dataset

We employ two previously published datasets for visual instruct tuning. The first one is released by MiniGPT-4 and contains 3,439 high-quality text-image pairs. The second one, provided by LLaVA [6], contains 158K samples curated from the COCO dataset, covering three types of instructions, i.e., conversations (58K), detailed description (23K) and complex reasoning (77K).
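
A minimal sketch of how the two instruction sets could be mixed for stage-2 tuning; the dataset classes and paths are hypothetical placeholders rather than the actual BuboGPT data loaders:

from torch.utils.data import ConcatDataset, DataLoader

# MiniGPT4Pairs / LLaVAInstruct and the paths below are hypothetical placeholders.
minigpt4_pairs = MiniGPT4Pairs("data/minigpt4_3439/")   # 3,439 high-quality text-image pairs
llava_instruct = LLaVAInstruct("data/llava_158k/")      # 58K conversations + 23K descriptions + 77K reasoning

train_set = ConcatDataset([minigpt4_pairs, llava_instruct])
loader = DataLoader(train_set, batch_size=8, shuffle=True)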