Reading Notes: XGPT: Cross-modal Generative Pre-Training for Image Captioning

Posted by Araloak on 2020-12-14

XGPT: Cross-modal Generative Pre-Training for Image Captioning


Contribution

  • Most existing VL pre-trained models use a Transformer-encoder architecture, which makes them ill-suited to vision-and-language generation tasks, because:

    On one hand, pre-trained models developed for understanding tasks only provide the encoder. To support generation tasks, separate decoders have to be trained, as in the methods proposed by VideoBERT and CBT. On the other hand, existing VL pre-training objectives are almost all related to masked region or span prediction, including VLP. None of the pre-training tasks is designed for whole-sentence generation.

  • This paper proposes a generative, encoder-decoder VL model. The decoder shares parameters with the encoder, and on the decoder side the self-attention and encoder-decoder attention parameters within each decoder block are also shared (see the parameter-sharing sketch after this list).

  • Three new pre-training tasks are proposed (four in total, counting the original Image Captioning):

    1. Image Captioning (IC)

      Given image regions as input, generate the caption autoregressively.

    2. Image-conditioned Masked Language Modeling (IMLM)

      Predict the masked contiguous fragment of tokens. The difference from BERT's MLM is that:

      the decoder has to generate masked tokens of the fragment, and extract useful image-conditioned information from the encoder side

    3. Image-conditioned Denoising Autoencoding (IDA)

      On the encoder side, the masked token fragment is marked with only a single [MASK] token, so the model must reconstruct the original sentence on the decoder side without knowing the length of the masked fragment; decoding is teacher-forced with the ground truth. (The encoder inputs for IMLM vs. IDA are contrasted in the sketch after this list.)

      To align text tokens with image regions, an attention matrix A is computed (see the alignment sketch after this list):

    [Figure: formula for the attention matrix A]

    With A, each word token is combined with a weighted sum of the information from every region, and the result is then fed into the encoder to generate the masked fragment.

    4. Text-conditioned Image Feature Generation (TIFG)

      TIFG aims to regress the decoder output of all image regions conditioned on text descriptions rather than only the masked regions.

      Only text is fed into the encoder; the decoder then regresses vectors with the same dimensionality as the image feature vectors, which are averaged and compared against the ground truth with an MSE loss (see the loss sketch below).
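
A minimal sketch of the parameter sharing described above, assuming a PyTorch-style implementation (the class and helper names are hypothetical, not the paper's code): a single MultiheadAttention module plays all three attention roles, so the encoder self-attention, the decoder self-attention, and the encoder-decoder attention share one set of weights.

```python
import torch
import torch.nn as nn

class SharedAttentionBlock(nn.Module):
    """One Transformer block in which a single attention module is reused
    for (1) encoder self-attention, (2) decoder self-attention, and
    (3) encoder-decoder attention. Residual/LayerNorm placement and the
    omitted feed-forward sublayer are simplifications."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def encode(self, x):
        # Encoder self-attention: Q = K = V = encoder states.
        h, _ = self.attn(x, x, x)
        return self.norm(x + h)

    def decode(self, y, memory):
        # Decoder self-attention: the SAME weights, plus a causal mask
        # (True marks positions a query may NOT attend to).
        t = y.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        h, _ = self.attn(y, y, y, attn_mask=mask)
        y = self.norm(y + h)
        # Encoder-decoder attention: Q from the decoder, K and V from the
        # encoder's final hidden states -- again the same weights.
        h, _ = self.attn(y, memory, memory)
        return self.norm(y + h)
```

This also bears on the cross-attention question below: the decoder queries attend over K and V computed from the encoder's final hidden layer.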
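To make the IMLM/IDA contrast concrete, here is a sketch of how the two encoder inputs could be built (the fragment length, sampling strategy, and per-token vs. single-token masking details are illustrative assumptions):

```python
import random

MASK = "[MASK]"

def make_imlm_example(tokens, span_len=3):
    """IMLM: a contiguous fragment is masked token-by-token, so the encoder
    input keeps its length; the decoder generates just the fragment.
    Assumes len(tokens) > span_len."""
    start = random.randrange(len(tokens) - span_len)
    fragment = tokens[start:start + span_len]
    encoder_input = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    return encoder_input, fragment  # decoder target: the fragment

def make_ida_example(tokens, span_len=3):
    """IDA: the whole fragment collapses into ONE [MASK], so the decoder
    must reconstruct the full sentence without knowing the fragment length."""
    start = random.randrange(len(tokens) - span_len)
    encoder_input = tokens[:start] + [MASK] + tokens[start + span_len:]
    return encoder_input, tokens  # decoder target: the original sentence
```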
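The figure with the exact formula for A did not survive extraction, so the definition below is only one plausible reading: a scaled dot-product alignment between word embeddings and region features.

```python
import torch
import torch.nn.functional as F

def align_words_to_regions(words: torch.Tensor, regions: torch.Tensor):
    """IDA text-image alignment sketch.

    words:   (T, d) text token embeddings
    regions: (R, d) image region features
    Returns word representations fused with attended region information.
    """
    d = words.size(-1)
    A = F.softmax(words @ regions.T / d ** 0.5, dim=-1)  # (T, R) attention matrix
    attended = A @ regions       # each word: weighted sum of region features
    return words + attended      # additive fusion is an assumption here
```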
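Finally, a sketch of the TIFG objective as described above; whether the ground-truth region features are averaged in the same way is an assumption.

```python
import torch
import torch.nn.functional as F

def tifg_loss(decoder_outputs: torch.Tensor, region_feats: torch.Tensor):
    """decoder_outputs: (R, d) vectors regressed by the decoder, with the
    same dimensionality as the image feature vectors.
    region_feats: (R, d) ground-truth region features."""
    pred = decoder_outputs.mean(dim=0)    # average the generated vectors
    target = region_feats.mean(dim=0)     # assumed: average the ground truth too
    return F.mse_loss(pred, target)
```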

Minor Concerns

The probability of each target token is estimated by the decoder given the cross-attention performed over the final hidden layer of the encoder.

  • Where exactly does the cross-attention appear? Does it simply refer to the K and V passed over from the final encoder layer?

  • In the Text-conditioned Image Feature Generation (TIFG) pre-training task, if only text is fed into the encoder, how are the semantic vectors for the image obtained on the decoder side? If the decoder starts from a blank input and generates them, then, given that image regions have no natural order, how can teacher forcing be performed?
    Perhaps the order of regions in the image sequence is imposed manually (top to bottom, left to right), as sketched below.
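
If that speculation holds, the ordering could be as simple as sorting the detected bounding boxes by their coordinates; a hypothetical helper:

```python
def order_regions(boxes):
    """boxes: list of (x1, y1, x2, y2) bounding boxes.
    Returns region indices sorted top-to-bottom, then left-to-right,
    giving teacher forcing a well-defined target sequence."""
    return sorted(range(len(boxes)), key=lambda i: (boxes[i][1], boxes[i][0]))
```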
