Reading Notes: XGPT: Cross-modal Generative Pre-Training for Image Captioning

Posted by Araloak on 2020-12-14

XGPT: Cross-modal Generative Pre-Training for Image Captioning


Contribution

  • Most existing VL pre-trained models use a Transformer-encoder architecture, which makes them ill-suited to vision-and-language generation tasks, because:

    On one hand, pre-trained models developed for understanding tasks only provide the encoder. To support generation tasks, separate decoders have to be trained, as in the methods proposed by VideoBERT and CBT. On the other hand, existing VL pre-training objectives are almost all related to masked region or span prediction, including VLP. None of the pre-training tasks is designed for whole-sentence generation.

  • This paper proposes a generative, encoder-decoder VL model. The decoder shares parameters with the encoder, and on the decoder side the self-attention and encoder-decoder attention parameters within each decoder block are also shared (see the parameter-sharing sketch after this list).

  • Three new pre-training tasks are proposed (four in total, counting the original Image Captioning):

    1. Image Captioning (IC)

      Given image regions as input, generate the caption autoregressively.

    2. Image-conditioned Masked Language Modeling (IMLM)

      Predict the masked contiguous fragment of tokens. The difference from BERT's MLM is that:

      the decoder has to generate masked tokens of the fragment, and extract useful image-conditioned information from the encoder side

    3. Image-conditioned Denoising Autoencoding (IDA)

      On the encoder side, the masked token fragment is marked with only a single [MASK] token, so the model must reconstruct the original sentence on the decoder side without knowing the length of the masked fragment; decoding is teacher-forced with the ground truth. (The encoder inputs for IMLM vs. IDA are contrasted in the sketch after this list.)

      To align text tokens with image regions, an attention matrix A is computed (see the alignment sketch after this list):

    [Figure: formula for the attention matrix A]

    With A, each word token is combined with a weighted sum of the information from every region, and the result is then fed into the encoder to generate the masked fragment.

    4. Text-conditioned Image Feature Generation (TIFG)

      TIFG aims to regress the decoder output of all image regions conditioned on text descriptions rather than only the masked regions.

      Only text is fed into the encoder; the decoder then regresses vectors with the same dimensionality as the image feature vectors, which are averaged and compared against the ground truth with an MSE loss (see the loss sketch below).
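
A minimal sketch of the parameter sharing described above, assuming a PyTorch-style implementation (the class and helper names are hypothetical, not the paper's code): a single MultiheadAttention module plays all three attention roles, so the encoder self-attention, the decoder self-attention, and the encoder-decoder attention share one set of weights.

```python
import torch
import torch.nn as nn

class SharedAttentionBlock(nn.Module):
    """One Transformer block in which a single attention module is reused
    for (1) encoder self-attention, (2) decoder self-attention, and
    (3) encoder-decoder attention. Residual/LayerNorm placement and the
    omitted feed-forward sublayer are simplifications."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def encode(self, x):
        # Encoder self-attention: Q = K = V = encoder states.
        h, _ = self.attn(x, x, x)
        return self.norm(x + h)

    def decode(self, y, memory):
        # Decoder self-attention: the SAME weights, plus a causal mask
        # (True marks positions a query may NOT attend to).
        t = y.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        h, _ = self.attn(y, y, y, attn_mask=mask)
        y = self.norm(y + h)
        # Encoder-decoder attention: Q from the decoder, K and V from the
        # encoder's final hidden states -- again the same weights.
        h, _ = self.attn(y, memory, memory)
        return self.norm(y + h)
```

This also bears on the cross-attention question below: the decoder queries attend over K and V computed from the encoder's final hidden layer.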
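To make the IMLM/IDA contrast concrete, here is a sketch of how the two encoder inputs could be built (the fragment length, sampling strategy, and per-token vs. single-token masking details are illustrative assumptions):

```python
import random

MASK = "[MASK]"

def make_imlm_example(tokens, span_len=3):
    """IMLM: a contiguous fragment is masked token-by-token, so the encoder
    input keeps its length; the decoder generates just the fragment.
    Assumes len(tokens) > span_len."""
    start = random.randrange(len(tokens) - span_len)
    fragment = tokens[start:start + span_len]
    encoder_input = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    return encoder_input, fragment  # decoder target: the fragment

def make_ida_example(tokens, span_len=3):
    """IDA: the whole fragment collapses into ONE [MASK], so the decoder
    must reconstruct the full sentence without knowing the fragment length."""
    start = random.randrange(len(tokens) - span_len)
    encoder_input = tokens[:start] + [MASK] + tokens[start + span_len:]
    return encoder_input, tokens  # decoder target: the original sentence
```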
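The figure with the exact formula for A did not survive extraction, so the definition below is only one plausible reading: a scaled dot-product alignment between word embeddings and region features.

```python
import torch
import torch.nn.functional as F

def align_words_to_regions(words: torch.Tensor, regions: torch.Tensor):
    """IDA text-image alignment sketch.

    words:   (T, d) text token embeddings
    regions: (R, d) image region features
    Returns word representations fused with attended region information.
    """
    d = words.size(-1)
    A = F.softmax(words @ regions.T / d ** 0.5, dim=-1)  # (T, R) attention matrix
    attended = A @ regions       # each word: weighted sum of region features
    return words + attended      # additive fusion is an assumption here
```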
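Finally, a sketch of the TIFG objective as described above; whether the ground-truth region features are averaged in the same way is an assumption.

```python
import torch
import torch.nn.functional as F

def tifg_loss(decoder_outputs: torch.Tensor, region_feats: torch.Tensor):
    """decoder_outputs: (R, d) vectors regressed by the decoder, with the
    same dimensionality as the image feature vectors.
    region_feats: (R, d) ground-truth region features."""
    pred = decoder_outputs.mean(dim=0)    # average the generated vectors
    target = region_feats.mean(dim=0)     # assumed: average the ground truth too
    return F.mse_loss(pred, target)
```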

Minor Concerns

The probability of each target token is estimated by the decoder given the cross-attention performed over the final hidden layer of the encoder.

  • Where exactly does the cross-attention appear? Does it simply refer to the K and V passed over from the final encoder layer?

  • In the Text-conditioned Image Feature Generation (TIFG) pre-training task, if only text is fed into the encoder, how are the semantic vectors for the image obtained on the decoder side? If the decoder starts from a blank input and generates them, then, given that image regions have no natural order, how can teacher forcing be performed?
    Perhaps the order of regions in the image sequence is imposed manually (top to bottom, left to right), as sketched below.
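
If that speculation holds, the ordering could be as simple as sorting the detected bounding boxes by their coordinates; a hypothetical helper:

```python
def order_regions(boxes):
    """boxes: list of (x1, y1, x2, y2) bounding boxes.
    Returns region indices sorted top-to-bottom, then left-to-right,
    giving teacher forcing a well-defined target sequence."""
    return sorted(range(len(boxes)), key=lambda i: (boxes[i][1], boxes[i][0]))
```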
