Paper Notes [4]: GPT-1 to GPT-3, an Overview and Comparison

Published by 姓菜名雞 on 2020-12-22

Papers:
Improving Language Understanding by Generative Pre-Training
Language Models are Unsupervised Multitask Learners
Language Models are Few-Shot Learners
Paper links: GPT-1, GPT-2, GPT-3
Team: OpenAI

GPT-1

Motivation & Objectives

  • most SOTA NLP models were trained specifically on a particular task

    • Limitations:
      ① need large amount of annotated data, not easily available
      ② fail to generalize
  • 2 Challenges:
    ① which optimization objectives are most effective?
    ② what’s the most efficient way to transfer learned representations to target task?

    ⇒ A semi-supervised approach using a combination of unsupervised pre-training and supervised fine-tuning
    ⇒ To learn a universal representation that transfers with little adaptation to a wide range of tasks

  • learning a generative language model using unlabeled data and then fine-tuning the model by providing examples of specific downstream tasks

  1. It is unclear which optimization objectives are most effective for learning representations that transfer well.
  2. There is no consensus on the most effective way to transfer these learned representations to the target task (existing methods use some combination of task-specific changes to the model architecture, intricate learning schemes, and auxiliary learning objectives).

Framework

Unsupervised Language Modeling (Pre-training):

- multi-layer Transformer decoder (N = 12)
- NO Encoder-Decoder Attention Layer

  1. The model body is a Transformer decoder with the encoder-decoder attention layer removed; the decoder output is passed through a softmax layer to produce the output distribution over target tokens.

  2. $h_n$ can be understood as the output layer's attention over the words in the vocabulary, and $h_n W_e$ as the attention score assigned to each token. A pre-trained GPT therefore stores the semantic and syntactic information learned from the corpus.
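For reference, the pre-training objective and decoder forward pass as given in the GPT-1 paper, where $\mathcal{U}$ is the unlabelled corpus, $U$ the matrix of context tokens, $W_e$ the token embedding matrix, $W_p$ the position embedding matrix, and $n$ the number of layers:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$

$$h_0 = U W_e + W_p, \qquad h_l = \mathrm{transformer\_block}(h_{l-1}) \;\; \forall l \in [1, n], \qquad P(u) = \mathrm{softmax}(h_n W_e^{\top})$$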

Supervised Fine-Tuning


  • objective to maximize: the supervised loss, with the LM objective kept as an auxiliary loss (the equations follow below)
  • Advantages of keeping the auxiliary LM objective:
    ① improving generalization of the supervised model
    ② accelerating convergence

$\lambda$ is a hyperparameter, set to 0.5.
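Concretely, the fine-tuning objective from the GPT-1 paper: given a labelled dataset $\mathcal{C}$ with inputs $x^1,\dots,x^m$ and label $y$, where $h_l^m$ is the final block's activation at the last input token,

$$P(y \mid x^1,\dots,x^m) = \mathrm{softmax}(h_l^m W_y), \qquad L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1,\dots,x^m), \qquad L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$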


Experiment

  • Dataset: BooksCorpus (7,000 unpublished books, i.e. data unseen in downstream benchmarks) with long stretches of contiguous text, which helped the model learn long-range dependencies
  • Unsupervised Training:
    • Byte Pair Encoding (BPE) vocabulary with 40,000 merges was used
    • 768-dimensional hidden states for encoding tokens
    • 12-layer model with 12 attention heads
    • position-wise feed-forward layers: 3072-dimensional
    • 117M parameters (a rough count is sketched after this list)
  • Supervised Fine-tuning:
    • 3 epochs sufficed for most of the downstream tasks → the model had already learnt a lot during pre-training, so minimal fine-tuning was enough
    • Most of the hyperparameters from unsupervised pre-training were used for fine-tuning
  • Works well across datasets of different sizes, from smaller datasets such as STS-B (5.7k training examples) to the largest one, SNLI (550k training examples)
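A quick sanity check on the 117M figure above, in Python; the vocabulary size (~40k tokens after BPE merges) and the 512-token context window are assumptions not listed in these notes:

```python
# Rough GPT-1 parameter count from the sizes listed above.
# Vocabulary (~40k BPE tokens) and context length (512) are assumed values.
d_model, d_ff, n_layers = 768, 3072, 12
vocab, context = 40_000, 512

embeddings = vocab * d_model + context * d_model   # token + position embeddings
attention  = 4 * d_model * d_model                 # Q, K, V and output projections
ffn        = 2 * d_model * d_ff                    # two position-wise linear layers
per_layer  = attention + ffn

total = embeddings + n_layers * per_layer
print(f"~{total / 1e6:.0f}M parameters")           # ~116M, close to the reported 117M
```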

Discussion

  • Impact of number of layers transferred

  • Evolution of zero-shot performance on different tasks as a function of LM pre-training updates

    • The more transformer blocks transferred (i.e. the deeper the language model), the better the results, which suggests each layer of the language model really does learn something different
    • Without fine-tuning, the more pre-training updates the language model has received, the better the final zero-shot results, which suggests pre-training really does learn general-purpose knowledge
  • Ablation studies

  • larger datasets benefit from the auxiliary objective but smaller datasets do not

  • 5.6 average score drop when the Transformer is replaced with an LSTM (a single-layer 2048-unit LSTM)

  • lack of pre-training hurts performance across all the tasks, resulting in a 14.8% decrease

Conclusion

  • GPT-1 achieved SOTA on 9 out of 12 tasks

  • decent zero-shot performance on various tasks

  • GPT-1 showed that language modelling is an effective pre-training objective. The architecture facilitated transfer learning and could handle various NLP tasks with very little fine-tuning.

GPT-2

Main Idea

  • Learning Objectives & Concepts

    • learning multiple tasks with the same unsupervised model (no supervised fine-tuning)
    • objective: P(output|input) → P(output|input, task) [task conditioning]
  • Zero Shot Learning and Zero Shot Task Transfer

    • no examples are provided and the model understands the task based on the given instruction
    • E.g. (translate to french, english text, french text)
  • LM = Unsupervised Multitask Learning

    • the supervised output is a subset of the language model sequence
    • E.g. 1: The translation of the term machine learning into Chinese is 機器學習.
    • E.g. 2: The President of the United States is Trump.

Compared with supervised multi-task learning, a language model simply does not need to explicitly specify which fields are the outputs to be predicted; in effect, the supervised outputs are just a subset of the language model's sequence.
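A minimal sketch of what task conditioning, P(output | input, task), looks like when the task itself is expressed as text; the prompt templates here are illustrative examples, not prompts taken from the paper:

```python
# Task conditioning: the task is specified purely as text in the input,
# so one unsupervised LM can be asked to do many tasks.
# These prompt templates are illustrative only.
def build_prompt(task: str, text: str) -> str:
    templates = {
        "translate_en_fr": f"translate English to French: {text} =>",
        "summarize":       f"{text}\nTL;DR:",
        "qa":              f"Question: {text}\nAnswer:",
    }
    return templates[task]

print(build_prompt("translate_en_fr", "Machine learning is fun."))
# The model then simply continues the prompt; no task-specific head or fine-tuning.
```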

Model Architecture


  • GPT-2 has 1.5 billion parameters [GPT-1 (117M parameters)]
  • 48 layers, 1600-dimension
  • Larger vocabulary of 50,257 tokens
  • Larger batch size of 512 and larger context window of 1024 tokens
  • Layer normalisation was moved to the input of each sub-block, and an additional layer normalisation was added after the final self-attention block
  • At initialisation, the weight of residual layers was scaled by 1/√N, where N was the number of residual layers
    The 1/√N scaling means the initialisation of the residual layers is adjusted according to network depth, so the contribution accumulated along the residual path stays roughly constant as layers are added.
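A minimal PyTorch-style sketch of these two changes (pre-layer-norm sub-blocks and 1/√N scaling of the residual projections at initialisation); this is an illustration, not OpenAI's implementation:

```python
import math
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """GPT-2-style block: layer norm at the *input* of each sub-block."""
    def __init__(self, d_model: int, n_heads: int, n_residual_layers: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # Scale the residual projections by 1/sqrt(N), N = number of residual layers.
        for proj in (self.attn.out_proj, self.mlp[-1]):
            with torch.no_grad():
                proj.weight.mul_(1.0 / math.sqrt(n_residual_layers))

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)                        # pre-LN before self-attention
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))          # pre-LN before the feed-forward sub-block
        return x
```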

Dataset & Experiment

  • WebText: 40GB of text from over 8 million documents (Wikipedia documents were removed)
  • On French-to-English translation, GPT-2 performed better than most unsupervised models in the zero-shot setting, but did not outperform the SOTA unsupervised model
  • GPT-2 did not perform well on text summarization; its performance was similar to or worse than classic models trained for summarization

Generalization vs Memorization

Recent computer vision research has shown that image datasets often contain near-duplicate images; for example, CIFAR-10 has 3.3% overlap between its training and test sets, which leads to over-reporting the generalization performance of machine learning systems.

  • Overlapping → over-reporting of the generalization performance of machine learning systems
  • Bloom filters containing 8-grams of WebText training set tokens
  • recommend the use of n-gram overlap based de-duplication as an important verification step and sanity check during the creation of training and test splits for new NLP datasets.
  • performance on training and test sets is similar and improves together as model size increases → GPT-2 is still underfitting WebText in many ways
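A minimal sketch of the 8-gram overlap check described above, using a plain Python set in place of a Bloom filter for clarity (a Bloom filter gives the same membership test with far less memory, at the cost of a small false-positive rate):

```python
# Fraction of a test document's 8-grams that also occur in the training data.
# A set stands in for the Bloom filter used in the GPT-2 paper.
def ngrams(tokens, n=8):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(train_docs, test_doc, n=8):
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc.lower().split(), n)
    test_grams = ngrams(test_doc.lower().split(), n)
    return len(test_grams & train_grams) / len(test_grams) if test_grams else 0.0

# A high score flags a test document as (partially) contained in the training set.
print(overlap(["the quick brown fox jumps over the lazy dog today"],
              "the quick brown fox jumps over the lazy dog again"))   # ≈ 0.67
```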

Summary

  • achieved SOTA results on 7 out of 8 tested language modelling datasets in the zero-shot setting
  • larger dataset & more parameters improved the capability of LM to understand tasks
  • How to use: give the model a prompt and let it generate autoregressively, one token at a time (a minimal example follows)
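For example, a minimal way to try GPT-2 via the Hugging Face transformers library (not part of the original release, shown purely as an illustration):

```python
# pip install transformers torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("Machine learning is", return_tensors="pt")
output = model.generate(input_ids, max_length=40, do_sample=True, top_k=50,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```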

GPT-3

Introduction

  • Limitation of previous approaches: although the architecture is task-agnostic, task-specific datasets and fine-tuning are still needed

  • BERT, etc:
    ① Excessive reliance on supervised data in the field
    ② Overfitting to the data distribution
    ⇒ Focus on a more general NLP model
    ⇒ Less supervised data, no fine-tuning.

  • Concepts

    • In-context learning: large language models develop pattern-recognition and other skills from the text they are trained on, and at inference time pick up a task from the prompt alone, without gradient updates
    • Few-shot, one-shot and zero-shot settings: capability ↑ as model capacity ↑ (prompt formats are sketched below)

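The three settings differ only in how many worked examples (K) are placed in the context. A minimal sketch; the translation pairs are the illustration used in the GPT-3 paper, the helper function itself is mine:

```python
# Zero-shot (K=0), one-shot (K=1) and few-shot (K up to a few dozen) prompts
# differ only in the number of in-context examples; no gradient updates occur.
def make_prompt(task_description, examples, query):
    lines = [task_description]              # natural-language task description
    for source, target in examples:         # K worked examples
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")             # the model completes this line
    return "\n".join(lines)

few_shot = make_prompt("Translate English to French:",
                       [("sea otter", "loutre de mer"), ("cheese", "fromage")],
                       "peppermint")
print(few_shot)
```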

Model and Implementation details

  • GPT-3 has 96 layers with each layer having 96 attention heads.
  • Size of word embeddings: 1600 for GPT-2 → 12,288 for GPT-3
  • Context window size: 1024 for GPT-2 → 2048 tokens for GPT-3
  • Alternating dense and locally banded sparse attention patterns
    • sparse attention:
      GPT-3 uses alternating dense and locally banded sparse attention patterns (no details are given; see the paper Generating Long Sequences with Sparse Transformers). The Sparse Transformer attends only to the k states that contribute most: by explicitly selecting just a few elements, values that are not highly relevant to the query are zeroed out, in contrast with conventional attention.
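A small numpy sketch of what the locally banded (sliding-window) causal pattern looks like; this illustrates the idea only and is not OpenAI's implementation:

```python
import numpy as np

def banded_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """True where attention is allowed: each query position attends only to
    the previous `window` positions (including itself), never to the future."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (j > i - window)

print(banded_causal_mask(seq_len=8, window=3).astype(int))
# Dense causal layers keep the whole lower triangle (j <= i);
# GPT-3 alternates dense and banded patterns across layers.
```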

Experiment

  • Dataset (filtered down from ~45TB of raw text): trained on a mix of 5 different corpora, each with a sampling weight assigned to it; high-quality datasets were sampled more often, so the model sees them for more than 1 epoch
    • downloaded and filtered a version of CommonCrawl
    • fuzzy deduplication
    • added high-quality reference corpora
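A minimal sketch of weighted mixture sampling across corpora; the corpus names follow the GPT-3 paper, but the weights here are illustrative placeholders, not the actual mixture weights:

```python
import random

# Illustrative sampling weights only; the real GPT-3 mixture weights differ.
corpora = {
    "CommonCrawl (filtered)": 0.60,
    "WebText2":               0.22,
    "Books1":                 0.08,
    "Books2":                 0.08,
    "Wikipedia":              0.02,
}

def sample_corpus():
    # Higher-weight (higher-quality) corpora are drawn more often, so they are
    # effectively seen for more epochs than their share of the raw data implies.
    names, weights = zip(*corpora.items())
    return random.choices(names, weights=weights, k=1)[0]

print(sample_corpus())
```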

Discussion & Broader Impacts


  • loses coherence when formulating long passages and tends to repeat sequences
  • does not perform very well on tasks such as fill-in-the-blank and some reading-comprehension tasks
  • Unidirectionality: the purely left-to-right, autoregressive structure may itself be a limitation on such tasks
  • lacks the notion of task- or goal-oriented prediction of tokens; the paper suggests:
    • augmenting the learning objective, using reinforcement learning to fine-tune models, etc.
  • complex and costly to train and serve, heavy architecture, limited interpretability

  • misuse of its human-like text generating capability for phishing, spamming, spreading misinformation
  • gender, ethnicity, race & religion bias

Conclusion

  • BIGGER

  • In the few-shot setting, it surpasses the fine-tuned SOTA on some NLU tasks

  • performs well on downstream NLP tasks in the zero-shot and few-shot settings: writing articles, doing arithmetic, writing code, etc.

  • Most impressive: generalization




