Paper Notes [4]: GPT-1 to GPT-3, Overview and Comparison
Papers:
Improving Language Understanding by Generative Pre-Training
Language Models are Unsupervised Multitask Learners
Language Models are Few-Shot Learners
Paper links: GPT-1 GPT-2 GPT-3
Team: OpenAI
GPT-1
Motivation & Objectives
- Most SOTA NLP models were trained specifically for a particular task
- Limitations:
  ① need a large amount of annotated data, which is not easily available
  ② fail to generalize
- Two challenges:
  ① It is unclear which optimization objectives are most effective for learning representations that transfer well
  ② There is no consensus on the most effective way to transfer the learned representations to the target task (existing methods combine task-specific changes to the model architecture, intricate learning schemes, and auxiliary learning objectives)
  ⇒ A semi-supervised approach using a combination of unsupervised pre-training and supervised fine-tuning
  ⇒ Goal: learn a universal representation that transfers with little adaptation to a wide range of tasks
- Approach: learn a generative language model using unlabeled data, then fine-tune the model by providing examples of specific downstream tasks
Framework
Unsupervised Language Modeling (Pre-training):
- multi-layer Transformer decoder (N = 12)
- NO Encoder-Decoder Attention Layer
- The model body is a Transformer decoder with the encoder-decoder attention layers removed; the decoder output is passed through a softmax layer to produce the output distribution over target tokens.
- $h_n$ can be understood as the output layer's weighting over the words in the vocabulary, and $h_n W_e^{\top}$ gives the score the output layer assigns to each token. After pre-training, GPT has stored the semantic and syntactic information learned from the corpus.
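For reference, the pre-training formulation as given in the GPT-1 paper, where $k$ is the context window size, $W_e$ the token embedding matrix, and $W_p$ the position embedding matrix:

```latex
% Unsupervised pre-training objective: maximize log-likelihood of corpus U
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)

% Forward pass of the decoder-only Transformer (n = 12 blocks)
h_0 = U W_e + W_p
h_l = \mathrm{transformer\_block}(h_{l-1}), \quad l \in [1, n]
P(u) = \mathrm{softmax}(h_n W_e^{\top})
```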
Supervised Fine-Tuning
- Objective to maximize: the supervised task objective, with language modeling kept as an auxiliary objective (see the formulation after this list)
- Advantages of the auxiliary LM objective:
① improving generalization of the supervised model
② accelerating convergence
- $\lambda$ is a hyperparameter, set to 0.5
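The fine-tuning formulation from the GPT-1 paper: the labeled inputs $x^1, \ldots, x^m$ are fed through the pre-trained model, the final transformer block's activation for the last token, $h_l^m$, feeds a newly added output layer $W_y$, and language modeling is retained as an auxiliary objective weighted by $\lambda$:

```latex
% Task head on the final token's activation
P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y)

% Supervised objective over labeled dataset C
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)

% Combined objective with language modeling as auxiliary task
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
```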
Experiment
- Dataset: BooksCorpus (7,000 unpublished books, i.e. unseen data) with long stretches of contiguous text, which helped the model learn long-range dependencies
- Unsupervised Training:
- Byte Pair Encoding (BPE) vocabulary with 40,000 merges was used (see the BPE sketch after this list)
- 768-dimensional state for encoding tokens
- 12-layer model with 12 attention heads
- position-wise feed forward layer: 3072-dimensional
- 117M parameters
- Supervised Fine-tuning:
- 3 epochs of training sufficed for most of the downstream tasks → the model had already learnt a lot during pre-training, so minimal fine-tuning was enough
- Most of the hyperparameters from unsupervised pre-training were used for fine-tuning
- Works well across datasets of different sizes, from smaller datasets such as STS-B (5.7k training examples) to the largest one, SNLI (550k training examples)
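Since the notes only mention "40,000 merges", here is a minimal, generic sketch of what one BPE merge step does. The toy corpus and helper names are mine, not OpenAI's tokenizer code:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over the corpus and return the most frequent one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word is a tuple of characters plus an end-of-word marker.
words = {tuple("low") + ("</w>",): 5,
         tuple("lower") + ("</w>",): 2,
         tuple("lowest") + ("</w>",): 3}

for step in range(4):            # GPT-1 runs ~40,000 such merges on its corpus
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(step, pair)
```

Each merge adds one new symbol to the vocabulary; repeating this 40,000 times yields the subword vocabulary used for pre-training.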
Discussion
- Impact of number of layers transferred
- Evolution of zero-shot performance on different tasks as a function of LM pre-training updates
- The more transformer blocks (i.e. the deeper the language model), the better the results, indicating that the different layers of the language model indeed learn different things
- Without fine-tuning the language model, the more pre-training updates it has received, the better the final zero-shot performance, indicating that LM pre-training indeed learns something general
- Ablation studies:
  - larger datasets benefit from the auxiliary objective but smaller datasets do not
  - 5.6 average score drop when using an LSTM (single-layer, 2048-unit LSTM)
  - lack of pre-training hurts performance across all the tasks, resulting in a 14.8% decrease
Conclusion
- GPT-1 performed SOTA in 9 out of 12 tasks
- decent zero-shot performance on various tasks
- GPT-1 proved that LM serves as an effective pre-training objective. The architecture facilitated transfer learning and could perform various NLP tasks with very little fine-tuning.
GPT-2
Main Idea
- Learning Objectives & Concepts
  - learning multiple tasks using the same unsupervised model (no supervised fine-tuning)
  - objective: P(output|input) → P(output|input, task) [task conditioning; see the prompt sketch after this list]
- Zero-Shot Learning and Zero-Shot Task Transfer
  - no examples are provided; the model has to understand the task from the given instruction alone
  - E.g. a translation sample is written as the sequence (translate to french, english text, french text)
- LM = Unsupervised Multitask Learning
  - the supervised output is a subset of the language model sequence
  - E.g. 1: "The translation of the word Machine Learning in Chinese is 機器學習."
  - E.g. 2: "The President of the United States is Trump."
  - Compared with supervised multitask learning, a language model simply does not need the fields to be predicted to be defined explicitly; in effect, the supervised output is just a subset of the language-model sequence
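To make "task conditioning as plain text" concrete, a minimal sketch follows. The prompt format and the build_prompt helper are illustrative, not the actual WebText or GPT-2 data format:

```python
def build_prompt(task, examples, query):
    """Encode P(output | input, task) as plain text: the task description and
    any demonstrations are simply written into the conditioning sequence."""
    lines = [task]
    for src, tgt in examples:
        lines.append(f"{src} = {tgt}")
    lines.append(f"{query} =")          # the language model continues this line
    return "\n".join(lines)

# Zero-shot: the instruction alone specifies the task.
print(build_prompt("translate english to french:", [], "the cat sat on the mat"))

# With demonstrations placed in the context, the same interface becomes few-shot.
print(build_prompt("translate english to french:",
                   [("sea otter", "loutre de mer")],
                   "cheese"))
```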
Model Architecture
- GPT-2 has 1.5 billion parameters [GPT-1 (117M parameters)]
- 48 layers, 1600-dimensional embeddings
- Larger vocabulary of 50,257 tokens
- Larger batch size of 512 and larger context window of 1024 tokens
- Layer normalisation was moved to the input of each sub-block, and an additional layer normalisation was added after the final self-attention block
- At initialisation, the weights of residual layers were scaled by 1/√N, where N is the number of residual layers (see the sketch after this list)
- The 1/√N scaling adjusts the initialisation of the residual layers according to the depth of the network.
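A rough PyTorch-style sketch of these two tweaks (pre-layer-norm and 1/√N residual scaling at initialisation). Module names and sizes are illustrative, not GPT-2's actual code:

```python
import math
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Transformer block with layer norm moved to the *input* of each sub-block."""
    def __init__(self, d_model=768, n_heads=12, n_residual_layers=24):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))
        # Scale the weights of layers feeding the residual stream by 1/sqrt(N),
        # where N is the number of residual layers in the whole network.
        scale = 1.0 / math.sqrt(n_residual_layers)
        with torch.no_grad():
            self.attn.out_proj.weight.mul_(scale)
            self.mlp[-1].weight.mul_(scale)

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)                      # pre-LN before self-attention
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))        # pre-LN before the MLP
        return x
```

Scaling the output projections of the residual branches keeps the magnitude of the residual stream roughly constant with depth at initialisation.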
Dataset & Experiment
- WebText: 40GB of text from over 8 million documents (Wikipedia pages were removed)
- In French to English translation task, GPT-2 performed better than most unsupervised models in zero shot setting but did not outperform the SOTA unsupervised model
- GPT-2 could not perform well on text summarization; its performance was similar to or worse than that of classic models trained for summarization
Generalization vs Memorization
- Recent computer-vision research has shown that image datasets often contain near-duplicate images; for example, CIFAR-10 has 3.3% overlap between its training and test sets
- Such overlap leads to over-reporting of the generalization performance of machine learning systems
- Bloom filters containing 8-grams of WebText training-set tokens are used to measure overlap with evaluation sets (a simplified sketch follows this list)
- The paper recommends n-gram-overlap-based de-duplication as an important verification step and sanity check when creating training and test splits for new NLP datasets
- Performance on the training and test sets is similar and improves together as model size increases → GPT-2 is still underfitting WebText in many ways
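A simplified sketch of the 8-gram overlap check; a plain Python set stands in for the Bloom filter used in the paper, and the helper names are mine:

```python
def ngrams(tokens, n=8):
    """All n-grams of a token sequence, as tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_rate(train_docs, test_doc, n=8):
    """Fraction of the test document's n-grams that also occur in the training data.
    GPT-2 stores the training n-grams in a Bloom filter; an exact set is used here
    purely for illustration (it is memory-hungry but equivalent in spirit)."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc.lower().split(), n)
    test_grams = ngrams(test_doc.lower().split(), n)
    if not test_grams:
        return 0.0
    return len(test_grams & train_grams) / len(test_grams)
```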
Summary
- achieved SOTA results on 7 out of 8 tested language modelling datasets in the zero-shot setting
- larger dataset & more parameters improved the capability of LM to understand tasks
- How to use:
GPT-3
Introduction
- Limitation of previous work: although task-agnostic in architecture, there is still a need for task-specific datasets and task-specific fine-tuning
- BERT-style approaches:
  ① excessive reliance on supervised data in the field
  ② overfitting to the training data distribution
  ⇒ Focus on a more general NLP model
  ⇒ Less supervised data, no fine-tuning
Concepts
- In-context learning: large language models develop pattern recognition and other skills from the text data they are trained on, and then use these abilities at inference time to adapt to a task from the examples given in the context, without any gradient updates (see the sketch after this list)
- Few-shot, one-shot and zero-shot settings: performance in all three settings improves as model capacity increases
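A sketch contrasting in-context learning with fine-tuning: the K demonstrations only enter through the context window and the weights stay frozen. `lm_generate` is a hypothetical stand-in for running the model, not a real API:

```python
def few_shot_answer(lm_generate, instruction, demonstrations, query, k=3):
    """In-context learning: no gradient updates; the task is specified purely by
    the instruction and the k demonstrations placed in the context window.
    k = 0 gives the zero-shot setting, k = 1 the one-shot setting."""
    context = instruction + "\n"
    for x, y in demonstrations[:k]:
        context += f"Q: {x}\nA: {y}\n"
    context += f"Q: {query}\nA:"
    return lm_generate(context)      # forward pass only, model weights frozen

# Example usage with a dummy "model" that just reports its prompt length.
print(few_shot_answer(lambda ctx: f"<{len(ctx)} chars of context>",
                      "Answer the arithmetic question.",
                      [("2+2", "4"), ("3+5", "8")],
                      "7+6", k=2))
```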
Model and Implementation details
- GPT-3 has 96 layers with each layer having 96 attention heads.
- Size of word embeddings: 1600 for GPT-2 → 12,288 for GPT-3
- Context window size: 1024 for GPT-2 → 2048 tokens for GPT-3
- Alternating dense and locally banded sparse attention patterns
- Sparse attention: no details are given in the GPT-3 paper; see "Generating Long Sequences with Sparse Transformers". A Sparse Transformer attends only to the k states that contribute most: by explicitly restricting attention to a few elements, values that are not highly relevant to the query are effectively zeroed out, in contrast to conventional (dense) attention. (A sketch of a local banded causal mask follows this list.)
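A minimal sketch of what a "locally banded" causal attention mask looks like (NumPy, with an illustrative window size; this is not the actual Sparse Transformer kernel):

```python
import numpy as np

def local_banded_causal_mask(seq_len, window):
    """mask[i, j] is True when position i may attend to position j:
    only previous positions (causal) within a local band of `window` tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = local_banded_causal_mask(seq_len=8, window=3)
print(mask.astype(int))
# In the attention computation, scores at masked-out positions are set to -inf
# before the softmax, so those values receive (near-)zero weight.
```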
Experiment
- Dataset (45TB): trained on a mix of 5 different corpora, each assigned a sampling weight; high-quality datasets were sampled more often, and the model trained on them for more than 1 epoch (a weighted-sampling sketch follows this list)
- downloaded and filtered a version of CommonCrawl
- fuzzy deduplication
- added high-quality reference corpora
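A toy sketch of weighted corpus mixing. The corpus names are the ones listed in the GPT-3 paper, but the weights here are illustrative, not the paper's exact mixture values:

```python
import random

# Illustrative sampling weights (the real mixture weights are given in the paper).
corpora = {"CommonCrawl": 0.60, "WebText2": 0.22, "Books1": 0.08,
           "Books2": 0.08, "Wikipedia": 0.02}

def sample_corpus(rng):
    """Pick which corpus the next training document comes from, so that
    higher-weighted (higher-quality) corpora are sampled more often."""
    names, weights = zip(*corpora.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in corpora}
for _ in range(10_000):
    counts[sample_corpus(rng)] += 1
print(counts)   # roughly proportional to the weights
```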
Discussion & Broader Impacts
- loses coherence when formulating long passages and tends to repeat sequences
- does not perform very well on tasks such as fill-in-the-blank and some reading-comprehension tasks
- possibly due to the model's unidirectional (autoregressive, left-to-right) nature?
- lacks the notion of task- or goal-oriented prediction of tokens; the paper suggests:
  - augmenting the learning objective, using reinforcement learning to fine-tune models, etc.
- complex & costly, heavy architecture, less interpretability
- misuse of its human-like text generating capability for phishing, spamming, spreading misinformation
- gender, ethnicity, race & religion bias
Conclusion
- BIGGER
- Under the few-shot setting, it surpasses the current fine-tuning SOTA on some NLU tasks
- performs well on downstream NLP tasks in zero-shot and few-shot settings: writing articles, adding up numbers, writing code, etc.
- Most impressive: generalization
A few GPT-3 demos: