Paper Notes [4]: GPT-1 to GPT-3, an Overview and Comparison
Papers:
Improving Language Understanding by Generative Pre-Training
Language Models are Unsupervised Multitask Learners
Language Models are Few-Shot Learners
Paper links: GPT-1, GPT-2, GPT-3
Team: OpenAI
GPT-1
Motivation & Objectives
- most SOTA NLP models were trained specifically for a particular task
- Limitations:
① need a large amount of annotated data, which is not easily available
② fail to generalize
- 2 Challenges:
① which optimization objectives are most effective?
② what is the most efficient way to transfer learned representations to the target task?
⇒ A semi-supervised approach: a combination of unsupervised pre-training and supervised fine-tuning
⇒ Goal: learn a universal representation that transfers with little adaptation to a wide range of tasks
- Approach: learn a generative language model on unlabeled data, then fine-tune the model by providing examples of specific downstream tasks
- On challenge ①: it is unclear which optimization objectives are most useful for learning representations that transfer well
- On challenge ②: there is no consensus on how best to transfer the learned representations to the target task (existing methods combine task-specific changes to the model architecture, intricate learning schemes, and auxiliary learning objectives)
Framework
Unsupervised Language Modeling (Pre-training):
- multi-layer Transformer decoder (N = 12)
- NO Encoder-Decoder Attention Layer
- The encoder-decoder attention layer is removed, leaving a stack of decoder blocks as the model body; the decoder output is passed through a softmax layer to produce the output distribution over target tokens (a minimal sketch of one such block follows below).
- $h_n$ is the hidden state produced by the final transformer block; multiplying it by the transpose of the token embedding matrix $W_e$ and applying softmax, $P(u) = \text{softmax}(h_n W_e^T)$, yields the output distribution over the vocabulary. After pre-training, GPT has stored the semantic and syntactic information learned from the corpus.
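To make the decoder-only structure concrete, here is a minimal PyTorch sketch (my own illustration under the GPT-1 hyperparameters listed in the Experiment section, not OpenAI's code) of one block: masked self-attention plus a position-wise feed-forward network, with no encoder-decoder attention and post-layer-norm as in GPT-1.

```python
# Minimal sketch of one GPT-style block: a Transformer decoder layer with the
# encoder-decoder (cross) attention removed, so only masked self-attention and
# the position-wise feed-forward network remain. Dimensions follow the GPT-1
# configuration (768-d states, 12 heads, 3072-d FFN).
import torch
import torch.nn as nn

class GPTBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model), nn.Dropout(dropout)
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: each position may only attend to itself and earlier positions.
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=causal)  # masked self-attention only,
        x = self.ln1(x + a)                          # no cross-attention to an encoder
        x = self.ln2(x + self.ff(x))                 # post-layer-norm, as in GPT-1
        return x

x = torch.randn(2, 16, 768)   # (batch, sequence, d_model)
y = GPTBlock()(x)
print(y.shape)                # torch.Size([2, 16, 768])
```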
Supervised Fine-Tuning
- objective to maximize: the supervised objective, with the LM objective added as an auxiliary term (see the formulas below)
- Advantages:
① improving generalization of the supervised model
② accelerating convergence
- $\lambda$ is a hyperparameter, set to 0.5
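For reference, the objectives as written in the GPT-1 paper: the unsupervised LM objective $L_1$ over the token corpus $\mathcal{U}$, the supervised objective $L_2$ over the labelled dataset $\mathcal{C}$, and the combined fine-tuning objective $L_3$ with the auxiliary LM term weighted by $\lambda$:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$

$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m), \qquad L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$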
Experiment
- Dataset: BooksCorpus (7,000 unpublished books, i.e. unseen data) containing long stretches of contiguous text, which helped the model learn long-range dependencies
- Unsupervised Training:
- Byte Pair Encoding (BPE) vocabulary with 40,000 merges was used
- 768-dimensional state for encoding tokens
- 12 layered model, 12 attention heads
- position-wise feed forward layer: 3072-dimensional
- 117M parameters
- Supervised Fine-tuning:
- 3 epochs for most of the downstream tasks → already learnt a lot during pre-training. Thus, minimal fine-tuning was enough
- Most of the hyperparameters from unsupervised pre-training were used for fine-tuning
- Works well across datasets of different sizes, from smaller datasets such as STS-B (5.7k training examples) to the largest one, SNLI (550k training examples)
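As a quick check of the configuration above, the released GPT-1 checkpoint can be inspected through the Hugging Face transformers library (a sketch, assuming the library is installed; model id `openai-gpt`):

```python
# Sketch: load the released GPT-1 checkpoint via Hugging Face transformers
# and verify the configuration listed above.
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")

print(model.config.n_layer, model.config.n_head, model.config.n_embd)   # 12, 12, 768
print(len(tokenizer))                                                   # ~40k BPE vocabulary
print(sum(p.numel() for p in model.parameters()) / 1e6)                 # ~117M parameters
```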
Discussion
- Impact of the number of layers transferred
- Evolution of zero-shot performance on different tasks as a function of LM pre-training updates
- The more transformer blocks (i.e. the deeper the language model), the better the performance, which suggests that each layer of the language model indeed learns something different
- Even without fine-tuning, more pre-training updates lead to better downstream performance, which suggests that LM pre-training really does learn general-purpose features
- Ablation studies:
① larger datasets benefit from the auxiliary objective but smaller datasets do not
② 5.6 average score drop when using an LSTM instead of the Transformer (single-layer 2048-unit LSTM)
③ lack of pre-training hurts performance across all the tasks, resulting in a 14.8% decrease
Conclusion
- GPT-1 performed SOTA in 9 out of 12 tasks
- decent zero-shot performance on various tasks
- GPT-1 proved that language modelling serves as an effective pre-training objective. The architecture facilitated transfer learning and could perform various NLP tasks with very little fine-tuning.
GPT-2
Main Idea
- Learning Objectives & Concepts
- learning multiple tasks using the same unsupervised model (no supervised fine-tuning)
- objective: P(output|input) → P(output|input, task) [task conditioning]
- Zero Shot Learning and Zero Shot Task Transfer
- no examples are provided; the model understands the task based on the given instruction
- E.g. (translate to french, english text, french text)
- LM = Unsupervised Multitask Learning
- the supervised output is a subset of the language model sequence
- E.g. 1: “The translation of the word Machine Learning in Chinese is 機器學習.”
- E.g. 2: “The President of the United States is Trump.”
- Compared with supervised multitask learning, a language model simply does not need to explicitly define which fields are the outputs to be predicted; in effect, the supervised output is just a subset of the language-model sequence (see the prompting sketch below)
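A hedged sketch of this idea in practice, using the released GPT-2 weights through the Hugging Face transformers pipeline (assuming it is installed): the task is conveyed purely by the wording of the prompt, with no fine-tuning.

```python
# Sketch: zero-shot task conditioning with GPT-2. The task is specified only
# through the wording of the prompt, in the spirit of P(output | input, task);
# no gradient updates or fine-tuning are involved.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = [
    "The translation of the word 'Machine Learning' in Chinese is",   # translation-style prompt
    "Q: Who wrote the novel Nineteen Eighty-Four?\nA:",               # QA-style prompt
]
for p in prompts:
    out = generator(p, max_new_tokens=20, do_sample=False)
    print(out[0]["generated_text"])
```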
Model Architecture
- GPT-2 has 1.5 billion parameters [GPT-1 (117M parameters)]
- 48 layers, 1600-dimension
- Larger vocabulary of 50,257 tokens
- Larger batch size of 512 and larger context window of 1024 tokens
- Layer normalisation was moved to input of each sub-block and an additional layer normalisation was added after final self-attention block
- At initialisation, the weight of residual layers was scaled by 1/√N, where N was the number of residual layers
The 1/√N scaling is used because the residual-layer initialization is adjusted according to network depth (a sketch of this follows below).
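A minimal sketch of that initialization tweak, based on my reading of the paper's one-sentence description (the paper does not spell out whether N counts blocks or individual residual connections):

```python
# Sketch: scale residual-projection weights by 1/sqrt(N) at initialization,
# where N is the number of residual layers (interpretation: here N is taken as
# the number of transformer blocks in the largest GPT-2 model, 48).
import math
import torch.nn as nn

n_residual_layers = 48
scale = 1.0 / math.sqrt(n_residual_layers)

def init_residual_proj(linear: nn.Linear) -> None:
    nn.init.normal_(linear.weight, mean=0.0, std=0.02)   # standard GPT-style init
    linear.weight.data.mul_(scale)                        # depth-dependent down-scaling
    nn.init.zeros_(linear.bias)

proj = nn.Linear(1600, 1600)   # an output projection feeding a residual addition
init_residual_proj(proj)
```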
Dataset & Experiment
- WebText: 40GB of text from over 8 million documents (Wikipedia removed)
- In French to English translation task, GPT-2 performed better than most unsupervised models in zero shot setting but did not outperform the SOTA unsupervised model
- GPT-2 could not perform well on text summarization and its performance was similar or lesser than classic models trained for summarization
Generalization vs Memorization
Recent computer-vision research has shown that image datasets often contain near-duplicate images; CIFAR-10, for example, has 3.3% overlap between its training and test sets, which leads to an over-estimate of the generalization performance of machine-learning systems.
- Overlapping → over-reporting of the generalization performance of machine learning systems
- Bloom filters containing 8-grams of WebText training set tokens
- recommend the use of n-gram overlap based de-duplication as an important verification step and sanity check during the creation of training and test splits for new NLP datasets.
- performance on training and test are similar and improve together as model size is increased → underfitting on WebText in many ways
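A simplified sketch of such an n-gram overlap check; a plain Python set stands in for the Bloom filters used in the paper (a Bloom filter gives the same kind of answer with far less memory, at the cost of a small false-positive rate):

```python
# Sketch of an 8-gram overlap / de-duplication check between a training corpus
# and a test document, with a plain set standing in for a Bloom filter.
def ngrams(text: str, n: int = 8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(train_docs, test_doc, n=8):
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)
    test_ngrams = ngrams(test_doc, n)
    if not test_ngrams:
        return 0.0
    return len(test_ngrams & train_ngrams) / len(test_ngrams)

train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
test = "a quick brown fox jumps over the lazy dog near the river bank every day"
print(f"{overlap_fraction(train, test):.2%} of test 8-grams also appear in the training data")
```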
Summary
- achieve SOTA results on 7 out of 8 tested language modelling datasets in zero-shot
- larger dataset & more parameters improved the capability of LM to understand tasks
- How to use: the task is specified purely through a natural-language prompt in a zero-shot setting; no fine-tuning is performed
GPT-3
Introduction
- Limitation: although task-agnostic, there is still a need for task-specific datasets and fine-tuning
- BERT, etc.:
① excessive reliance on supervised data in the field
② overfitting to the data distribution
⇒ focus on a more general NLP model
⇒ less supervised data, no fine-tuning
Concepts
- In-context learning: Large language models develop pattern recognition and other skills using the text data they are trained on.
- Few-shot, one-shot and zero-shot settings: capability improves as model capacity increases (a prompt-construction sketch follows below)
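The three settings differ only in how many worked examples are packed into the prompt; a small sketch of how such a prompt could be assembled (the translation examples follow the figure in the GPT-3 paper, the prompt format itself is illustrative):

```python
# Sketch: building zero-, one- and few-shot prompts for in-context learning.
# No gradient updates are made; the "training" examples live only in the prompt.
def build_prompt(task_description, examples, query, k):
    """k = 0 (zero-shot), 1 (one-shot) or more (few-shot)."""
    lines = [task_description]
    for src, tgt in examples[:k]:          # k in-context demonstrations
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")            # the model continues from here
    return "\n".join(lines)

examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
print(build_prompt("Translate English to French:", examples, "peppermint", k=2))
```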
Model and Implementation details
- GPT-3 has 96 layers with each layer having 96 attention heads.
- Size of word embeddings: 1600 for GPT-2 → 12288 for GPT-3
- Context window size: 1024 for GPT-2 → 2048 tokens for GPT-3
- Alternating dense and locally banded sparse attention patterns
- sparse attention: alternating dense and locally banded sparse attention patterns are used (the paper gives no details; see Generating Long Sequences with Sparse Transformers). A Sparse Transformer attends only to the k states that contribute most: by explicitly selecting a small number of elements, values that are not highly relevant to the query are zeroed out, in contrast to conventional full attention (a toy masking sketch follows below)
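A toy sketch of what a locally banded causal mask looks like (a simplification of the Sparse Transformer's strided/fixed patterns, which are implemented inside fused kernels rather than as explicit masks):

```python
# Toy sketch: a causal attention mask restricted to a local band of width `band`,
# so each query attends only to the most recent positions instead of all of them.
# Alternating such banded layers with dense causal layers gives the
# "alternating dense and locally banded sparse" pattern mentioned above.
import torch

def banded_causal_mask(seq_len: int, band: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i                          # no attending to future positions
    local = (i - j) < band                   # only the last `band` positions
    return causal & local                    # True = attention allowed

print(banded_causal_mask(6, band=3).int())
```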
Experiment
- Dataset (45TB raw): trained on a mix of 5 different corpora, each with a weight assigned to it; high-quality datasets were sampled more often and were seen more than once during training
- downloaded and filtered a version of CommonCrawl
- fuzzy deduplication
- added high-quality reference corpora
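A hedged sketch of the weighted-mixture sampling idea; the corpus names and weights below are illustrative placeholders rather than the paper's exact figures:

```python
# Sketch: sampling training documents from a weighted mixture of corpora, so
# that smaller, higher-quality corpora are seen more often (possibly more than
# once) while the largest corpus is sampled less than proportionally.
# Weights are illustrative placeholders, not the paper's exact values.
import random

corpora = {
    "common_crawl": 0.60,
    "webtext2":     0.22,
    "books":        0.15,
    "wikipedia":    0.03,
}

def sample_corpus(rng: random.Random) -> str:
    names, weights = zip(*corpora.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in corpora}
for _ in range(10_000):
    counts[sample_corpus(rng)] += 1
print(counts)   # counts are roughly proportional to the mixture weights
```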
Discussion & Broader Impacts
- loses coherence when formulating long passages and tends to repeat sequences
- does not perform very well on tasks such as fill-in-the-blanks and some reading-comprehension tasks
- Unidirectionality?
- lacks the notion of task or goal-oriented prediction of tokens, suggests:
- augmentation of learning objective, use of reinforcement learning to fine tune models, etc.
- complex & costly, heavy architecture, less interpretability
- misuse of its human-like text generating capability for phishing, spamming, spreading misinformation
- gender, ethnicity, race & religion bias
Conclusion
- BIGGER
- under the few-shot setting, it surpasses the current fine-tuning SOTA on some NLU tasks
- performs well on downstream NLP tasks in the zero-shot and few-shot settings: writing articles, adding numbers, writing code, etc.
- most impressive: generalization