[Paper Reading Notes] Transformer: "Attention Is All You Need"

Posted by KeanShi on 2024-11-08

Original URL: https://www.cnblogs.com/keanshi/p/18535710

Paper: https://arxiv.org/pdf/1706.03762
Code: https://github.com/huggingface/transformers

Introduction
Background
Model Architecture
- Encoder
  - LN and BN
- Decoder
- Attention
- Multi-head Attention
- Feed-Forward
- Position Encoding
Why self-attention

Introduction

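Limitations of RNNs and LSTMs on sequential data: computation cannot be parallelized across time steps, and part of the earlier history is lost by later steps
Encoder-decoder structure
Proposed Transformer: built purely on attention mechanisms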

Background

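Replacing RNNs with CNNs: a CNN struggles to model long-range dependencies in a sequence, because relating distant positions takes many stacked layers; self-attention solves this by relating all positions in one layer. A CNN can have multiple output channels, which motivates multi-head attention
Memory networks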

Model Architecture

Figure 1: Transformer architecture

Encoder: $\text{input}(x_1,...,x_n) \rightarrow \text{output: } \boldsymbol{z}=(z_1,...,z_n) $, where $z_t\ (1 \le t \le n)$ is the vector representation of $x_t$
Decoder: $\text{input}(z_1,...,z_n) \rightarrow \text{output: } \boldsymbol{y}=(y_1,...,y_m) $: outputs are generated one at a time (auto-regressive), and the output of the previous step is the input of the next step
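
The one-at-a-time generation can be sketched as a greedy loop. This is only an illustration: `greedy_decode`, `next_token`, `bos`, `eos` and `max_len` are hypothetical names, and `toy_next_token` is a stand-in for a real decoder, not anything from the paper or the linked repository.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    memory: Sequence[float],
    next_token: Callable[[Sequence[float], List[int]], int],
    bos: int,
    eos: int,
    max_len: int = 10,
) -> List[int]:
    """Generate tokens one by one, feeding every output back in as input."""
    ys = [bos]
    for _ in range(max_len):
        # The next token depends on the encoder output z (memory)
        # and on everything generated so far.
        y = next_token(memory, ys)
        ys.append(y)
        if y == eos:
            break
    return ys[1:]  # drop the start symbol


def toy_next_token(memory: Sequence[float], prefix: List[int]) -> int:
    """Hypothetical decoder step: emits last token + 1, capped at 3."""
    return min(prefix[-1] + 1, 3)


print(greedy_decode([0.1, 0.2], toy_next_token, bos=0, eos=3))  # [1, 2, 3]
```

A real decoder would replace `toy_next_token` with a forward pass followed by an argmax (or sampling) over the vocabulary.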

Encoder

Each layer has 2 sub-layers, each wrapped as $LayerNorm(x+Sublayer(x))$; every sub-layer outputs 512 dimensions
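
A minimal sketch of the residual-plus-normalization wrapper, assuming a single token vector stored as a Python list. The learned gain and bias of layer normalization are left out for brevity, and `sublayer_connection` is an illustrative name.

```python
import math
from typing import Callable, List


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize one vector to zero mean and unit variance (no gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer_connection(
    x: List[float], sublayer: Callable[[List[float]], List[float]]
) -> List[float]:
    """Compute LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


# Toy sub-layer that doubles its input; a real one is attention or the FFN.
out = sublayer_connection([1.0, 2.0, 3.0, 4.0], lambda v: [2.0 * t for t in v])
print(out)
```

Because the residual adds `x` to `Sublayer(x)`, both must have the same width, which is why every sub-layer keeps the same output dimension.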

LN and BN

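If sequence lengths differ across samples, BN is the less stable of the two.

At prediction time

BN: has to keep a global $\mu$ and $\sigma$; if a sequence is much longer than anything seen in training, the stored $\mu$ and $\sigma$ no longer fit
LN: statistics are computed inside each sample, so it does not depend on global statistics and is barely affected by length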
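
A toy comparison of the two at prediction time. It is deliberately simplified: a shift in values stands in for "input unlike anything seen in training", the training values are made up, and learned scale/shift parameters are omitted.

```python
import math
from statistics import fmean, pvariance


def batch_norm_predict(x, running_mean, running_var, eps=1e-5):
    """BN at prediction time: reuse the statistics stored during training."""
    return [(v - running_mean) / math.sqrt(running_var + eps) for v in x]


def layer_norm(x, eps=1e-5):
    """LN: mean and variance come from the sample itself."""
    mu, var = fmean(x), pvariance(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


train_values = [0.0, 1.0, 2.0, 1.0]  # hypothetical values seen in training
running_mean, running_var = fmean(train_values), pvariance(train_values)

unseen = [10.0, 12.0, 14.0]  # hypothetical sample unlike the training data
bn_out = batch_norm_predict(unseen, running_mean, running_var)
ln_out = layer_norm(unseen)

print(fmean(bn_out))  # far from 0: the stored statistics do not fit
print(fmean(ln_out))  # about 0: LN re-centers each sample on its own
```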

Decoder

Each layer has 3 sub-layers, each wrapped as $LayerNorm(x+Sublayer(x))$; decoding is auto-regressive

Attention

Figure 2: dot-product attention structure

Here $Q,K$ have dimension $d_k$ and $V$ has dimension $d_v$. The attention formula:

\[\text{Attention}(Q,K,V)=\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V \]

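Why scaled dot-product attention?

Dot products are simple: the whole computation is two matrix multiplications, which parallelize well
When $d_k$ is large, the entries of $QK^T$ grow large in magnitude, so an unscaled softmax saturates toward a near one-hot distribution where gradients are tiny and convergence is slow; dividing by $\sqrt{d_k}$ counteracts this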
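
A small sketch of the attention formula using plain Python lists in place of a tensor library. The helper names `softmax`, `matmul` and `attention` are illustrative, and the inputs are made-up toy values.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """softmax(Q K^T / sqrt(d_k)) V, with one softmax per query row."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # first matrix multiplication
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)  # second matrix multiplication


q = [[1.0, 0.0]]               # one query
k = [[1.0, 0.0], [0.0, 1.0]]   # two keys
v = [[1.0, 2.0], [3.0, 4.0]]   # two values
print(attention(q, k, v))
```

Each output row is a weighted average of the rows of `V`, with weights that sum to 1.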

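Mask operation

After computing $QK^T$, the scores for positions after $t$ are replaced with a very large negative number (e.g. $-1e10$), so their weights are 0 after the softmax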
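
The masking step can be sketched on a small score matrix. The scores are made up, and `masked_weights` is an illustrative name; row $t$ is the query at position $t$.

```python
import math

NEG_INF = -1e10  # the very large negative number used for masked positions


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_weights(scores):
    """Row t keeps columns 0..t; later columns are masked before softmax."""
    masked = [
        [s if j <= t else NEG_INF for j, s in enumerate(row)]
        for t, row in enumerate(scores)
    ]
    return [softmax(row) for row in masked]


scores = [
    [0.5, 1.0, 2.0],
    [0.1, 0.2, 0.3],
    [1.0, 1.0, 1.0],
]
for row in masked_weights(scores):
    print([round(w, 3) for w in row])
```

The masked entries underflow to exactly 0 in the exponential, so each position attends only to itself and earlier positions.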

Multi-head Attention

Each head can learn a different projection and therefore different features. The formula:

\[\text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,...,\text{head}_h)W^O \]

\[\text{where} \ \ \ \text{head}_i=\text{Attention}(QW_i^Q,KW_i^K,VW_i^V) \]
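
A compact sketch of the multi-head formula with plain Python lists. The sizes are toy assumptions (model width 4, two heads of width 2), and the projection matrices are hand-picked selectors rather than learned weights.

```python
import math


def matmul(a, b):
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention."""
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(q, k, v, heads, w_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per head."""
    outputs = [
        attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
        for wq, wk, wv in heads
    ]
    # Concat: join the head outputs along the feature axis, row by row.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(q))]
    return matmul(concat, w_o)


first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]   # picks dims 0-1
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # picks dims 2-3
heads = [(first_half,) * 3, (second_half,) * 3]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

x = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]  # 2 tokens, width 4
print(multi_head(x, x, x, heads, identity))
```

Each head works in a lower-dimensional subspace, and the final projection maps the concatenated result back to the model width.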

Feed-Forward

\[\text{FFN}(x)=\max(0,xW_1+b_1)W_2+b_2 \]
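
The feed-forward formula applied to a single token vector, as a sketch with made-up toy weights (width 2, hidden width 3). `ffn` is an illustrative name; the weight matrices are stored as lists of rows.

```python
def ffn(x, w1, b1, w2, b2):
    """Compute max(0, x W1 + b1) W2 + b2 for one token vector x."""
    hidden = [
        max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)  # ReLU
        for col, b in zip(zip(*w1), b1)
    ]
    return [
        sum(h * w for h, w in zip(hidden, col)) + b
        for col, b in zip(zip(*w2), b2)
    ]


x = [1.0, -2.0]
w1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]    # shape 2 x 3
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # shape 3 x 2
b2 = [0.5, -0.5]
print(ffn(x, w1, b1, w2, b2))  # [1.5, 1.5]
```

The same function is applied to every position independently, so positions do not interact in this sub-layer.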

Position Encoding

Adds sequence-order information:

\[\text{PE}(pos,2i)=\sin(pos/10000^{2i/d}) \]

\[\text{PE}(pos,2i+1)=\cos(pos/10000^{2i/d}) \]
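
A direct transcription of the two formulas, as a sketch. `positional_encoding` and `max_len` are illustrative names; the loop variable `two_i` plays the role of the even index $2i$.

```python
import math


def positional_encoding(max_len, d):
    """Return a max_len x d table of sinusoidal position encodings."""
    pe = [[0.0] * d for _ in range(max_len)]
    for pos in range(max_len):
        for two_i in range(0, d, 2):
            angle = pos / (10000 ** (two_i / d))
            pe[pos][two_i] = math.sin(angle)          # even dimension
            if two_i + 1 < d:
                pe[pos][two_i + 1] = math.cos(angle)  # odd dimension
    return pe


for row in positional_encoding(max_len=3, d=4):
    print([round(v, 4) for v in row])
```

The resulting table has the same width as the token embeddings, so the two can be added element-wise.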

Why self-attention
