Attention and Self-Attention

Posted by iSherryZhang on 2023-03-17

Original article: https://www.cnblogs.com/shuezhang/p/17228420.html

Seq2Seq + Attention

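A Seq2Seq model consists of an Encoder and a Decoder. The Encoder's final state \(h_m\) is assumed to carry the information of the whole sentence, and it serves as the Decoder's initial state \(s_0\) for the entire text-generation process. This has a serious problem: the final state cannot remember a long sequence, so information is forgotten and the Decoder never receives it.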

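With a conventional Seq2Seq model, the BLEU score (a machine translation evaluation metric) drops once a sentence exceeds 20 words. With Attention, as the red curve in the original figure shows, accuracy stays high even when the input sequence is very long.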

The original paper that applied Attention to machine translation is: Bahdanau, Cho, & Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

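Attention greatly improves the accuracy of Seq2Seq models. With Attention, every time the Decoder updates its state it looks at all of the Encoder's states, so nothing is forgotten. Attention also tells the Decoder which Encoder state to focus on, which is where the name comes from. Its major drawback is the heavy computation.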

Attention tremendously improves Seq2Seq models.
With attention, a Seq2Seq model does not forget the source input.
With attention, the decoder knows where to focus.
Downside: much more computation.

How Attention Works

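Attention uses a context vector \(c_i\) to combine the information in \(h_1, h_2, ..., h_m\), which is how the Attention mechanism addresses the forgetting problem of LSTMs.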

\(c_0 = \alpha_1h_1 + \alpha_2h_2 + ... + \alpha_mh_m\), where \(\alpha_i\) is the relevance between \(h_i\) and \(s_0\), called the weight.
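As a minimal sketch of the weighted sum above, the following Python snippet builds \(c_0\) from three made-up 2-dimensional Encoder states. The function name `context_vector` and all numeric values are illustrative assumptions, not taken from the original article.

```python
def context_vector(weights, states):
    """Return sum_i weights[i] * states[i] for equal-length state vectors."""
    dim = len(states[0])
    return [sum(a * h[k] for a, h in zip(weights, states)) for k in range(dim)]

# Hypothetical Encoder states h_1..h_3 and weights alpha_1..alpha_3 that sum to 1.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = [0.5, 0.25, 0.25]

c0 = context_vector(alpha, H)
print(c0)  # [0.75, 0.5]
```

Because the weights sum to 1, each component of \(c_0\) is a weighted average of the corresponding components of the Encoder states.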

There are two ways to compute this relevance:

Method 1 (used in the original paper)

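To get the relevance between \(h_i\) and \(s_0\), concatenate \(h_i\) and \(s_0\), multiply by a parameter matrix \(W\), apply \(\tanh\) to squash the result into (-1, 1), and then multiply by \(v^T\). Finally, apply Softmax across the m resulting scores.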
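The steps of Method 1 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names (`softmax`, `matvec`, `additive_weights`), the dimensions, and every numeric value of `W`, `v`, `H`, and `s0` are made up here; in a real model \(W\) and \(v\) are learned from training data.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_weights(H, s0, W, v):
    """alpha_i = softmax_i( v^T tanh(W [h_i; s0]) )."""
    scores = []
    for h in H:
        z = [math.tanh(u) for u in matvec(W, h + s0)]  # concatenate, project, squash
        scores.append(sum(vi * zi for vi, zi in zip(v, z)))
    return softmax(scores)

# Hypothetical values: three 2-d Encoder states, a 2-d Decoder state,
# a 3x4 matrix W and a 3-d vector v.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s0 = [0.5, -0.5]
W = [[0.1, 0.2, 0.3, 0.4], [-0.4, 0.3, -0.2, 0.1], [0.2, -0.1, 0.0, 0.5]]
v = [1.0, -1.0, 0.5]

alpha = additive_weights(H, s0, W, v)
print(alpha)  # one positive weight per Encoder state; the weights sum to 1
```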

Method 2 (more popular; the same as in the Transformer)

The relevance between \(h_i\) and \(s_0\) is computed in three steps:

Linear maps
- \(k_i = W_K · h_i\)
- \(q_0 = W_Q · s_0\)
Inner product
- \(\widetilde{\alpha_i} = k^T_{i}q_0\)
Normalization
- \([\alpha_1, ..., \alpha_m] = Softmax([\widetilde{\alpha_1}, ..., \widetilde{\alpha_m}])\)
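The three steps above (linear maps, inner product, normalization) can be sketched in plain Python. The helper names and all numeric values are assumptions for illustration; \(W_K\) and \(W_Q\) are set to the identity matrix here only to keep the arithmetic easy to follow, whereas in practice they are learned.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot_product_weights(H, s0, W_K, W_Q):
    q0 = matvec(W_Q, s0)                          # linear map: q_0 = W_Q s_0
    keys = [matvec(W_K, h) for h in H]            # linear map: k_i = W_K h_i
    scores = [sum(k * q for k, q in zip(key, q0)) for key in keys]  # k_i^T q_0
    return softmax(scores)                        # normalization

# Hypothetical values; identity projections chosen for readability.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s0 = [1.0, 0.0]
W_K = [[1.0, 0.0], [0.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]

alpha = dot_product_weights(H, s0, W_K, W_Q)
print(alpha)
```

With these values the raw scores are 1, 0, and 1, so the first and third Encoder states receive equal weight and the second receives less.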

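Once \(c_0\) has been computed, the three inputs to \(A'\) are concatenated and used to produce the state \(s_1\). Each state \(s_i\) has its own context vector \(c_i\), which captures the relevance between \(s_i\) and \(H\).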

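Suppose the Encoder has m steps and the Decoder has t steps. Each of the t Decoder states needs its own context vector, and each context vector requires m weights \(\alpha\), so mt weights are computed in total. The time complexity of Attention is therefore mt, the product of the number of Encoder states and the number of Decoder states.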
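To make the mt count concrete, the sketch below builds the full table of weights for a hypothetical Encoder with m = 4 states and a Decoder with t = 3 states. For brevity it assumes plain dot-product scores without the \(W_K\) and \(W_Q\) projections; the function name `attention_matrix` and all values are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(S, H):
    """Return one row of len(H) weights for each Decoder state in S."""
    return [
        softmax([sum(a * b for a, b in zip(s, h)) for h in H])
        for s in S
    ]

# Hypothetical states: m = 4 Encoder states, t = 3 Decoder states.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
S = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights = attention_matrix(S, H)
num_weights = sum(len(row) for row in weights)
print(num_weights)  # 12, i.e. m * t
```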

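A visualization of Attention on a machine translation task shows that each Decoder state is related to every Encoder state, but focuses mainly on one or a few of them.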

Summary

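Advantages:

Standard Seq2Seq model: the decoder only looks at its current state
Attention: the decoder also looks at all of the encoder's states, which solves the forgetting problem and tells the decoder where to focus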

Disadvantage: high time complexity (assuming the source sequence has length m and the target sequence has length t)

Standard Seq2Seq: \(O(m + t)\)
Seq2Seq + attention: \(O(mt)\)

Self Attention

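In a plain RNN, \(h_5\) is computed from \(h_4\) and \(x_5\). With the self-attention mechanism, the computation of the current state \(h_5\) depends on \(c_4\) instead of \(h_4\), where \(c_4 = \alpha_1h_1 + \alpha_2h_2 + \alpha_3h_3 + \alpha_4h_4\) and \(\alpha_i\) is the relevance between \(h_4\) and \(h_i\), computed as described earlier. Because this includes the relevance of a state with itself, the mechanism is called self-attention.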
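A minimal Python sketch of computing \(c_4\) from \(h_1, ..., h_4\) follows. It assumes plain dot-product scores between \(h_4\) and each \(h_i\), leaving out the learned projections for brevity; the function name `self_attention_context` and the state values are made up for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_context(states):
    """Weight every state by its relevance to the latest state, then sum."""
    query = states[-1]  # the latest state, h_4 in the text
    scores = [sum(a * b for a, b in zip(h, query)) for h in states]
    alpha = softmax(scores)  # includes the weight of h_4 with itself
    dim = len(query)
    context = [sum(a * h[k] for a, h in zip(alpha, states)) for k in range(dim)]
    return alpha, context

# Hypothetical states h_1..h_4.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
alpha, c4 = self_attention_context(states)
print(alpha)
print(c4)
```

With these values the largest weight falls on \(h_4\) itself, since its dot product with itself is the highest score.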

SimpleRNN vs. Self-Attention: Computing the Current State

Computing the state \(h_5\) in SimpleRNN:

\(h_5 = tanh(A·{x_5\brack h_4} + b)\)

Computing the state \(h_5\) with Self-Attention:

\(h_5 = tanh(A·{x_5\brack c_4} + b)\)
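The two updates differ only in the vector that is stacked under \(x_5\), which the sketch below makes explicit with a single step function. The name `rnn_step` and every value of `A`, `b`, `x5`, `h4`, and `c4` are hypothetical; in a trained model \(A\) and \(b\) are learned and \(c_4\) comes from the attention weights.

```python
import math

def rnn_step(A, b, x, prev):
    """Return tanh(A . [x; prev] + b), where prev is h_4 or c_4."""
    z = x + prev  # list concatenation stacks x on top of prev
    return [
        math.tanh(sum(a * zi for a, zi in zip(row, z)) + bi)
        for row, bi in zip(A, b)
    ]

# Hypothetical parameters and inputs (2-d states, 2-d input).
A = [[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.3, 0.2]]
b = [0.1, -0.1]
x5 = [1.0, 0.5]
h4 = [0.2, -0.4]   # previous hidden state
c4 = [0.3, 0.1]    # context vector over h_1..h_4

h5_simple_rnn = rnn_step(A, b, x5, h4)       # SimpleRNN update
h5_self_attention = rnn_step(A, b, x5, c4)   # Self-Attention update
print(h5_simple_rnn)
print(h5_self_attention)
```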

Reference

Wang Shusen's lecture on the Attention mechanism
