A First Look at the Ideas Behind the Attention Model

Posted by Andrew.Hann on 2018-09-29

Original URL: https://www.cnblogs.com/LittleHann/p/9722779.html

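1. Introduction to the attention model

0x1: What is AM

The attention model in deep learning mimics the attention mechanism of the human brain. For example, when we look at a painting we can take in the whole picture, but when we examine it closely our eyes focus on only a small region, and the brain concentrates on that region. In other words, the brain's attention over the picture is not uniform; different regions carry different weights. This is the core idea of the attention model in deep learning.

AM was first applied in the image domain, where it achieved very good results. People then began studying how to bring it into NLP. In NLP, the earliest paper to propose the attention idea is "Neural machine translation by jointly learning to align and translate" (the arXiv link below), which introduced the soft attention model and applied it to machine translation.

0x2: AM in machine translation

Encoder-Decoder model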

Relevant Link:

https://blog.csdn.net/mpk_no1/article/details/72862348
https://machinelearningmastery.com/encoder-decoder-attention-sequence-to-sequence-prediction-keras/
https://arxiv.org/pdf/1409.0473.pdf
https://www.zhihu.com/question/36591394

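0x3: Categories of attention mechanisms

1. Hard attention and soft attention

Put simply, soft attention computes an attention weight for every dimension (or position) of the input and assigns each a different weight according to its importance. The result is typically a weighted average over all of them.

Hard attention, by contrast, selects a single position (or a small subset) of the input: the weights are effectively one-hot, so only the chosen part is passed on. Because this selection is not differentiable, it is usually trained with sampling-based methods rather than plain backpropagation.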
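The difference between the two can be made concrete with a small NumPy sketch. It is only an illustration: the `values` and `scores` arrays are made-up numbers, and hard attention is shown in its simplest deterministic (argmax) form, whereas published hard-attention models usually sample the position instead.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()


def soft_attention(values, scores):
    """Weighted average of all rows of `values`; weights = softmax(scores)."""
    weights = softmax(scores)
    return weights @ values, weights


def hard_attention(values, scores):
    """Keep only the highest-scoring row (one-hot weights, argmax variant)."""
    weights = np.zeros_like(scores, dtype=float)
    weights[np.argmax(scores)] = 1.0
    return weights @ values, weights


# Three items with 2-D features and hypothetical relevance scores.
values = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
scores = np.array([0.1, 0.2, 3.0])

soft_out, soft_w = soft_attention(values, scores)
hard_out, hard_w = hard_attention(values, scores)
print("soft weights:", soft_w, "output:", soft_out)
print("hard weights:", hard_w, "output:", hard_out)
```

Soft attention gives every item a non-zero weight and stays differentiable; the hard variant discards everything except the selected item.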

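2. Global attention and local attention

3. Self-attention

Self-attention differs substantially from the traditional attention mechanism:

Traditional attention is computed from the hidden states of the source side and the target side, and the result captures the dependencies between each source word and each target word.

Self-attention, by contrast, is applied separately on the source side and on the target side. It depends only on the source input or the target input itself, and captures the dependencies among the words within the source or within the target. The self-attention obtained on the source side is then fed into the attention on the target side to capture the dependencies between source and target words.

As a result, self-attention often works better than the traditional attention mechanism. One of the main reasons is:

Traditional attention ignores the dependencies among words within the source sentence or within the target sentence. Self-attention captures not only the dependencies between source and target words, but also those among the words within each side.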
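As an illustration of "words attending to other words in the same sequence", here is a minimal NumPy sketch of self-attention on one sequence. It assumes the scaled dot-product formulation popularized by the Transformer, which the text above does not specify; the projection matrices `w_q`, `w_k`, `w_v` are random stand-ins for learned parameters.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one sequence.

    x: (seq_len, d_model) token representations.
    Returns the attended outputs and the (seq_len, seq_len) weight matrix,
    where row i holds how much token i attends to every token.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Queries, keys, and values all come from the same sequence `x`, which is what distinguishes this from source-to-target attention.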

Relevant Link:

https://zhuanlan.zhihu.com/p/31547842 
https://blog.csdn.net/jteng/article/details/52864401

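2. Understanding the idea behind the attention model through a simple example

Note that AM is not a specific algorithm or model; it is better thought of as an idea. In the author's view, it is essentially a more sensible way to design deep neural network architectures, together with a strategy for adjusting feature weights.

0x1: Dense layer - adding soft attention to a DNN hidden layer

In this subsection we demonstrate the AM idea with a simple DNN.

Suppose we have an input vector with dim=32, and we are designing a DNN to classify this 32-dimensional vector.

Before writing any code, we look at the distribution of the data and notice something interesting: one dimension of the feature vector is decisive for the label. Sample inputs and labels are shown below: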

testing_inputs_1 [[-7.03187310e-01  1.00000000e+00 -3.21814330e-01 -1.75507872e+00
   2.06664470e-01 -2.01126457e+00 -5.57250708e-01  3.37217008e-01
   1.54883597e+00 -1.37073656e+00  1.42529140e+00 -2.79463910e-01
  -5.59627907e-01  1.18638337e+00  1.69851891e+00 -1.69122016e+00
  -6.99522844e-01  5.82962842e-01  9.78222630e-01 -1.21737211e+00
  -1.32939545e+00 -1.45474227e-03 -1.31465268e+00 -3.79611743e-01
   1.26521065e+00  1.20667744e-01  1.47941778e-01 -2.75372579e+00
  -3.56896324e-01  7.71783656e-03  1.47827716e+00 -9.57614629e-01]
 [ 1.32900811e+00  0.00000000e+00  4.71557202e-01 -8.74652950e-03
   3.67018689e-01  1.11855474e+00 -8.38993512e-03  4.66315379e-01
   1.26326870e+00 -9.01654654e-01 -1.02884269e+00  5.69678421e-01
   6.41664780e-01  2.59811930e-01  1.19317814e+00 -1.04630036e+00
   1.39888921e-01 -1.73065584e+00 -1.30623116e-01 -1.31026002e+00
  -2.17131242e+00 -1.06618141e+00 -3.31618443e-02  1.46639575e+00
   8.76643096e-01  6.69989580e-01  6.97449511e-01 -2.52785434e-01
   5.67987107e-01  3.04387858e-01 -1.00002960e+00 -2.45641783e+00]
 [ 2.52307022e-01  1.00000000e+00 -1.58345465e+00  1.98042282e-01
   8.52522298e-02  6.40507750e-01 -7.90658155e-01  7.71182395e-01
  -1.95067777e+00 -1.29401021e+00 -1.07352377e+00  3.06910919e-02
   7.74109345e-01 -8.71396303e-01  1.66344014e-01  6.35789777e-01
   1.08167197e+00 -2.82773662e-01  1.55478794e+00 -8.58308135e-01
  -2.79650432e-01 -8.54234325e-02 -2.19597647e-01 -2.17359887e+00
   9.06332427e-01  7.50338575e-01 -5.75259737e-01 -3.68953224e-01
   7.65748246e-01 -1.10066159e+00  7.33829660e-01 -3.15740222e-02]
 [-1.27394186e+00  0.00000000e+00 -5.42515179e-01 -1.05202857e+00
  -7.75720653e-01 -1.23228165e-01 -5.36931271e-01  1.65373406e-01
   8.99855721e-01  1.25719599e+00  1.15406861e+00 -6.74225801e-01
   8.83266671e-01 -1.80074100e+00  3.15524021e-01 -2.98942433e-01
   9.23266706e-01 -8.64610423e-01  9.06323896e-01  1.43665365e-01
  -4.28784038e-01  4.36334858e-02 -1.15963013e+00 -1.44581716e-01
   1.06269721e+00  1.50348168e+00  8.90477309e-01  1.10184730e-01
  -2.80878365e-01  4.70876779e-01 -1.22654812e-01  1.80971612e+00]
 [-2.11504034e-01  0.00000000e+00  5.60009299e-01 -1.17945640e+00
  -4.67803781e-01 -1.74241319e+00 -3.70322401e-03 -2.17006719e+00
   4.24510049e-01  1.46478639e-01  5.92744407e-02 -4.91253927e-01
  -1.01717308e+00  4.19307196e-01 -7.71367508e-01  1.43788652e+00
   2.68676712e+00  3.96732882e-01  4.76923961e-01  8.15901697e-01
  -5.03092218e-01  1.44864196e-01  3.91584490e-02 -6.12835945e-01
   7.00882108e-01  9.76864848e-01 -6.30941522e-01 -8.38602720e-01
  -4.39203663e-01 -1.36452679e+00 -1.27237114e+00  8.60190888e-01]
 [ 9.14860457e-01  1.00000000e+00  1.56077637e-01  1.15855621e+00
  -4.98210125e-01  1.67069107e+00  4.31765280e-01  4.26712047e-01
   9.86745986e-01  9.77680603e-01 -1.06466820e+00  5.38847940e-01
   8.43082569e-01  9.00722906e-01 -8.01677331e-01  4.87130812e-01
  -3.58399587e-01  1.20297675e+00  4.58699197e-01 -1.11963082e+00
   3.35130398e-01 -6.86900220e-01  1.20681682e+00  1.91752106e+00
   5.42198956e-01  7.22353555e-01 -1.74881350e-01 -1.15996824e-01
  -1.98712683e+00  9.98292115e-03  7.12149198e-02 -1.75004126e+00]
 [ 5.54438377e-01  0.00000000e+00  1.72070508e+00 -2.39421276e+00
  -4.38335835e-01  1.22198125e+00  3.74376988e-01 -1.38100426e+00
  -6.76686553e-01  4.07591917e-01  5.93619771e-01  7.83618421e-01
   6.73002113e-01  4.78781433e-01  8.39040116e-01  8.69123716e-01
   1.34632773e+00  1.36734769e+00  3.66827392e-01  3.60041568e-01
   6.66945023e-01 -1.14536483e+00  4.38891453e-01 -4.37844713e-01
  -4.65689776e-01  3.12033012e-02 -8.19522312e-01  7.58853868e-01
   5.18056531e-01  4.28196906e-01  2.08135008e-01  1.24826488e+00]
 [ 1.04258559e+00  0.00000000e+00 -5.93238790e-01  1.52406418e+00
   1.21646035e+00  1.05836917e+00 -5.16890856e-01  1.08085391e+00
  -1.38284038e+00  1.06456352e-01  2.74257861e-01 -1.63748280e+00
   9.94120958e-01 -1.36070702e+00 -3.46128572e-01  1.56069434e+00
   6.36408438e-01 -2.13655632e-01 -5.30028711e-01 -1.14739552e+00
  -1.33102035e+00  8.67112945e-01  1.01777222e-01 -5.65421800e-01
   5.44866549e-01 -5.88216752e-01 -1.53028975e+00 -1.05510083e+00
   1.23102591e+00  1.49268412e+00  1.09572693e+00 -8.32754259e-01]
 [ 1.42119684e+00  1.00000000e+00 -6.68588743e-01  2.06587470e+00
   6.73939981e-01  1.78367879e-01  1.20959596e+00  2.05228057e+00
   1.17298340e+00 -2.99209254e-01  1.54491060e+00  5.13288354e-01
  -4.70304173e-01 -3.10097090e-01 -4.28043935e-01 -1.40723789e+00
  -7.96590363e-01 -8.85643489e-01  2.11063371e+00  1.07039253e+00
   1.39945292e+00  5.71403123e-01  2.75430532e-01 -1.99253003e-01
  -3.59019207e-01  1.26609682e-01 -1.69233428e+00  1.33714780e+00
  -1.10716769e+00 -5.72247993e-01  8.97152528e-01 -1.28169975e+00]
 [-1.89902418e+00  0.00000000e+00 -2.82853143e-01 -4.48757897e-01
   1.14923027e+00 -9.81086421e-01 -1.43486014e+00 -7.53626739e-01
   1.37505923e+00  6.51163018e-03 -5.37901188e-01  4.93670710e-01
  -8.27477300e-01  2.21030844e-01 -5.26978585e-01 -4.00566932e-01
  -4.59691412e-01 -1.87982990e+00  5.19494331e-01 -1.77753816e+00
  -2.89858663e-01  3.67898297e-01  9.63175026e-01 -4.51156518e-01
  -1.43890933e-01 -6.47600423e-01  7.69697009e-01 -1.29930416e+00
   7.55207368e-01  1.29158295e-01  1.12152724e+00 -3.52497951e-01]]
testing_outputs [[1]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]]

From the data above we can see:

1. A vector v of 32 values as input to the model (simple feedforward neural network).
2. v[1] = target.
3. Target is binary (either 0 or 1).
4. All the other values of the vector v (v[0] and v[2:32]) are purely random and do not contribute to the target.
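Data with these four properties could be generated roughly as follows. This is a sketch of what the `get_data` helper imported from `attention_utils` later in this article might look like; its real implementation is not shown here, so the signature and the `attention_column` parameter are assumptions.

```python
import numpy as np


def get_data(n, input_dim, attention_column=1):
    """Return n random vectors whose `attention_column` equals the label.

    x: (n, input_dim) standard-normal noise, except one column.
    y: (n, 1) binary labels in {0, 1}.
    """
    x = np.random.standard_normal(size=(n, input_dim))
    y = np.random.randint(low=0, high=2, size=(n, 1))
    # Overwrite one column with the label; every other column stays noise.
    x[:, attention_column] = y[:, 0]
    return x, y


x, y = get_data(5, 32)
print(x.shape, y.shape)
```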

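Following a rule-based or decision-tree approach, a test on that single feature alone would already give very good model performance.

But suppose I do not want a decision tree: a decision tree is too "hard" and throws away much of the probability distribution information in the input data, whereas the complex non-linear combinations of a deep DNN can fit the distribution in a "softer" way.

So how can we build this prior knowledge into the model, i.e. make the model pay more attention to the decisive feature dimension and relatively ignore the others?

The answer is the attention model idea.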

inputs = Input(shape=(input_dim,))

# ATTENTION PART STARTS HERE
attention_probs = Dense(input_dim, activation='softmax', name='attention_vec')(inputs)
attention_mul = Multiply()([inputs, attention_probs])
# ATTENTION PART FINISHES HERE

attention_mul = Dense(64)(attention_mul)
output = Dense(1, activation='sigmoid')(attention_mul)
model = Model(inputs=[inputs], outputs=output)

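After the input layer we add a Dense layer with a softmax activation whose number of units equals the input dimension. Its job is to use softmax to produce one weight per input dimension, so that the dimension contributing most to the target receives the largest weight.

The output of this attention layer is then merged with the input layer by element-wise multiplication, and a DNN hidden layer makes the final decision on the reweighted input.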
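The two attention lines (the softmax Dense layer followed by `Multiply`) amount to the following NumPy computation. This is a sketch for intuition only: `kernel` and `bias` are hand-set, hypothetical values chosen so that dimension 1 wins, not weights taken from a trained model.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


input_dim = 32
rng = np.random.default_rng(0)
x = rng.standard_normal((1, input_dim))
x[0, 1] = 1.0  # the decisive feature, as in the dataset described above

# Hypothetical parameters of the softmax Dense layer (not trained values):
# a zero kernel and a bias that strongly favours dimension 1.
kernel = np.zeros((input_dim, input_dim))
bias = np.zeros(input_dim)
bias[1] = 8.0

attention_probs = softmax(x @ kernel + bias)  # Dense(input_dim, softmax)
attention_mul = x * attention_probs           # Multiply()([inputs, probs])

print("largest attention weight at index", int(attention_probs.argmax()))
print("largest gated value at index", int(np.abs(attention_mul).argmax()))
```

With weights like these, the noise dimensions are scaled down to almost zero before they reach the following Dense layers, which is the behaviour the attention layer is meant to learn.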

After training with backpropagation, we look at the weights produced by the attention layer. The full script:

import numpy as np

from attention_utils import get_activations, get_data

np.random.seed(1337)  # for reproducibility
from keras.models import Model
from keras.layers import Input, Dense, Multiply

input_dim = 32


def build_model():
    inputs = Input(shape=(input_dim,))

    # ATTENTION PART STARTS HERE
    attention_probs = Dense(input_dim, activation='softmax', name='attention_vec')(inputs)
    attention_mul = Multiply()([inputs, attention_probs])
    # ATTENTION PART FINISHES HERE

    attention_mul = Dense(64)(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(inputs=[inputs], outputs=output)
    return model


def main():
    N = 10000
    inputs_1, outputs = get_data(N, input_dim)

    m = build_model()
    m.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    print(m.summary())

    m.fit([inputs_1], outputs, epochs=20, batch_size=64, validation_split=0.5)

    testing_inputs_1, testing_outputs = get_data(1, input_dim)
    print("testing_inputs_1", testing_inputs_1)
    print("testing_outputs", testing_outputs)

    # Attention vector corresponds to the second matrix.
    # The first one is the Inputs output.
    attention_vector = get_activations(m, testing_inputs_1,
                                       print_shape_only=True,
                                       layer_name='attention_vec')[0].flatten()
    print('attention =', attention_vector)

    # plot part.
    import matplotlib.pyplot as plt
    import pandas as pd

    pd.DataFrame(attention_vector, columns=['attention (%)']).plot(kind='bar',
                                                                   title='Attention Mechanism as '
                                                                         'a function of input'
                                                                         ' dimensions.')
    plt.show()


if __name__ == '__main__':
    main()

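As can be seen, v[1] receives by far the dominant attention weight.

0x2: LSTM/GRU layer

In this subsection we compare inserting the attention layer before versus after the LSTM, and look at how the attention weights are distributed in each case.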

import numpy as np
from keras import backend as K
from keras.layers import (Input, Dense, Multiply, Permute, Reshape,
                          Lambda, RepeatVector, Flatten, LSTM)
from keras.models import Model

from attention_utils import get_activations, get_data_recurrent

INPUT_DIM = 2
TIME_STEPS = 20
# if True, the attention vector is shared across the input_dimensions where the attention is applied.
SINGLE_ATTENTION_VECTOR = False
APPLY_ATTENTION_BEFORE_LSTM = False


def attention_3d_block(inputs):
    # inputs.shape = (batch_size, time_steps, input_dim)
    input_dim = int(inputs.shape[2])
    a = Permute((2, 1))(inputs)
    a = Reshape((input_dim, TIME_STEPS))(a) # this line is not useful. It's just to know which dimension is what.
    a = Dense(TIME_STEPS, activation='softmax')(a)
    if SINGLE_ATTENTION_VECTOR:
        a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)
        a = RepeatVector(input_dim)(a)
    a_probs = Permute((2, 1), name='attention_vec')(a)
    output_attention_mul = Multiply()([inputs, a_probs])
    return output_attention_mul


def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    attention_mul = Flatten()(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(inputs=[inputs], outputs=output)
    return model


def model_attention_applied_before_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    attention_mul = attention_3d_block(inputs)
    lstm_units = 32
    attention_mul = LSTM(lstm_units, return_sequences=False)(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(inputs=[inputs], outputs=output)
    return model


if __name__ == '__main__':

    N = 300000
    # N = 300 -> too few = no training
    inputs_1, outputs = get_data_recurrent(N, TIME_STEPS, INPUT_DIM)

    if APPLY_ATTENTION_BEFORE_LSTM:
        m = model_attention_applied_before_lstm()
    else:
        m = model_attention_applied_after_lstm()

    m.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    print(m.summary())

    m.fit([inputs_1], outputs, epochs=1, batch_size=64, validation_split=0.1)

    attention_vectors = []
    for i in range(300):
        testing_inputs_1, testing_outputs = get_data_recurrent(1, TIME_STEPS, INPUT_DIM)
        attention_vector = np.mean(get_activations(m,
                                                   testing_inputs_1,
                                                   print_shape_only=True,
                                                   layer_name='attention_vec')[0], axis=2).squeeze()
        print('attention =', attention_vector)
        assert abs(np.sum(attention_vector) - 1.0) < 1e-5
        attention_vectors.append(attention_vector)

    attention_vector_final = np.mean(np.array(attention_vectors), axis=0)
    # plot part.
    import matplotlib.pyplot as plt
    import pandas as pd

    pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(kind='bar',
                                                                         title='Attention Mechanism as '
                                                                               'a function of input'
                                                                               ' dimensions.')
    plt.show()
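The script above relies on `get_data_recurrent` from `attention_utils`, which is not shown in this article. A plausible sketch, under the assumption that it mirrors the Dense example by copying the label into every feature of one time step, is given below; the `attention_column=10` default and the exact signature are assumptions.

```python
import numpy as np


def get_data_recurrent(n, time_steps, input_dim, attention_column=10):
    """Return n random sequences in which one time step carries the label.

    x: (n, time_steps, input_dim) standard-normal noise, except one step.
    y: (n, 1) binary labels in {0, 1}.
    """
    x = np.random.standard_normal(size=(n, time_steps, input_dim))
    y = np.random.randint(low=0, high=2, size=(n, 1))
    # Copy the label into every feature of the chosen time step.
    x[:, attention_column, :] = np.tile(y, (1, input_dim))
    return x, y


x, y = get_data_recurrent(4, time_steps=20, input_dim=2)
print(x.shape, y.shape)
```

If the data is built this way, a well-trained attention layer would be expected to put most of its weight on that single time step.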

1. Directly on the inputs (same as the Dense example above): `APPLY_ATTENTION_BEFORE_LSTM = True`

Attention applied directly to the input layer tells us how important each part of the input feature space is.

2. After the LSTM layer: `APPLY_ATTENTION_BEFORE_LSTM = False`

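Attention placed after the LSTM lets the model focus its final decision, allocating most of the decision weight to the feature dimensions that genuinely help the final classification. However, what the attention layer sees at this point is the feature space already abstracted by the LSTM, so interpretability is comparatively poor.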
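In both placements, `attention_3d_block` performs the same tensor operations. The NumPy sketch below walks through the shapes for the case `SINGLE_ATTENTION_VECTOR = False`; `kernel` and `bias` are random stand-ins for the Dense layer's learned parameters, so the resulting weights are not meaningful, only their shapes and normalization are.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


batch, time_steps, input_dim = 2, 20, 2
rng = np.random.default_rng(0)
inputs = rng.standard_normal((batch, time_steps, input_dim))

# Stand-ins for the parameters of Dense(TIME_STEPS, activation='softmax').
kernel = rng.standard_normal((time_steps, time_steps))
bias = np.zeros(time_steps)

a = np.transpose(inputs, (0, 2, 1))   # Permute((2, 1)): (batch, input_dim, time_steps)
a = softmax(a @ kernel + bias)        # Dense acts on the last axis, i.e. over time
a_probs = np.transpose(a, (0, 2, 1))  # back to (batch, time_steps, input_dim)
output = inputs * a_probs             # Multiply()([inputs, a_probs])

# For each feature the weights over time sum to 1, so the per-step mean
# over features (what the script plots) also sums to 1.
per_step = a_probs.mean(axis=2)
print(a_probs.shape, per_step.sum(axis=1))
```

This is also why the script can assert that each averaged attention vector sums to 1.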

Relevant Link:

https://github.com/philipperemy/keras-attention-mechanism

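3. What can the attention model do for security?

The author's understanding of how this model works is not yet deep and is still developing through practice. Here are some scenarios that have already been validated on large datasets in real projects. Corrections are welcome.

0x1: Benign files containing malicious instructions

In security offense and defense, a very common scenario is that malware or an attacker uses automation to insert malicious shellcode or malicious script code into an otherwise benign file. This technique raises several issues for detection:

1. Traditional signature-based detection may be unaffected, because the malicious code will still be matched.
2. Detection based on anomalous behavior (e.g. sandbox replay) may be bypassed, because the program's runtime API call sequence as a whole may look normal.
3. Deep-learning-based detection is challenged. A CNN may be unaffected, but the requirements on the size and diversity of the training set go up.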
