CTR Study Notes & Code Implementation 6: The Next Wave of Deep CTR Models, xDeepFM/FiBiNET

Posted by 風雨中的小七 on 2020-06-01

Original URL: https://www.cnblogs.com/gogoSandy/p/13023265.html

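xDeepFM replaces the FM part of DeepFM with an improved DCN to learn feature-combination information, while FiBiNET applies SENET to add feature weights, going one step further than NFM and AFM. Before reading about these two models, it helps to have a basic understanding of DeepFM, Deep&Cross, AFM and NFM; if you are not familiar with them, see the links to the posts on the other models at the end of this article.

The code below uses dense input, which makes the model structure easier to follow. The code for sparse input and the complete code are here:
https://github.com/DSXiangLi/CTR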

xDeepFM

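Model structure

As the name suggests, xDeepFM resembles DeepFM: both have a Deep part and a Linear part. The difference is that the FM part, which DeepFM uses to learn second-order feature interactions, is replaced with CIN (Compressed Interaction Network). CIN in turn is a further refinement of the cross network in Deep&Cross (DCN). The overall model structure is shown below.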

[Figure: overall xDeepFM model structure]

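Let's focus on the CIN part. Following the paper's notation, there are m features, each feature embedding is D-dimensional, and the k-th CIN layer has \(H_k\) units. The computation of the k-th CIN layer splits into three parts, corresponding to panels a-c of the figure: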

[Figure: CIN computation, panels a-c]

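Pairwise element-wise product of vectors, time complexity \(O(m*H_{k-1}*D)\)
The input layer and the output of layer k-1 are multiplied element-wise for every pair of vectors, giving \(m*H_{k-1}\) D-dimensional vectors. If the CIN has only one layer, this is the same as the first step of FM, NFM and AFM. What differs is the aggregation: FM sums everything into a scalar, NFM sum-pools over the pairs into a single D-dimensional vector, and AFM adds attention to do weighted pooling over the pairs. Ignoring the batch axis, the dimensions change as follows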

\[z^k = x^0 \odot x^{k-1} = (D * m* 1) \odot (D * 1* H_{k-1}) = D * m*H_{k-1} \]

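Feature map, space complexity \(O(H_k *H_{k-1} *m)\), time complexity \(O(H_k *H_{k-1} *m*D)\)
\(W_k \in R^{H_{k-1}*m *H_k}\) is the weight tensor of the k-th layer; it can be understood as running a CNN along the embedding dimension. Each filter takes a weighted sum over all pairwise-product vectors to produce one \(1*D\) vector. With \(H_k\) channels in total, the output is an \(H_k * D\) matrix.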

\[W_k \bullet z^k = (H_k *H_{k-1} *m)* (m*H_{k-1}*D) = H_k *D \]

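Sum pooling
CIN applies sum pooling to each layer's output along the embedding dimension, giving an \(H_k*1\) output, and then concatenates the outputs of all layers as the output of the CIN part.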
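
The three steps above can be sketched for a single sample in NumPy. This is a minimal illustration rather than the author's implementation: the function name `cin_layer_forward`, the sizes and the random weights are all assumptions made for the example, and the batch axis is left out.

```python
import numpy as np

def cin_layer_forward(x0, x_prev, weight):
    """One CIN layer for a single sample (batch axis omitted).

    x0:     (m, D)            input field embeddings
    x_prev: (H_prev, D)       output of the previous CIN layer
    weight: (H_k, H_prev, m)  one filter per output feature map
    """
    # Step 1: pairwise element-wise products -> (H_prev, m, D)
    z = x_prev[:, None, :] * x0[None, :, :]
    # Step 2: each filter takes a weighted sum over all H_prev * m vectors -> (H_k, D)
    x_next = np.einsum('hpm,pmd->hd', weight, z)
    # Step 3: sum pooling along the embedding dimension -> (H_k,)
    pooled = x_next.sum(axis=1)
    return x_next, pooled

rng = np.random.default_rng(0)
m, D, H1 = 4, 3, 5                      # hypothetical sizes
x0 = rng.normal(size=(m, D))
w1 = rng.normal(size=(H1, m, m))        # first layer: H_prev equals m
x1, p1 = cin_layer_forward(x0, x0, w1)
print(x1.shape, p1.shape)               # (5, 3) (5,)
```

For the first layer the previous output is the input itself, so `x_prev` and `x0` are the same array.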

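Each CIN layer is computed as above. In a T-layer CIN, every layer interacts the previous layer's output with the first layer's input to obtain interactions one order higher. Assuming every layer has the same width \(H_k=H\), the overall time complexity of the CIN part is \(O(TDmH^2)\), and the space complexity, which comes from each layer's filter weights, is \(O(TmH^2)\).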
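
To make these orders of magnitude concrete, the per-layer counts can be added up directly. The helper `cin_cost` and the sizes below are hypothetical, and only the multiplications of the feature-map step are counted. Note that the first layer has \(H_0 = m\), so its cost is \(m^2 H\) rather than \(m H^2\).

```python
def cin_cost(field_size, emb_size, layer_sizes):
    """Return (parameter count, multiply count of the feature-map step) for a CIN stack."""
    params, mults = 0, 0
    h_prev = field_size                  # the input acts as layer 0 with m feature maps
    for h_k in layer_sizes:
        params += h_k * h_prev * field_size
        mults += h_k * h_prev * field_size * emb_size
        h_prev = h_k
    return params, mults

# hypothetical sizes: m = 10 fields, D = 8, T = 3 layers of width H = 16
print(cin_cost(10, 8, [16, 16, 16]))     # (6720, 53760)
```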

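CIN keeps DCN's arbitrary high-order interactions and parameter sharing. The two main differences are

DCN is bit-wise, CIN is vector-wise. When DCN multiplies vectors it does not distinguish fields: it takes the outer product directly on the input formed by concatenating all fields (m*D). CIN takes fields into account and multiplies whole vectors pairwise.
DCN uses a ResNet-style residual connection and, because of its polynomial kernel, only outputs the last layer, whereas CIN pools every layer and outputs all of them.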
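
The bit-wise versus vector-wise distinction may be easier to see in shapes. The sketch below puts one DCN cross layer, in the form \(x_{l+1} = x_0 (x_l^T w) + b + x_l\), next to the first CIN step. Sizes and random values are assumptions, and the batch axis is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
m, D = 3, 4                               # hypothetical sizes
emb = rng.normal(size=(m, D))             # one sample: m fields, D-dim embeddings

# DCN cross layer (bit-wise): works on the flattened m*D vector, ignoring field boundaries
x0 = emb.reshape(-1)                      # (m*D,)
w, b = rng.normal(size=m * D), np.zeros(m * D)
x1_dcn = x0 * (x0 @ w) + b + x0           # x0 * (x_l^T w) + b + x_l, with x_l = x0

# CIN step 1 (vector-wise): fields stay intact, whole vectors are multiplied pairwise
z_cin = emb[:, None, :] * emb[None, :, :] # (m, m, D)

print(x1_dcn.shape, z_cin.shape)          # (12,) (3, 3, 4)
```

The trailing `+ x0` in the DCN line is the residual connection mentioned above.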

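CIN's design is quite clever, but... time for some complaints: both the time complexity and the space complexity of CIN are higher than DCN's, and it feels more prone to overfitting. As for the claim that vector-wise products are better than bit-wise products, well... bit-wise at least does not require all embeddings to have the same dimension, while the advantage of vector-wise is honestly beyond me. If you understand it, please leave a comment.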

Code implementation

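import tensorflow as tf  # TensorFlow 1.x API (tf.layers, tf.variable_scope, tf.get_variable)

# add_layer_summary, stack_dense_layer, build_features and tf_estimator_model
# are not defined in this post; see the repository linked above.

def cross_op(xk, x0, layer_size_prev, layer_size_curr, layer, emb_size, field_size):
    # Pairwise product per embedding dimension: ( batch * D * HK-1 * 1) * (batch * D * 1* H0) -> batch * D * HK-1 * H0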
    zk = tf.matmul( tf.expand_dims(tf.transpose(xk, perm = (0, 2, 1)), 3),
                    tf.expand_dims(tf.transpose(x0, perm = (0, 2, 1)), 2))

    zk = tf.reshape(zk, [-1, emb_size, field_size * layer_size_prev]) # batch * D * HK-1 * H0 -> batch * D * (HK-1 * H0)
    add_layer_summary('zk_{}'.format(layer), zk)

    # Convolution with channel = HK: (batch * D * (HK-1*H0)) * ((HK-1*H0) * HK)-> batch * D * HK
    kernel = tf.get_variable(name = 'kernel{}'.format(layer),
                             shape = (field_size * layer_size_prev, layer_size_curr))
    xkk = tf.matmul(zk, kernel)
    xkk = tf.transpose(xkk, perm = [0,2,1]) # batch * HK * D
    add_layer_summary( 'Xk_{}'.format(layer), xkk )
    return xkk


def cin_layer(x0, cin_layer_size, emb_size, field_size):
    cin_output_list = []

    # prepend the field dimension for the input; build a new list so the caller's list is not mutated
    layer_sizes = [field_size] + list(cin_layer_size)
    with tf.variable_scope('Cin_component'):
        xk = x0
        for layer in range(1, len(layer_sizes)):
            with tf.variable_scope('Cin_layer{}'.format(layer)):
                # Do cross
                xk = cross_op(xk, x0, layer_sizes[layer-1], layer_sizes[layer],
                              layer, emb_size, field_size ) # batch * HK * D
                # sum pooling on dimension axis
                cin_output_list.append(tf.reduce_sum(xk, 2)) # batch * HK

    return tf.concat(cin_output_list, axis=1)

@tf_estimator_model
def model_fn_dense(features, labels, mode, params):
    dense_feature, sparse_feature = build_features()
    dense_input = tf.feature_column.input_layer(features, dense_feature)
    sparse_input = tf.feature_column.input_layer(features, sparse_feature)

    # Linear part
    with tf.variable_scope('Linear_component'):
        linear_output = tf.layers.dense( sparse_input, units=1 )
        add_layer_summary( 'linear_output', linear_output )

    # Deep part
    dense_output = stack_dense_layer( dense_input, params['hidden_units'],
                               params['dropout_rate'], params['batch_norm'],
                               mode, add_summary=True )
    # CIN part
    emb_size = dense_feature[0].variable_shape.as_list()[-1]
    field_size = len(dense_feature)
    embedding_matrix = tf.reshape(dense_input, [-1, field_size, emb_size]) # batch * field_size * emb_size
    add_layer_summary('embedding_matrix', embedding_matrix)

    cin_output = cin_layer(embedding_matrix, params['cin_layer_size'], emb_size, field_size)

    with tf.variable_scope('output'):
        y = tf.concat([dense_output, cin_output,linear_output], axis=1)
        y = tf.layers.dense(y, units= 1)
        add_layer_summary( 'output', y )

    return y
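
As a sanity check on the shapes, the matmul-and-reshape sequence used in cross_op above can be replayed in NumPy and compared with an explicit weighted sum over all (previous feature map, field) pairs. This is only a sketch: the sizes are hypothetical, and the random arrays stand in for the embeddings and for the learned kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
B, m, D, H_prev, H_k = 2, 4, 3, 5, 6      # hypothetical sizes
x0 = rng.normal(size=(B, m, D))
xk = rng.normal(size=(B, H_prev, D))
kernel = rng.normal(size=(H_prev * m, H_k))

# Same sequence of ops as cross_op, written with NumPy
a = np.transpose(xk, (0, 2, 1))[..., None]        # B * D * H_prev * 1
b = np.transpose(x0, (0, 2, 1))[:, :, None, :]    # B * D * 1 * m
zk = np.matmul(a, b).reshape(B, D, H_prev * m)    # B * D * (H_prev * m)
out = np.transpose(zk @ kernel, (0, 2, 1))        # B * H_k * D

# Reference: explicit weighted sum over all (previous map, field) pairs
w = kernel.reshape(H_prev, m, H_k)
ref = np.einsum('bpd,bqd,pqh->bhd', xk, x0, w)
print(np.allclose(out, ref))                      # True
```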

FiBiNET

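Model structure

Before reading about FiBiNET it helps to know the Squeeze-and-Excitation Network; if you are interested, see this blog post: Squeeze-and-Excitation Networks.

FiBiNET's main innovation is to use SENET to learn the importance of each feature and re-weight the embeddings into a new embedding matrix. Before FiBiNET, AFM, PNN, DCN and the xDeepFM above all learned weights for feature interactions only after the interaction step, through attention, weighting and similar means. FiBiNET keeps that part and additionally considers the weight of each feature itself at the embedding stage. The model structure is shown below.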

[Figure: FiBiNET model structure]

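The original embeddings and the new embeddings re-weighted by SENET each go through the Bilinear-Interaction layer to learn second-order interaction features; the results are concatenated and passed through an MLP to learn higher-order features. Notation follows the paper (argh, could everyone please agree on one notation, I get confused reading my own comments): f features, k-dimensional embeddings.

SENET layer

The SENET layer learns a weight for each feature and uses it to re-weight the embeddings, in the following three steps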

[Figure: SENET layer]

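Squeeze
Compress the \(f*k\) embedding matrix into \(f*1\). The pooling method is not fixed: the FiBiNET paper describes the original SENET as using max pooling (the SENET paper itself squeezes with global average pooling), while the FiBiNET authors use mean pooling, matching the \(\frac{1}{k}\) average in the formula below. My feeling is that the choice should depend on how the embedding encodes information.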

\[\begin{align} E &= [e_1,...,e_f] \\ Z &= [z_1,...,z_f] \\ z_i &= F_{squeeze}(e_i) = \frac{1}{k}\sum_{t=1}^{k} e_i^{(t)} \\ \end{align} \]

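Excitation
Excitation is a two-layer fully connected network that filters out useless features by first reducing and then restoring the dimension. The amount of reduction is controlled by an extra variable \(r\): the first-layer weight is \(W_1 \in R^{f*f/r}\) and the second-layer weight is \(W_2 \in R^{f/r*f}\). The larger r is, the stronger the compression and the more concentrated the final weights; the smaller r is, the more spread out they are.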

\[A = \sigma_2(W_2·\sigma_1(W_1·Z)) \]

Re-weight
The last step uses the per-feature weights from Excitation to re-weight the embeddings into the new embeddings

\[E_{new} = F_{Reweight}(A,E) = [a_1·e_1, ...,a_f·e_f ] \]
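
Squeeze, Excitation and Re-weight can be sketched for one sample in NumPy. The assumptions here are: mean pooling for the squeeze, ReLU for both \(\sigma_1\) and \(\sigma_2\) (as in the code implementation in this post), no bias terms, and random weights. The function name `senet_reweight` is made up for the example.

```python
import numpy as np

def senet_reweight(emb, w1, w2):
    """emb: (f, k) embeddings of one sample; w1: (f, f//r); w2: (f//r, f)."""
    z = emb.mean(axis=1)                            # Squeeze: (f,)
    hidden = np.maximum(z @ w1, 0)                  # first layer with ReLU: (f//r,)
    a = np.maximum(hidden @ w2, 0)                  # second layer with ReLU: (f,)
    return a, emb * a[:, None]                      # Re-weight: scale each field's embedding

rng = np.random.default_rng(0)
f, k, r = 6, 4, 2                                   # hypothetical sizes
emb = rng.normal(size=(f, k))
w1 = rng.normal(size=(f, f // r))
w2 = rng.normal(size=(f // r, f))
a, emb_new = senet_reweight(emb, w1, w2)
print(a.shape, emb_new.shape)                       # (6,) (6, 4)
```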

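In an experiment on the income dataset, with r=2 about 46% of the embedding feature weights are 0, so before feature interaction SENET first filters out some features that are useless for the target, which raises the weight of the useful ones.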
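
Exact zeros are possible because, in the code implementation in this post, the second excitation layer ends with a ReLU, which maps every negative pre-activation to 0. A hypothetical helper for measuring that share is sketched below; the random input is only a stand-in and is not meant to reproduce the income-dataset number.

```python
import numpy as np

def zero_weight_fraction(a):
    """Share of entries in the excitation output `a` (batch * f) that are exactly zero."""
    return float(np.mean(a == 0))

rng = np.random.default_rng(0)
pre_activation = rng.normal(size=(1000, 8))   # random stand-in, not the income dataset
a = np.maximum(pre_activation, 0)             # ReLU output is exactly 0 for negative inputs
print(zero_weight_fraction(a))
```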


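Bilinear-Interaction layer

The authors argue that neither the inner product nor the element-wise product is enough to capture feature-interaction information, so they introduce a weight matrix W and interact features as follows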

\[v_i · W \odot v_j \]

[Figure: Bilinear-Interaction layer]

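There are three choices for W: all feature interactions share one weight matrix (Field-All); each feature shares one weight matrix across its interactions with other features (Field-Each); or each feature interaction has its own weight matrix (Field-Interaction). Which one is best probably has to be tested case by case, but the usual rule of thumb still applies: the less data, the fewer parameters.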
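
The three weight-sharing schemes differ only in which matrix is looked up for a pair (i, j). A minimal NumPy sketch for one sample follows. The function names and the containers used for `weights` (a single array, a list indexed by i, a dict keyed by (i, j)) are assumptions made for the example.

```python
import numpy as np
from itertools import combinations

def bilinear_interaction(emb, weights, kind):
    """emb: (f, k). Returns (f*(f-1)/2, k) interaction vectors p_ij = (v_i W) * v_j."""
    out = []
    for i, j in combinations(range(emb.shape[0]), 2):
        if kind == 'field_all':
            w = weights                      # one shared (k, k) matrix
        elif kind == 'field_each':
            w = weights[i]                   # one matrix per left-hand field i
        elif kind == 'field_interaction':
            w = weights[(i, j)]              # one matrix per field pair
        else:
            raise ValueError(kind)
        out.append((emb[i] @ w) * emb[j])
    return np.stack(out)

def n_params(f, k, kind):
    n_matrices = {'field_all': 1, 'field_each': f, 'field_interaction': f * (f - 1) // 2}
    return n_matrices[kind] * k * k

f, k = 4, 3                                  # hypothetical sizes
emb = np.random.default_rng(0).normal(size=(f, k))
p = bilinear_interaction(emb, np.eye(k), 'field_all')
print(p.shape)                               # (6, 3)
print([n_params(f, k, t) for t in ('field_all', 'field_each', 'field_interaction')])
# [9, 36, 54]
```

With W set to the identity matrix, as in the usage lines, the bilinear interaction reduces to the plain element-wise product.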

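After the original embeddings and the re-weighted embeddings have gone through Bilinear-Interaction, the interaction features are concatenated into a shallow layer and then passed through fully connected layers to learn higher-order feature interactions. The rest is routine and is not covered in detail here.

We won't complain here that FiBiNET could be plugged into a wide&deep framework to capture low-order features and arbitrary high-order information; the main takeaway is to add FiBiNET's SENET feature-weighting idea to your own toolbox.

Code implementation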
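
The width of that shallow layer follows directly from the number of field pairs. A small hypothetical helper, assuming both bilinear outputs are flattened and concatenated as in the code implementation in this post:

```python
def shallow_layer_width(field_size, emb_size):
    """Width of the concatenated bilinear outputs from the original and SENET embeddings."""
    n_pairs = field_size * (field_size - 1) // 2
    return 2 * n_pairs * emb_size

print(shallow_layer_width(10, 8))   # 720
```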

def Bilinear_layer(embedding_matrix, field_size, emb_size, type, name):
    # Bilinear_layer: combine inner and element-wise product
    interaction_list = []
    with tf.variable_scope('BI_interaction_{}'.format(name)):
        if type == 'field_all':
            weight = tf.get_variable( shape=(emb_size, emb_size), initializer=tf.truncated_normal_initializer(),
                                      name='Bilinear_weight_{}'.format(name) )
        for i in range(field_size):
            if type == 'field_each':
                weight = tf.get_variable( shape=(emb_size, emb_size), initializer=tf.truncated_normal_initializer(),
                                          name='Bilinear_weight_{}_{}'.format(i, name) )
            for j in range(i+1, field_size):
                if type == 'field_interaction':
                    weight = tf.get_variable( shape=(emb_size, emb_size), initializer=tf.truncated_normal_initializer(),
                                          name='Bilinear_weight_{}_{}_{}'.format(i,j, name) )
                vi = tf.gather(embedding_matrix, indices = i, axis =1, batch_dims =0, name ='v{}'.format(i)) # batch * emb_size
                vj = tf.gather(embedding_matrix, indices = j, axis =1, batch_dims =0, name ='v{}'.format(j)) # batch * emb_size
                pij = tf.multiply(tf.matmul(vi, weight), vj) # bilinear : (vi * wij) \odot vj
                interaction_list.append(pij)

        combination = tf.stack(interaction_list, axis =1 ) # batch * (field_size * (field_size-1)/2) * emb_size
        combination = tf.reshape(combination, shape = [-1, emb_size * (field_size * (field_size-1) // 2) ]) # batch * ~
        add_layer_summary( 'bilinear_output', combination )

    return combination


def SENET_layer(embedding_matrix, field_size, emb_size, pool_op, ratio):
    with tf.variable_scope('SENET_layer'):
        # squeeze embedding to scalar for each field
        with tf.variable_scope('pooling'):
            if pool_op == 'max':
                z = tf.reduce_max(embedding_matrix, axis=2) # batch * field_size * emb_size -> batch * field_size
            else:
                z = tf.reduce_mean(embedding_matrix, axis=2)
            add_layer_summary('pooling_scalar', z)

        # excitation learns the weight of each field from the scalar above
        with tf.variable_scope('excitation'):
            z1 = tf.layers.dense(z, units = field_size//ratio, activation = 'relu')
            a = tf.layers.dense(z1, units= field_size, activation = 'relu') # batch * field_size
            add_layer_summary('excitation_weight', a )

        # re-weight embedding with weight
        with tf.variable_scope('reweight'):
            senet_embedding = tf.multiply(embedding_matrix, tf.expand_dims(a, axis = -1)) # (batch * field * emb) * ( batch * field * 1)
            add_layer_summary('senet_embedding', senet_embedding) # batch * field_size * emb_size

        return senet_embedding

@tf_estimator_model
def model_fn_dense(features, labels, mode, params):
    dense_feature, sparse_feature = build_features()
    dense_input = tf.feature_column.input_layer(features, dense_feature)
    sparse_input = tf.feature_column.input_layer(features, sparse_feature)

    # Linear part
    with tf.variable_scope('Linear_component'):
        linear_output = tf.layers.dense( sparse_input, units=1 )
        add_layer_summary( 'linear_output', linear_output )

    field_size = len(dense_feature)
    emb_size = dense_feature[0].variable_shape.as_list()[-1]
    embedding_matrix = tf.reshape(dense_input, [-1, field_size, emb_size])

    # SENET_layer to get new embedding matrix
    senet_embedding_matrix = SENET_layer(embedding_matrix, field_size, emb_size,
                                         pool_op = params['pool_op'], ratio= params['senet_ratio'])

    # combination layer & BI_interaction
    BI_org = Bilinear_layer(embedding_matrix, field_size, emb_size, type = params['bilinear_type'], name = 'org')
    BI_senet = Bilinear_layer(senet_embedding_matrix, field_size, emb_size, type = params['bilinear_type'], name = 'senet')

    combination_layer = tf.concat([BI_org, BI_senet] , axis =1)

    # Deep part
    dense_output = stack_dense_layer(combination_layer, params['hidden_units'],
                               params['dropout_rate'], params['batch_norm'],
                               mode, add_summary=True )

    with tf.variable_scope('output'):
        y = dense_output + linear_output
        add_layer_summary( 'output', y )

    return y

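CTR Study Notes & Code Implementation series

https://github.com/DSXiangLi/CTR
CTR Study Notes & Code Implementation 1: Prelude to Deep Learning, LR->FFM
CTR Study Notes & Code Implementation 2: Deep CTR Models, MLP->Wide&Deep
CTR Study Notes & Code Implementation 3: Deep CTR Models, FNN->PNN->DeepFM
CTR Study Notes & Code Implementation 4: Deep CTR Models, NFM/AFM
CTR Study Notes & Code Implementation 5: Deep CTR Models, DeepCrossing -> Deep&Cross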

Ref

Jianxun Lian, 2018, xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems
Tongwen Huang, 2019, FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction
Jie Hu, 2017, Squeeze-and-Excitation Networks
https://zhuanlan.zhihu.com/p/72931811
https://zhuanlan.zhihu.com/p/79659557
https://zhuanlan.zhihu.com/p/57162373
https://github.com/qiaoguan/deep-ctr-prediction
