Original paper: https://arxiv.org/pdf/1606.07792.pdf
Wide & Deep Learning for Recommender Systems
Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra,
Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil,
Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, Hemal Shah
Google Inc.
ABSTRACT
Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations is effective and interpretable, while generalization requires more feature engineering effort. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank. In this paper, we present Wide & Deep learning—jointly trained wide linear models and deep neural networks—to combine the benefits of memorization and generalization for recommender systems. We productionized and evaluated the system on Google Play, a commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models. We have also open-sourced our implementation in TensorFlow.
CCS Concepts
- Computing methodologies → Machine learning; Neural networks; Supervised learning;
- Information systems → Recommender systems;

Keywords
Wide & Deep Learning, Recommender Systems.
1. INTRODUCTION
A recommender system can be viewed as a search ranking system, where the input query is a set of user and contextual information, and the output is a ranked list of items. Given a query, the recommendation task is to find the relevant items in a database and then rank the items based on certain objectives, such as clicks or purchases.
One challenge in recommender systems, similar to the general search ranking problem, is to achieve both memorization and generalization. Memorization can be loosely defined as learning the frequent co-occurrence of items or features and exploiting the correlation available in the historical data. Generalization, on the other hand, is based on transitivity of correlation and explores new feature combinations that
have never or rarely occurred in the past. Recommendations based on memorization are usually more topical and directly relevant to the items on which users have already performed actions. Compared with memorization, generalization tends to improve the diversity of the recommended items. In this paper, we focus on the apps recommendation problem for the Google Play store, but the approach should apply to generic recommender systems.
For massive-scale online recommendation and ranking systems in an industrial setting, generalized linear models such as logistic regression are widely used because they are simple, scalable and interpretable.
The models are often trained on binarized sparse features with one-hot encoding.
E.g., the binary feature “user_installed_app=netflix” has value 1 if the user installed Netflix. Memorization can be achieved effectively using cross-product transformations over sparse features, such as AND(user_installed_app=netflix, impression_app=pandora), whose value is 1 if the user installed Netflix and is later shown Pandora. This explains how the co-occurrence of a feature pair correlates with the target label.
Generalization can be added by using features that are less granular, such as AND(user_installed_category=video, impression_category=music), but manual feature engineering is often required. One limitation of cross-product transformations is that they do not generalize to query-item feature pairs that have not appeared in the training data.
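To make the memorization mechanism concrete, here is a minimal Python sketch of one-hot features and a cross-product feature. The feature names follow the paper's examples; the helper function itself is purely illustrative:

```python
# Minimal sketch of one-hot and cross-product features as described above.
# Feature names follow the paper's examples; the helper is hypothetical.

def cross_product(example, feature_a, feature_b):
    """Returns 1 iff both binary features are active in the example."""
    return int(example.get(feature_a, 0) == 1 and example.get(feature_b, 0) == 1)

# One-hot encoded example: the user installed Netflix and was shown Pandora.
example = {
    "user_installed_app=netflix": 1,
    "impression_app=pandora": 1,
}

# AND(user_installed_app=netflix, impression_app=pandora) -> 1
print(cross_product(example,
                    "user_installed_app=netflix",
                    "impression_app=pandora"))
```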
Embedding-based models, such as factorization machines [5] or deep neural networks, can generalize to previously unseen query-item feature pairs by learning a low-dimensional dense embedding vector for each query and item feature, with less burden of feature engineering.
However, it is difficult to learn effective low-dimensional representations for queries and items when the underlying query-item matrix is sparse and high-rank, such as users with specific preferences or niche items with a narrow appeal. In such cases, there should be no interactions between most query-item pairs, but dense embeddings will lead to nonzero predictions for all query-item pairs, and thus can over-generalize and make less relevant recommendations. On the other hand, linear models with cross-product feature transformations can memorize these “exception rules” with much fewer parameters.
In this paper, we present the Wide & Deep learning framework to achieve both memorization and generalization in one model, by jointly training a linear model component and a neural network component as shown in Figure 1.
Figure 1: The spectrum of Wide & Deep models.
The main contributions of the paper include:
• The Wide & Deep learning framework for jointly training feed-forward neural networks with embeddings and linear models with feature transformations for generic recommender systems with sparse inputs.
• The implementation and evaluation of the Wide & Deep recommender system productionized on Google Play, a mobile app store with over one billion active users and over one million apps.
• We have open-sourced our implementation along with a high-level API in TensorFlow (see the Wide & Deep Tutorial on http://tensorflow.org).

While the idea is simple, we show that the Wide & Deep framework significantly improves the app acquisition rate on the mobile app store, while satisfying the training and serving speed requirements.
2. RECOMMENDER SYSTEM OVERVIEW
An overview of the app recommender system is shown in Figure 2. A query, which can include various user and contextual features, is generated when a user visits the app store. The recommender system returns a list of apps (also referred to as impressions) on which users can perform certain actions such as clicks or purchases. These user actions, along with the queries and impressions, are recorded in the logs as the training data for the learner.
Figure 2: Overview of the recommender system.
Since there are over a million apps in the database, it is intractable to exhaustively score every app for every query within the serving latency requirements (often O(10) milliseconds). Therefore, the first step upon receiving a query is retrieval. The retrieval system returns a short list of items that best match the query using various signals, usually a combination of machine-learned models and human-defined rules. After reducing the candidate pool, the ranking system ranks all items by their scores. The scores are usually P(y|x), the probability of a user action label y given the features x, including user features (e.g., country, language, demographics), contextual features (e.g., device, hour of the day, day of the week), and impression features (e.g., app age, historical statistics of an app). In this paper, we focus on the ranking model using the Wide & Deep learning framework.
3. WIDE & DEEP LEARNING
3.1 The Wide Component
The wide component is a generalized linear model of the form y = w^T x + b, as illustrated in Figure 1 (left). y is the prediction, x = [x1, x2, ..., xd] is a vector of d features, w = [w1, w2, ..., wd] are the model parameters, and b is the bias. The feature set includes raw input features and transformed features. One of the most important transformations is the cross-product transformation, which is defined as:

$$\phi_k(\mathbf{x}) = \prod_{i=1}^{d} x_i^{c_{ki}}, \quad c_{ki} \in \{0, 1\}$$
where c_ki is a boolean variable that is 1 if the i-th feature is part of the k-th transformation φ_k, and 0 otherwise. For binary features, a cross-product transformation (e.g., “AND(gender=female, language=en)”) is 1 if and only if the constituent features (“gender=female” and “language=en”) are all 1, and 0 otherwise. This captures the interactions between the binary features, and adds nonlinearity to the generalized linear model.
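Read literally, the transformation is a product over the selected features. A minimal sketch, assuming binary 0/1 feature values as in the examples above:

```python
# A direct reading of the cross-product transformation phi_k(x) = prod_i x_i^{c_ki}.
# c_k marks which features participate in the k-th transformation; features are
# assumed to be binary 0/1 values, as in the examples above.

def phi_k(x, c_k):
    """Cross-product transformation: 1 iff every feature selected by c_k is 1."""
    value = 1
    for x_i, c_ki in zip(x, c_k):
        value *= x_i ** c_ki  # x_i^0 == 1, so unselected features are ignored
    return value

# x = [gender=female, language=en]; the transformation selects both features.
print(phi_k([1, 1], [1, 1]))  # 1: both constituent features are active
print(phi_k([1, 0], [1, 1]))  # 0: language=en is absent
```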
3.2 The Deep Component
The deep component is a feed-forward neural network, as shown in Figure 1 (right). For categorical features, the original inputs are feature strings (e.g., “language=en”). Each of these sparse, high-dimensional categorical features is first converted into a low-dimensional and dense real-valued vector, often referred to as an embedding vector. The dimensionality of the embeddings is usually on the order of O(10) to O(100). The embedding vectors are initialized randomly and then the values are trained to minimize the final loss function during model training. These low-dimensional dense embedding vectors are then fed into the hidden layers of a neural network in the forward pass. Specifically, each hidden layer performs the following computation:

$$a^{(l+1)} = f\big(W^{(l)} a^{(l)} + b^{(l)}\big)$$
where l is the layer number and f is the activation function, often rectified linear units (ReLUs). a^(l), b^(l), and W^(l) are the activations, bias, and model weights at the l-th layer.
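A minimal NumPy sketch of one such hidden-layer step; the dimensions here are illustrative only (the paper fixes embedding sizes only to the order of O(10) to O(100)):

```python
import numpy as np

# One hidden-layer step a^(l+1) = f(W^(l) a^(l) + b^(l)), with ReLU as f.

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
a_l = rng.normal(size=32)        # e.g., a 32-dimensional embedding vector
W_l = rng.normal(size=(16, 32))  # weights of the l-th layer (illustrative shape)
b_l = np.zeros(16)               # bias of the l-th layer

a_next = relu(W_l @ a_l + b_l)   # activations fed to the next layer
print(a_next.shape)              # (16,)
```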
3.3 Joint Training of Wide & Deep Model
The wide component and deep component are combined using a weighted sum of their output log odds as the prediction, which is then fed to one common logistic loss function for joint training. Note that there is a distinction between joint training and ensemble. In an ensemble, individual models are trained separately without knowing each other, and their predictions are combined only at inference time but not at training time. In contrast, joint training optimizes all parameters simultaneously by taking both the wide and deep part as well as the weights of their sum into account at training time. There are implications on model size too: For an ensemble, since the training is disjoint, each individual model size usually needs to be larger (e.g., with more features and transformations) to achieve reasonable accuracy for an ensemble to work. In comparison, for joint training the wide part only needs to complement the weaknesses of the deep part with a small number of cross-product feature transformations, rather than a full-size wide model.
Joint training of a Wide & Deep Model is done by back-propagating the gradients from the output to both the wide and deep part of the model simultaneously using mini-batch stochastic optimization. In the experiments, we used the Follow-the-regularized-leader (FTRL) algorithm [3] with L1 regularization as the optimizer for the wide part of the model, and AdaGrad [1] for the deep part.
The combined model is illustrated in Figure 1 (center). For a logistic regression problem, the model’s prediction is:

$$P(Y = 1 \mid \mathbf{x}) = \sigma\big(\mathbf{w}_{wide}^T [\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{deep}^T a^{(l_f)} + b\big)$$
where Y is the binary class label, σ(·) is the sigmoid function, φ(x) are the cross-product transformations of the original features x, and b is the bias term. w_wide is the vector of all wide model weights, and w_deep are the weights applied on the final activations a^(l_f).
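The prediction formula can be read directly as code. A sketch in NumPy, where all shapes and values are illustrative assumptions:

```python
import numpy as np

# Combined prediction: P(Y=1|x) = sigma(w_wide^T [x, phi(x)] + w_deep^T a_lf + b).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10).astype(float)      # raw binary features
phi_x = rng.integers(0, 2, size=4).astype(float)   # cross-product transformations
a_lf = rng.normal(size=16)   # final hidden activations of the deep part

w_wide = rng.normal(size=14)  # weights over the concatenation [x, phi(x)]
w_deep = rng.normal(size=16)  # weights over the final activations
b = 0.0

wide_input = np.concatenate([x, phi_x])
p = sigmoid(w_wide @ wide_input + w_deep @ a_lf + b)
print(p)  # probability of a positive label (e.g., an app install)
```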
4. SYSTEM IMPLEMENTATION
The implementation of the apps recommendation pipeline consists of three stages: data generation, model training, and model serving as shown in Figure 3.
Figure 3: Apps recommendation pipeline overview.
4.1 Data Generation
In this stage, user and app impression data within a period of time are used to generate training data. Each example corresponds to one impression. The label is app acquisition: 1 if the impressed app was installed, and 0 otherwise.
Vocabularies, which are tables mapping categorical feature strings to integer IDs, are also generated in this stage. The system computes the ID space for all the string features that occurred more than a minimum number of times. Continuous real-valued features are normalized to [0, 1] by mapping a feature value x to its cumulative distribution function P(X ≤ x), divided into n_q quantiles. The normalized value is (i − 1)/(n_q − 1) for values in the i-th quantile. Quantile boundaries are computed during data generation.
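A sketch of this quantile normalization in NumPy; the boundary handling (how edge values are assigned to quantiles) is an assumption, since the text only gives the mapping (i − 1)/(n_q − 1):

```python
import numpy as np

# Map a raw value falling in the i-th of n_q quantiles to (i - 1) / (n_q - 1).

def quantile_normalize(value, boundaries):
    """Map a raw value to [0, 1] given precomputed quantile boundaries."""
    n_q = len(boundaries) + 1                   # number of quantiles
    i = np.searchsorted(boundaries, value) + 1  # 1-based quantile index
    return (i - 1) / (n_q - 1)

raw = np.random.default_rng(0).exponential(size=10_000)  # e.g., "app age"
n_q = 10
boundaries = np.quantile(raw, np.linspace(0, 1, n_q + 1)[1:-1])

print(quantile_normalize(raw[0], boundaries))  # a value in [0, 1]
```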
Figure 4: Wide & Deep model structure for apps recommendation.
4.2 Model Training
The model structure we used in the experiment is shown in Figure 4. During training, our input layer takes in training data and vocabularies and generates sparse and dense features together with a label. The wide component consists of the cross-product transformation of user installed apps and impression apps. For the deep part of the model, a 32-dimensional embedding vector is learned for each categorical feature. We concatenate all the embeddings together with the dense features, resulting in a dense vector of approximately 1200 dimensions. The concatenated vector is then fed into 3 ReLU layers, and finally the logistic output unit.
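As a rough sketch, this structure maps naturally onto the TensorFlow Estimator API that accompanied the open-sourced implementation. The feature names and bucket sizes below are assumptions, and the hidden-unit sizes [1024, 512, 256] follow the layer sizes shown in Figure 4:

```python
import tensorflow as tf  # TF 1.x-style tf.estimator / tf.feature_column API

# Wide part: cross-product of installed apps and impression apps.
# (crossed_column takes raw string feature names; sizes are assumptions.)
wide_columns = [
    tf.feature_column.crossed_column(
        ["user_installed_app", "impression_app"], hash_bucket_size=100_000),
]

# Deep part: a 32-dimensional embedding per categorical feature,
# concatenated with dense features, then 3 ReLU layers.
user_installed = tf.feature_column.categorical_column_with_hash_bucket(
    "user_installed_app", hash_bucket_size=10_000)
impression = tf.feature_column.categorical_column_with_hash_bucket(
    "impression_app", hash_bucket_size=10_000)
deep_columns = [
    tf.feature_column.embedding_column(user_installed, dimension=32),
    tf.feature_column.embedding_column(impression, dimension=32),
    tf.feature_column.numeric_column("app_age"),  # example dense feature
]

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,  # trained with FTRL by default
    dnn_feature_columns=deep_columns,     # trained with AdaGrad by default
    dnn_hidden_units=[1024, 512, 256],    # assumption: layer sizes from Figure 4
)
```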
The Wide & Deep models are trained on over 500 billion examples. Every time a new set of training data arrives, the model needs to be re-trained. However, retraining from scratch every time is computationally expensive and delays the time from data arrival to serving an updated model. To tackle this challenge, we implemented a warm-starting system which initializes a new model with the embeddings and the linear model weights from the previous model.
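With the same Estimator API, warm-starting could look like the following sketch; the checkpoint path and the choice of variables to warm-start are placeholders:

```python
import tensorflow as tf

# Initialize the embeddings and linear weights of a new model from the
# previous model's checkpoint instead of training from scratch.

wide_columns = [tf.feature_column.crossed_column(
    ["user_installed_app", "impression_app"], hash_bucket_size=100_000)]
deep_columns = [tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_hash_bucket(
        "user_installed_app", hash_bucket_size=10_000),
    dimension=32)]

warm_start = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from="/path/to/previous_model",  # placeholder path
    vars_to_warm_start=".*",  # could be narrowed to embedding/linear variables
)

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[1024, 512, 256],
    warm_start_from=warm_start,
)
```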
Before loading the models into the model servers, a dry run of the model is done to make sure that it does not cause problems in serving live traffic. We empirically validate the model quality against the previous model as a sanity check.
4.3 Model Serving
Once the model is trained and verified, we load it into the model servers. For each request, the servers receive a set of app candidates from the app retrieval system and user features to score each app. Then, the apps are ranked from the highest scores to the lowest, and we show the apps to the users in this order. The scores are calculated by running a forward inference pass over the Wide & Deep model.
In order to serve each request on the order of 10 ms, we optimized the performance using multithreading parallelism by running smaller batches in parallel, instead of scoring all candidate apps in a single batch inference step.
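A sketch of this serving pattern in plain Python; `score_batch` is a hypothetical stand-in for a forward inference pass over the model:

```python
from concurrent.futures import ThreadPoolExecutor

def score_batch(apps):
    """Placeholder for the model's forward pass over one batch of candidates."""
    return [hash(app) % 100 / 100.0 for app in apps]  # placeholder scores

def score_candidates(apps, batch_size=100, workers=8):
    # Split candidates into smaller batches and score them in parallel,
    # instead of one large single-batch inference step.
    batches = [apps[i:i + batch_size] for i in range(0, len(apps), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(score_batch, batches)
    scores = [s for batch in results for s in batch]
    # Rank apps from the highest score to the lowest.
    return sorted(zip(apps, scores), key=lambda p: p[1], reverse=True)

ranked = score_candidates([f"app_{i}" for i in range(1000)])
print(ranked[:3])
```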
5. EXPERIMENT RESULTS
To evaluate the effectiveness of Wide & Deep learning in a real-world recommender system, we ran live experiments and evaluated the system in a couple of aspects: app acquisitions and serving performance.
5.1 App Acquisitions
We conducted live online experiments in an A/B testing framework for 3 weeks. For the control group, 1% of users were randomly selected and presented with recommendations generated by the previous version of the ranking model, which is a highly-optimized wide-only logistic regression model with rich cross-product feature transformations. For the experiment group, 1% of users were presented with recommendations generated by the Wide & Deep model, trained with the same set of features. As shown in Table 1, the Wide & Deep model improved the app acquisition rate on the main landing page of the app store by +3.9% relative to the control group (statistically significant). The results were also compared with another 1% group using only the deep part of the model with the same features and neural network structure, and the Wide & Deep model had a +1% gain on top of the deep-only model (statistically significant).
Besides online experiments, we also report the Area Under the Receiver Operating Characteristic Curve (AUC) on a holdout set offline. While Wide & Deep has a slightly higher offline AUC, the impact is more significant on online traffic. One possible reason is that the impressions and labels in offline data sets are fixed, whereas the online system can generate new exploratory recommendations by blending generalization with memorization, and learn from new user responses.
5.2 Serving Performance
Serving with high throughput and low latency is challenging with the high level of traffic faced by our commercial mobile app store. At peak traffic, our recommender servers score over 10 million apps per second. With single threading, scoring all candidates in a single batch takes 31 ms. We implemented multithreading and split each batch into smaller sizes, which significantly reduced the client-side latency to 14 ms (including serving overhead) as shown in Table 2.
6. RELATED WORK
The idea of combining wide linear models with crossproduct feature transformations and deep neural networks with dense embeddings is inspired by previous work, such as factorization machines [5] which add generalization to linear models by factorizing the interactions between two variables as a dot product between two low-dimensional embedding vectors. In this paper, we expanded the model capacity by learning highly nonlinear interactions between embeddings via neural networks instead of dot products.
In language models, joint training of recurrent neural networks (RNNs) and maximum entropy models with n-gram features has been proposed to significantly reduce the RNN complexity (e.g., hidden layer sizes) by learning direct weights between inputs and outputs [4]. In computer vision, deep residual learning [2] has been used to reduce the difficulty of training deeper models and improve accuracy with shortcut connections which skip one or more layers. Joint training of neural networks with graphical models has also been applied to human pose estimation from images [6]. In this work we explored the joint training of feed-forward neural networks and linear models, with direct connections between sparse features and the output unit, for generic recommendation and ranking problems with sparse input data.
In the recommender systems literature, collaborative deep learning has been explored by coupling deep learning for content information and collaborative filtering (CF) for the ratings matrix [7]. There has also been previous work on mobile app recommender systems, such as AppJoy which used CF on users’ app usage records [8]. Different from the CF-based or content-based approaches in the previous work, we jointly train Wide & Deep models on user and impression data for app recommender systems.
7. CONCLUSION
Memorization and generalization are both important for recommender systems. Wide linear models can effectively memorize sparse feature interactions using cross-product feature transformations, while deep neural networks can generalize to previously unseen feature interactions through low-dimensional embeddings. We presented the Wide & Deep learning framework to combine the strengths of both types of model. We productionized and evaluated the framework on the recommender system of Google Play, a massive-scale commercial app store. Online experiment results showed that the Wide & Deep model led to significant improvement on app acquisitions over wide-only and deep-only models.
8. REFERENCES
[1] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, July 2011.
[2] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[3] H. B. McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In Proc. AISTATS, 2011.
[4] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. H. Cernocky. Strategies for training large scale neural network language models. In IEEE Automatic Speech Recognition & Understanding Workshop, 2011.
[5] S. Rendle. Factorization machines with libFM. ACM Trans. Intell. Syst. Technol., 3(3):57:1–57:22, May 2012.
[6] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, NIPS, pages 1799–1807. 2014.
[7] H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning for recommender systems. In Proc. KDD, pages 1235–1244, 2015.
[8] B. Yan and G. Chen. AppJoy: Personalized mobile application discovery. In MobiSys, pages 113–126, 2011.
This work is licensed under a CC license; reposts must credit the author and link to the original article.
First published on my blog, Stray_Camel(^U^)ノ~YO.