How to Build a Content-Based Recommender System for The New York Times

Published by 4Paradigm 先薦 on 2019-02-27

Original URL: https://flycode.co/archives/285166

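We are helping The New York Times (NYT) build a content-based recommender system, which you can treat as a very simple worked example of recommender development. Based on a user's recent article-browsing data, we recommend new articles worth reading. To do that, we only need to take the text of the article the user read and recommend similar content.

Inspecting the data

Below is an excerpt from the first NYT article in the dataset, after text processing.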

'TOKYO — State-backed Japan Bank for International Cooperation [JBIC.UL] will lend about 4 billion yen ($39 million) to Russia's Sberbank, which is subject to Western sanctions, in the hope of advancing talks on a territorial dispute, the Nikkei business daily said on Saturday, [...]'

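The first question is how to vectorize this content, and whether to engineer new features such as parts-of-speech tags, n-grams, sentiment scores, or named entities.

The NLP tunnel clearly runs deep, and one could spend a great deal of time experimenting with existing approaches. But real science usually starts by trying the simplest viable solution, so that later iterations can improve on it.

In this article we implement that simple, viable solution.

Splitting the data

We first do some standard preprocessing: select the relevant features from the database, shuffle them, and split them into a training set and a test set.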

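from sklearn.utils import shuffle

# df is assumed to be a pandas DataFrame already loaded with the NYT articles
# move articles to an array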
articles = df.body.values

# move article section names to an array
sections = df.section_name.values

# move article web_urls to an array
web_url = df.web_url.values

# shuffle these three arrays in unison
articles, sections, web_url = shuffle(articles, sections, web_url, random_state=4)

# split the shuffled articles into two arrays
n = 10

# one will have all but the last 10 articles -- think of this as your training set/corpus 
X_train = articles[:-n]
X_train_urls = web_url[:-n]
X_train_sections = sections[:-n]

# the other will have those last 10 articles -- think of this as your test set/corpus 
X_test = articles[-n:]
X_test_urls = web_url[-n:]
X_test_sections = sections[-n:]

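Text vectorizers

You can choose among several text vectorization schemes, such as Bag-of-Words (BoW), Tf-Idf, and Word2Vec.

One reason we chose Tf-Idf is that, unlike BoW, it gauges a term's importance using inverse document frequency in addition to term frequency.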

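For example, a word like "Obama" may appear only a few times in an article. Unlike words such as "a" or "the", which appear in nearly every article and convey little information, it shows up in only some of the articles in the corpus, so it should receive a higher weight than those ubiquitous words.

Because "Obama" is neither a stop word nor an everyday term, its presence signals that it is highly relevant to the article's topic.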
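To make the weighting concrete, here is a minimal pure-Python sketch of plain tf × idf. It is an illustration only: the toy sentences and the helper name `tfidf_weights` are invented for this example, and library vectorizers typically add smoothing and normalization that this sketch omits.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {token: weight} dict per document, using plain tf * idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    weights = []
    for toks in tokenized:
        term_counts = Counter(toks)
        weights.append({
            tok: (count / len(toks)) * math.log(n_docs / doc_freq[tok])
            for tok, count in term_counts.items()
        })
    return weights

# Hypothetical toy corpus, not taken from the NYT dataset
docs = [
    "the president met obama at the white house",
    "the markets fell as the yen rose",
    "the team won the game in the final minute",
]
weights = tfidf_weights(docs)

# "the" occurs in every document, so its idf is log(3/3) = 0
print(weights[0]["the"])                        # 0.0
print(weights[0]["obama"] > weights[0]["the"])  # True
```

Even though "the" is the most frequent token in the first sentence, its weight collapses to zero because it appears everywhere, while "obama" keeps a positive weight.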

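Similarity measure

There are several options for the similarity measure; here we compare Jaccard and cosine.

Jaccard similarity compares two sets and measures how many elements they share. Since we have already chosen Tf-Idf as the vectorizer, Jaccard is not a meaningful option, because it would ignore the weights. It might be useful if we had chosen BoW vectorization instead.

We therefore use cosine as the similarity measure.

Since Tf-Idf assigns a weight to every token in every article, we can take the dot product between the token weights of different articles.

If tokens such as "Obama" or "White House" have high weights in article A and also in article B, their product yields a larger similarity value than it would if the same tokens had low weights in article B.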
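As a rough illustration of why a set-based measure discards the information Tf-Idf provides, the sketch below computes Jaccard similarity on token sets. The function name `jaccard` and both example sentences are made up for this illustration.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical token lists for illustration
doc_a = "obama spoke at the white house".split()
doc_b = "obama obama obama visited the white house".split()

# 4 shared tokens out of 7 distinct tokens overall
print(jaccard(doc_a, doc_b))

# Repeating "obama" changes nothing: sets ignore counts and weights
print(jaccard(doc_a, doc_b) == jaccard(doc_a, set(doc_b)))  # True
```

How often a token occurs, or how heavily it is weighted, has no effect on the score, which is exactly the information a Tf-Idf representation is meant to carry.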
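The effect described above can be sketched with a small cosine function over sparse weight dictionaries. The weights below are invented for illustration and are not taken from the dataset; the helper name `cosine` is also hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {token: weight} vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical Tf-Idf weights for illustration
article_a = {"obama": 0.8, "white house": 0.5, "budget": 0.1}
b_high = {"obama": 0.7, "white house": 0.6, "senate": 0.2}  # shared tokens weigh a lot
b_low = {"obama": 0.1, "white house": 0.1, "senate": 0.9}   # shared tokens weigh little

print(cosine(article_a, b_high) > cosine(article_a, b_low))  # True
```

When the vectors are already scaled to unit length, the denominator is 1 and the cosine reduces to the plain dot product.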

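Building the recommender

Given the similarity scores between the article a user has read and every other article in the corpus (the training data), you can now write a function that returns the top N articles and start making recommendations.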

import numpy as np

def get_top_n_rec_articles(X_train_tfidf, X_train, test_article, X_train_sections, X_train_urls, n = 5):
    '''This function calculates similarity scores between a document and a corpus

       INPUT: vectorized document corpus, 2D sparse matrix
              text document corpus, 1D array
              vectorized user article, sparse matrix of shape (1, n_features)
              article section names, 1D array
              article URLs, 1D array
              number of articles to recommend, int

       OUTPUT: top n recommendations, 1D array
               top n corresponding section names, 1D array
               top n corresponding URLs, 1D array
               similarity scores between user article and entire corpus, sorted high to low, 1D array
              '''
    # calculate similarity between the corpus (i.e. the "train" data) and the user's article;
    # the dot product equals cosine similarity only if the Tf-Idf rows are L2-normalized
    similarity_scores = np.asarray(X_train_tfidf.dot(test_article.toarray().T)).ravel()
    # get similarity score indices, sorted from highest to lowest score
    sorted_indices = np.argsort(similarity_scores)[::-1]
    # get sorted similarity scores
    sorted_sim_scores = similarity_scores[sorted_indices]
    # get top n most similar documents
    top_n_recs = X_train[sorted_indices[:n]]
    # get top n corresponding document section names
    rec_sections = X_train_sections[sorted_indices[:n]]
    # get top n corresponding urls
    rec_urls = X_train_urls[sorted_indices[:n]]

    # return recommendations and corresponding article meta-data
    return top_n_recs, rec_sections, rec_urls, sorted_sim_scores

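The function performs the following steps:

1. Compute the similarity between the user's article and the corpus.

2. Sort the similarity scores from highest to lowest.

3. Take the top N most similar articles.

4. Look up the section names and URLs of those top N articles.

5. Return the top N articles, their section names and URLs, and the scores.

Validating the results

We can now recommend articles based on what a user is currently reading and check whether the results make sense.

Next, let's compare the user's article and its section name with the recommended articles and their section names.

First, look at the similarity scores.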

# similarity scores
sorted_sim_scores[:5]
# OUTPUT:
# 0.566
# 0.498
# 0.479
# .
# .

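For non-negative Tf-Idf vectors, cosine similarity ranges from 0 to 1, so these scores are not particularly high. How could we raise them? We could pick a different vectorizer, such as Doc2Vec, or try a different similarity measure. Even so, let's look at the results.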

# user's article's section name
X_test_sections[k]
# OUTPUT:
'U.S'

# corresponding section names for top n recs 
rec_sections
# OUTPUT:
'World'
'U.S'
'World'
'World'
'U.S.'

The results show that the recommended section names are appropriate.

# user's article
X_test[k]
'LOS ANGELES — The White House says President Barack Obama has told the Defense Department that it must ensure service members instructed to repay enlistment bonuses are being treated fairly and expeditiously.\nWhite House spokesman Josh Earnest says the president only recently become aware of Pentagon demands that some soldiers repay their enlistment bonuses after audits revealed overpayments by the California National Guard. If soldiers refuse, they could face interest charges, wage garnishments and tax liens.\nEarnest says he did not believe the president was prepared to support a blanket waiver of those repayments, but he said "we're not going to nickel and dime" service members when they get back from serving the country. He says they should not be held responsible for fraud perpetrated by others.'

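All five recommended articles are related to the article the reader is currently viewing, which suggests the recommender behaves as expected.

A note on validation

The ad hoc validation, comparing the recommended texts and section names, indicates that the recommender works as intended.

Manual validation worked well enough, but ultimately we want a fully automated system that can be put into a model and validate itself.

How to put this recommender into a model is beyond the scope of this article, which aims to show how to prototype such a recommender on a real dataset.

The original author is data scientist Alexander Barriga. This version was compiled by 先薦, a China-based intelligent recommendation platform specializing in personalized recommendations, with some abridgement. Please credit the source when reposting.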
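One possible way to turn the section-name comparison into a number is sketched below. This is an assumption about how such a check could look, not the author's method; the function name `section_match_rate` and the labels are hypothetical.

```python
def section_match_rate(user_section, rec_sections):
    """Fraction of recommended articles whose section equals the user's section."""
    if not rec_sections:
        return 0.0
    matches = sum(1 for section in rec_sections if section == user_section)
    return matches / len(rec_sections)

# Hypothetical labels for illustration only
user_section = "Sports"
rec_sections = ["Sports", "Sports", "World", "Sports", "Arts"]

print(section_match_rate(user_section, rec_sections))  # 0.6
```

Exact matching is a strict criterion: a recommendation from a neighboring section can still be relevant to the reader, so a low rate would call for a closer look rather than prove the recommendations are poor.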
