I had long known LDA was a good tool, but I had never actually used it myself. There are a great many articles about it; I recommend the book 《LDA漫遊指南》. Recently, while studying document topic distributions and entity alignment, I tried some simple experiments with LDA. This article covers the basics of using LDA in Python; I hope it helps. If there are errors or omissions, please bear with me~
I. Download and Installation
Recommended places to get an LDA implementation include the following; these first three are the most commonly used:
gensim: https://radimrehurek.com/gensim/models/ldamodel.html
lda (pip install lda): https://github.com/ariddell/lda
scikit-learn official documentation: LatentDirichletAllocation
For a scikit-learn code example, see "Topic extraction with NMF and Latent Dirichlet Allocation" (a minimal sketch of that API follows the sample output below).
Part of its output is shown here, including the top words each topic contains:
Loading dataset...
Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
done in 0.733s.

Topics in LDA model:
Topic #0:
000 war list people sure civil lot wonder say religion america accepted punishment bobby add liberty person kill concept wrong
Topic #1:
just reliable gods consider required didn war makes little seen faith default various civil motto sense currency knowledge belief god
Topic #2:
god omnipotence power mean rules omnipotent deletion policy non nature suppose definition given able goal nation add place powerful leaders
....
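For orientation, here is a minimal sketch of that scikit-learn API as it stood around the time of writing (the toy corpus docs is my own placeholder; the linked example uses the 20 newsgroups dataset, and the n_topics parameter was later renamed n_components):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for the 20 newsgroups data used in the linked example.
docs = ["the cat sat on the mat",
        "dogs and cats make good pets",
        "the pope visited the church in rome"]

# LDA works on raw term counts, so use CountVectorizer rather than tf-idf.
vectorizer = CountVectorizer(stop_words='english')
tf = vectorizer.fit_transform(docs)

lda_model = LatentDirichletAllocation(n_topics=2, max_iter=10, random_state=0)
lda_model.fit(tf)

# components_ holds the topic-word weights; print each topic's top words.
feature_names = vectorizer.get_feature_names()
for idx, topic in enumerate(lda_model.components_):
    top = [feature_names[i] for i in topic.argsort()[:-6:-1]]
    print("Topic #%d: %s" % (idx, " ".join(top)))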
The following three are also worth a look if you have time:
https://github.com/arongdari/python-topic-model
https://github.com/shuyo/iir/tree/master/lda
https://github.com/a55509432/python-LDA
I have also tried the third one, by a55509432; its model output files are:
model_parameter.dat   the parameters the model was trained with
wordidmap.dat         the word-to-id mapping, used mainly for top-N lookups
model_twords.dat      the top-N high-frequency words for each topic
model_tassgin.dat     the topic assigned to each word of each document, in the format word_id:topic_id
model_theta.dat       the document-topic distribution; one line per document, with "prob1 prob2 ..." giving the probability that the document belongs to each topic
model_phi.dat         the topic-word distribution, a K*M matrix where K is the number of topics and M is the total number of words over all documents (i.e. the vocabulary size)
It works fine on short texts, but with large amounts of text almost every line of the output document-topic distribution contains long runs of identical values, so the code may still have bugs.
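If you want to check that behaviour yourself, here is a minimal sketch of loading model_theta.dat under the format described above (my own illustration; the actual files may use a different delimiter, so adjust as needed):

import numpy as np

# One document per line, whitespace-separated topic probabilities.
theta = np.loadtxt("model_theta.dat")   # shape: (num_docs, num_topics)
best_topic = theta.argmax(axis=1)       # most probable topic per document
for doc_id in range(min(10, theta.shape[0])):
    print("doc: %d topic: %d" % (doc_id, best_topic[doc_id]))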
The rest of this article walks through installing via pip install lda and using it in code.
Reference: [python] Installing numpy+scipy+matplotlib+scikit-learn and troubleshooting
II. Official Documentation
This section draws mainly on the following links, which I strongly recommend reading:
Official documentation: https://github.com/ariddell/lda
lda: Topic modeling with latent Dirichlet Allocation
Getting started with Latent Dirichlet Allocation in Python - sandbox
[Translation] Using LDA to process text in Python - letiantian
Text analysis in practice with TF-IDF/LDA/Word2vec - vs412237401
1. Loading the data
import numpy as np
import lda
import lda.datasets

X = lda.datasets.load_reuters()
print("type(X): {}".format(type(X)))
print("shape: {}\n".format(X.shape))
print(X[:5, :5])

vocab = lda.datasets.load_reuters_vocab()
print("type(vocab): {}".format(type(vocab)))
print("len(vocab): {}\n".format(len(vocab)))
print(vocab[:5])

titles = lda.datasets.load_reuters_titles()
print("type(titles): {}".format(type(titles)))
print("len(titles): {}\n".format(len(titles)))
print(titles[:5])
After loading the dataset that ships with the lda package, the output is shown below:
X is a 395*4258 matrix: 395 documents and 4258 vocabulary words, with each entry counting how many times a word occurs in a document (its term frequency); the X[:5, :5] corner is printed.
vocab holds the 4258 actual words, one per column of X; its first five entries are printed, so column 0 of X corresponds to 'church' and its values are that word's counts.
titles holds the titles of the 395 articles; titles 0 through 4 are printed.
type(X): <type 'numpy.ndarray'>
shape: (395L, 4258L)

[[ 1  0  1  0  0]
 [ 7  0  2  0  0]
 [ 0  0  0  1 10]
 [ 6  0  1  0  0]
 [ 0  0  0  2 14]]

type(vocab): <type 'tuple'>
len(vocab): 4258

('church', 'pope', 'years', 'people', 'mother')

type(titles): <type 'tuple'>
len(titles): 395

('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20',
 '1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21',
 "2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23",
 '3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25',
 '4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25')
From the above we can see that there are 395 news items (documents) and a vocabulary of size 4258. The document-term matrix, X, has a count of the number of occurrences of each of the 4258 vocabulary words for each of the 395 documents.
Below we inspect one entry, document 0 and word 3117, i.e. X[0, 3117]; the printout shows the count together with the corresponding word and document title:

doc_id = 0
word_id = 3117
print("doc id: {} word id: {}".format(doc_id, word_id))
print("-- count: {}".format(X[doc_id, word_id]))
print("-- word : {}".format(vocab[word_id]))
print("-- doc : {}".format(titles[doc_id]))
2. Training the model
Here we use 20 topics and 500 iterations:

model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(X)
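As a rough convergence check, the lda package also exposes the complete log likelihood of the fitted model via loglikelihood() (per its documentation; fit() additionally logs the likelihood every few iterations if Python logging is configured). A quick sketch:

# Higher (less negative) is better; it should level off once
# the Gibbs sampler has converged.
print("log likelihood: {}".format(model.loglikelihood()))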
3. Topic-word distribution
The code below looks up the weight of the three words 'church', 'pope', and 'years' in each of the 20 topics (n_topics=20), and then checks that the distribution of each of the first five topics sums to 1:
topic_word = model.topic_word_
print("type(topic_word): {}".format(type(topic_word)))
print("shape: {}".format(topic_word.shape))
print(vocab[:3])
print(topic_word[:, :3])

for n in range(5):
    sum_pr = sum(topic_word[n, :])
    print("topic: {} sum: {}".format(n, sum_pr))
The output:
type(topic_word): <type 'numpy.ndarray'>
shape: (20L, 4258L)
('church', 'pope', 'years')

[[  2.72436509e-06   2.72436509e-06   2.72708945e-03]
 [  2.29518860e-02   1.08771556e-06   7.83263973e-03]
 [  3.97404221e-03   4.96135108e-06   2.98177200e-03]
 [  3.27374625e-03   2.72585033e-06   2.72585033e-06]
 [  8.26262882e-03   8.56893407e-02   1.61980569e-06]
 [  1.30107788e-02   2.95632328e-06   2.95632328e-06]
 [  2.80145003e-06   2.80145003e-06   2.80145003e-06]
 [  2.42858077e-02   4.66944966e-06   4.66944966e-06]
 [  6.84655429e-03   1.90129250e-06   6.84655429e-03]
 [  3.48361655e-06   3.48361655e-06   3.48361655e-06]
 [  2.98781661e-03   3.31611166e-06   3.31611166e-06]
 [  4.27062069e-06   4.27062069e-06   4.27062069e-06]
 [  1.50994982e-02   1.64107142e-06   1.64107142e-06]
 [  7.73480150e-07   7.73480150e-07   1.70946848e-02]
 [  2.82280146e-06   2.82280146e-06   2.82280146e-06]
 [  5.15309856e-06   5.15309856e-06   4.64294180e-03]
 [  3.41695768e-06   3.41695768e-06   3.41695768e-06]
 [  3.90980357e-02   1.70316633e-03   4.42279319e-03]
 [  2.39373034e-06   2.39373034e-06   2.39373034e-06]
 [  3.32493234e-06   3.32493234e-06   3.32493234e-06]]

topic: 0 sum: 1.0
topic: 1 sum: 1.0
topic: 2 sum: 1.0
topic: 3 sum: 1.0
topic: 4 sum: 1.0
4. Top-N words per topic
The code below prints the top five words of each topic; the slice [:-(n+1):-1] walks the argsort result backwards, yielding the n highest-weight words in descending order:

n = 5
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n+1):-1]
    print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))
The output:
*Topic 0
- government british minister west group
*Topic 1
- church first during people political
*Topic 2
- elvis king wright fans presley
*Topic 3
- yeltsin russian russia president kremlin
*Topic 4
- pope vatican paul surgery pontiff
*Topic 5
- family police miami versace cunanan
*Topic 6
- south simpson born york white
*Topic 7
- order church mother successor since
*Topic 8
- charles prince diana royal queen
*Topic 9
- film france french against actor
*Topic 10
- germany german war nazi christian
*Topic 11
- east prize peace timor quebec
*Topic 12
- n't told life people church
*Topic 13
- years world time year last
*Topic 14
- mother teresa heart charity calcutta
*Topic 15
- city salonika exhibition buddhist byzantine
*Topic 16
- music first people tour including
*Topic 17
- church catholic bernardin cardinal bishop
*Topic 18
- harriman clinton u.s churchill paris
*Topic 19
- century art million museum city
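As a small extension (my own addition, not in the original example), the same loop can print each top word together with its weight, which shows how peaked each topic is:

# Print each topic's top-5 words together with their probabilities
# (reuses n=5 and topic_word from the code above).
for i, topic_dist in enumerate(topic_word):
    top_idx = np.argsort(topic_dist)[:-(n + 1):-1]
    pairs = ["{} ({:.4f})".format(vocab[j], topic_dist[j]) for j in top_idx]
    print("*Topic {}: {}".format(i, ", ".join(pairs)))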
5. Document-topic distribution
Compute the most likely topic for each of the first ten documents:

doc_topic = model.doc_topic_
print("type(doc_topic): {}".format(type(doc_topic)))
print("shape: {}".format(doc_topic.shape))

for n in range(10):
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}".format(n, topic_most_pr))
The output:
type(doc_topic): <type 'numpy.ndarray'>
shape: (395L, 20L)
doc: 0 topic: 8
doc: 1 topic: 1
doc: 2 topic: 14
doc: 3 topic: 8
doc: 4 topic: 14
doc: 5 topic: 14
doc: 6 topic: 14
doc: 7 topic: 14
doc: 8 topic: 14
doc: 9 topic: 8
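For documents that were not in the training set, recent versions of the lda package also provide a transform() method that infers their topic mixtures (check your installed version). A sketch, where I simply reuse two training rows as stand-in "new" documents, since a real X_new must be counted over the same 4258-word vocabulary:

# Infer topic distributions for documents outside the training set;
# X_new must share the training vocabulary (same columns, same order).
X_new = X[:2, :]
doc_topic_new = model.transform(X_new)
print(doc_topic_new.shape)   # expected (2, 20): one distribution per document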
6. Two plots
See the original English article for details; the first plot shows how word weights are distributed within selected topics:
import matplotlib.pyplot as plt

f, ax = plt.subplots(5, 1, figsize=(8, 6), sharex=True)
for i, k in enumerate([0, 5, 9, 14, 19]):
    ax[i].stem(topic_word[k, :], linefmt='b-',
               markerfmt='bo', basefmt='w-')
    ax[i].set_xlim(-50, 4350)
    ax[i].set_ylim(0, 0.08)
    ax[i].set_ylabel("Prob")
    ax[i].set_title("topic {}".format(k))

ax[4].set_xlabel("word")

plt.tight_layout()
plt.show()
(Figure: stem plots of the topic-word distributions for topics 0, 5, 9, 14, and 19; image not reproduced here.)
The second plot shows which topics individual documents concentrate on:
import matplotlib.pyplot as plt

f, ax = plt.subplots(5, 1, figsize=(8, 6), sharex=True)
for i, k in enumerate([1, 3, 4, 8, 9]):
    ax[i].stem(doc_topic[k, :], linefmt='r-',
               markerfmt='ro', basefmt='w-')
    ax[i].set_xlim(-1, 21)
    ax[i].set_ylim(0, 1)
    ax[i].set_ylabel("Prob")
    ax[i].set_title("Document {}".format(k))

ax[4].set_xlabel("Topic")

plt.tight_layout()
plt.show()
(Figure: stem plots of the document-topic distributions for documents 1, 3, 4, 8, and 9; image not reproduced here.)
III. Summary
This article is an introductory tour of LDA in Python; the next one will work with real txt content, covering word segmentation, computing document-topic distributions, and so on, including how to compute term frequencies (tf) and tf-idf in Python.
One caveat: fit() kept raising "TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'". The lda package expects a matrix of integer term counts, so float input such as tf-idf weights triggers this error; I therefore switched to sklearn's term-frequency (count) extraction:
http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
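A minimal sketch of that workaround (my own illustration with a toy corpus): CountVectorizer produces the integer term counts that lda.LDA.fit() requires, whereas tf-idf weights are floats and trigger the cast error:

from sklearn.feature_extraction.text import CountVectorizer
import lda

corpus = ["the pope visited the church",
          "the prince and the queen",
          "mother teresa blessed the nuns"]

# toarray() yields a dense integer count matrix, which lda.LDA.fit() accepts.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(corpus).toarray()

model = lda.LDA(n_topics=2, n_iter=100, random_state=1)
model.fit(X_counts)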
In short, I hope this article helps you, especially readers who are just getting started with machine learning, sklearn, or LDA. I am really just a layman myself, with no systematic training in machine learning, so I know exactly what it feels like not to know how to use an algorithm; once you are familiar with one, it suddenly seems simple. Helping you over that first hurdle is the point of this article.
Finally, many thanks to the authors of the articles linked above for sharing their work. Please bear with any remaining shortcomings~
(By: Eastmount, 2016-03-17, 3:30 a.m. http://blog.csdn.net/eastmount/ )