Environment: Ubuntu 12.04, gensim, jieba

The Chinese corpus is 24M in size:
jerry@hq:/u01/jerry/Reduced$ ls
C000008 C000010 C000013 C000014 C000016 C000020 C000022 C000023 C000024

Category of each folder:
C000007 Automobiles
C000008 Finance
C000010 IT
C000013 Health
C000014 Sports
C000016 Travel
C000020 Education
C000022 Recruitment
C000023 Culture
C000024 Military

The steps are as follows:

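import os

import jieba
from gensim import corpora, models, similarities

train_set = []

# Walk the corpus directory and segment every file into words with jieba.
for root, dirs, files in os.walk('/u01/jerry/Reduced'):
    for name in files:
        with open(os.path.join(root, name), 'r') as f:
            raw = f.read()
        # cut_all=False selects jieba's accurate mode rather than full mode.
        word_list = list(jieba.cut(raw, cut_all=False))
        train_set.append(word_list)

# Map words to integer ids and convert each document to bag-of-words.
dic = corpora.Dictionary(train_set)
corpus = [dic.doc2bow(text) for text in train_set]
# Reweight the counts with TF-IDF, then train a 10-topic LDA model on them.
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lda = models.LdaModel(corpus_tfidf, id2word=dic, num_topics=10)
# Topic distribution of every document under the trained model.
corpus_lda = lda[corpus_tfidf]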
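
The corpus_lda object built in the last step is not inspected anywhere in this post. As a minimal sketch of how it could be used: applying a gensim LDA model to a document yields a list of (topic_id, probability) pairs, so the most likely topic of a document can be picked with max. The helper name dominant_topic and the sample numbers are made up for illustration.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability.

    doc_topics is a list of (topic_id, probability) pairs, the shape gensim
    yields for one document. An empty list gives None.
    """
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])


# Hypothetical topic distribution for a single document (made-up numbers).
sample = [(2, 0.10), (4, 0.75), (8, 0.15)]
print(dominant_topic(sample))
```

With the variables from the steps above, [dominant_topic(d) for d in corpus_lda] would give one pair per document, which could then be compared against the folder each file came from.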

>>> for i in range(0, 10):
...     print(lda.print_topic(i))
...
0.000*康寧 + 0.000*sohu2 + 0.000*wmv + 0.000*bbn7 + 0.000*mmst + 0.000*cid + 0.000*icp + 0.000*沙塵 + 0.000*性騷擾 + 0.000*烏里韋
0.000*media + 0.000*mid + 0.000*stream + 0.000*bbn7 + 0.000*mmst + 0.000*sohu2 + 0.000*cid + 0.000*icp + 0.000*wmv + 0.000*that
0.012* + 0.000*米蘭 + 0.000*老闆 + 0.000*男人 + 0.000*女人 + 0.000*她 + 0.000*小說 + 0.000*病人 + 0.000*我 + 0.000*女性
0.002*& + 0.002*nbsp + 0.001*０ + 0.001*; + 0.001*西安 + 0.001*報名 + 0.001*１ + 0.001*∶ + 0.001*00 + 0.001*５
0.002*手機 + 0.002*孩子 + 0.001*球 + 0.001*國家隊 + 0.001*勝 + 0.001*教練 + 0.001*; + 0.001*名單 + 0.001*閱讀 + 0.001*高校
0.001*' + 0.000* + 0.000*= + 0.000*var + 0.000*height + 0.000*width + 0.000*NewWin + 0.000*} + 0.000*{ + 0.000*+
0.003* + 0.002*比賽 + 0.002*我 + 0.002*　 + 0.001*; + 0.001*- + 0.001*， + 0.001*他 + 0.001*& + 0.001*―
0.000*航班 + 0.000*勞動合同 + 0.000*最低工資 + 0.000*農民工 + 0.000*養老保險 + 0.000*勞動者 + 0.000*用人單位 + 0.000*養老 + 0.000*上調 + 0.000*錦江
0.000*皮膚 + 0.000*碘 + 0.000*食物 + 0.000*維生素 + 0.000*營養 + 0.000*皮膚 + 0.000*蛋白質 + 0.000*藥物 + 0.000*症狀 + 0.000*體內
0.000* + 0.000*EMC + 0.000*包機 + 0.000*基金 + 0.000*陸純初 + 0.000*南越 + 0.000*Kashya + 0.000*西沙群島 + 0.000*Clariion + 0.000*西沙

The resulting topic model does not look very good; the value of the num_topics parameter may need to be increased.
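
Several of the topics above are dominated by punctuation, whitespace, digits, and markup leftovers such as nbsp, so besides raising num_topics it may help to drop such tokens before building the dictionary. The following is only a sketch under that assumption: clean_tokens and EXTRA_STOPWORDS are hypothetical names, and the rule used here (keep a token only if it contains at least one letter or CJK character) is one simple choice, not something taken from the experiment above.

```python
import unicodedata

# Hypothetical stop list: markup leftovers seen in the topic output.
EXTRA_STOPWORDS = {"nbsp"}


def clean_tokens(tokens, stopwords=EXTRA_STOPWORDS):
    """Drop tokens that are empty, contain no letters, or are stop words."""
    kept = []
    for tok in tokens:
        # Unicode categories starting with 'L' cover Latin letters and CJK characters.
        has_letter = any(unicodedata.category(ch).startswith("L") for ch in tok)
        if has_letter and tok not in stopwords:
            kept.append(tok)
    return kept


# Sample tokens similar to those that show up in the topics above.
print(clean_tokens(["手機", "&", "nbsp", ";", "比賽", "００", "var"]))
```

In the steps above this could be applied as word_list = clean_tokens(jieba.cut(raw, cut_all=False)), after which num_topics can be raised in the models.LdaModel call (for example to 20, an arbitrary value) and the topics printed again for comparison.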

Chinese topic modeling (LDA) with Gensim
