Step 1: Build the corpus:
import os
import codecs
import jieba
from gensim import corpora

sourceDataDir = 'data'


def getSourceFileLists(sourceDataDir):
    # Collect every file under the two-level layout data/<sub-dir>/<file>
    fileLists = []
    subDirList = os.listdir(sourceDataDir)
    for subDir in subDirList:
        subList = os.listdir(os.path.join(sourceDataDir, subDir))
        fileList = [os.path.join(sourceDataDir, subDir, x) for x in subList
                    if os.path.isfile(os.path.join(sourceDataDir, subDir, x))]
        fileLists += fileList
    return fileLists


fileLists = getSourceFileLists(sourceDataDir)

if 0 < len(fileLists):
    punctuations = ['', '\n', '\t', ',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']

    if not os.path.exists('dict'):
        os.mkdir('dict')
    if not os.path.exists('corpus'):
        os.mkdir('corpus')

    for fileName in fileLists:
        print fileName

        hFile = None
        content = None
        try:
            # The source novels are GB18030-encoded text files
            hFile = codecs.open(fileName, 'r', 'gb18030')
            content = hFile.readlines()
        except Exception as e:
            print e
        finally:
            if hFile:
                hFile.close()

        if content:
            # Segment the whole file with jieba (full mode) and drop punctuation
            fileFenci = [x for x in jieba.cut(' '.join(content), cut_all=True)]
            fileFenci2 = [word for word in fileFenci if word not in punctuations]

            texts = [fileFenci2]

            # Drop tokens that occur only once in this document
            all_tokens = sum(texts, [])
            tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
            texts = [[word for word in text if word not in tokens_once] for text in texts]

            sFileDir, sFileName = os.path.split(fileName)
            dictFileName = 'dict/' + sFileName + '.dict'
            corpusFileName = 'corpus/' + sFileName + '.mm'

            # One dictionary and one single-document corpus per source file
            dictionary = corpora.Dictionary(texts)
            dictionary.save_as_text(dictFileName)

            corpus = [dictionary.doc2bow(text) for text in texts]
            corpora.MmCorpus.serialize(corpusFileName, corpus)

    print 'Build corpus done'
Data source:
83 novels from http://d1.txthj.com/newrar/txthj_264.rar, unpacked under the directory ./data/.
They are loaded as a two-level directory structure.
Output:
./dict and ./corpus
In the corresponding directory, xxx.dict and xxx.mm are generated, where xxx is the full name of the original file (without the path, with the extension).
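As a quick sanity check, the generated files can be loaded back with gensim's own loaders. Below is a minimal sketch; the file name novel.txt is only a placeholder for one of the 83 source files:

from gensim import corpora

# 'novel.txt' is a placeholder for one of the source file names
dictionary = corpora.Dictionary.load_from_text('dict/novel.txt.dict')
corpus = corpora.MmCorpus('corpus/novel.txt.mm')

print len(dictionary)   # number of distinct tokens kept for this file
print len(corpus)       # 1, since each .mm file stores a single document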
Step 2: Load the corpus and run similarity analysis
# -*- coding: utf-8 -*-
import os
from gensim import corpora, models, similarities


def getFileList(dir):
    return [dir + x for x in os.listdir(dir)]

dictLists = getFileList('./dict/')


class LoadDictionary(object):
    """Stream the per-file dictionaries saved under ./dict/."""
    def __init__(self, dictionary):
        self.dictionary = dictionary

    def __iter__(self):
        for dictFile in dictLists:
            sFileRaw, sFilePostfix = os.path.splitext(dictFile)
            sFileDir, sFileName = os.path.split(sFileRaw)
            (dictFile, corpusFile) = ('./dict/' + sFileName + '.dict', './corpus/' + sFileName + '.mm')
            yield self.dictionary.load_from_text(dictFile)


class LoadCorpus(object):
    """Stream the per-file corpora saved under ./corpus/."""
    def __iter__(self):
        for dictFile in dictLists:
            sFileRaw, sFilePostfix = os.path.splitext(dictFile)
            sFileDir, sFileName = os.path.split(sFileRaw)
            (dictFile, corpusFile) = ('./dict/' + sFileName + '.dict', './corpus/' + sFileName + '.mm')
            yield corpora.MmCorpus(corpusFile)


def pre_process_cn(inputs, low_freq_filter=True):
    """Tokenize the query text: jieba keyword extraction, punctuation removal,
    stemming, and optional filtering of tokens that occur only once."""
    import nltk
    import jieba.analyse
    from nltk.tokenize import word_tokenize

    texts_tokenized = []
    for document in inputs:
        texts_tokenized_tmp = []
        for word in word_tokenize(document):
            texts_tokenized_tmp += jieba.analyse.extract_tags(word, 10)
        texts_tokenized.append(texts_tokenized_tmp)

    texts_filtered_stopwords = texts_tokenized

    # Remove punctuation
    english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']
    texts_filtered = [[word for word in document if word not in english_punctuations] for document in texts_filtered_stopwords]

    # Stemming (no effect on Chinese tokens; normalizes any English tokens)
    from nltk.stem.lancaster import LancasterStemmer
    st = LancasterStemmer()
    texts_stemmed = [[st.stem(word) for word in document] for document in texts_filtered]

    if low_freq_filter:
        all_stems = sum(texts_stemmed, [])
        stems_once = set(stem for stem in set(all_stems) if all_stems.count(stem) == 1)
        texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]
    else:
        texts = texts_stemmed
    return texts


# Load the dictionaries one by one; each yielded dictionary replaces the
# previous one, so 'dictionary' ends up holding the last file's dictionary
dictionary = corpora.dictionary.Dictionary()
dictionary_memory_friendly = LoadDictionary(dictionary)
for vector in dictionary_memory_friendly:
    dictionary = vector

# Each .mm file stores a single document; collect them into one corpus
corpus = []
corpus_memory_friendly = LoadCorpus()
for vector in corpus_memory_friendly:
    corpus.append(vector[0])

if 0 < len(corpus):
    # TF-IDF weighting followed by a 20-topic LSI model
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]

    model = models.LsiModel(corpus_tfidf, id2word=None, num_topics=20, chunksize=2000000)
    index = similarities.Similarity('./novel_', model[corpus], num_features=len(corpus))

    # Query text to match against the 83 novels
    target_courses = ['男人們的臉上沉重而冷凝,蒙著面紗的女人們則是發出斷斷續續的哭泣聲,他們無比專注地看著前方,見證一場生與死的拉鋸戰。']
    target_text = pre_process_cn(target_courses, low_freq_filter=False)

    # Project the query into LSI space and rank all documents by similarity
    ml_course = target_text[0]
    ml_bow = dictionary.doc2bow(ml_course)
    ml_lsi = model[ml_bow]
    sims = index[ml_lsi]

    sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])

    print sort_sims[0:10]
    print len(dictLists)
    print dictLists[sort_sims[1][0]]
    print dictLists[sort_sims[2][0]]
    print dictLists[sort_sims[3][0]]
Notes:
yield is used in the loader classes for better memory efficiency.
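The same pattern generalizes: any iterable that yields one vector at a time can be fed to gensim, so the full corpus never has to sit in memory. A minimal sketch, assuming a hypothetical texts.txt with one whitespace-tokenized document per line:

from gensim import corpora

class StreamedCorpus(object):
    # 'texts.txt' and the pre-built dictionary are assumptions for illustration
    def __init__(self, dictionary, path='texts.txt'):
        self.dictionary = dictionary
        self.path = path

    def __iter__(self):
        # Read, vectorize, and yield one document at a time
        for line in open(self.path):
            yield self.dictionary.doc2bow(line.split())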
Known issue:
Step 2 prints the following warning:
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:122: UserWarning: indices array has non-integer dtype (float64)
It does not affect the processing.
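If the message is distracting, it can be silenced with Python's standard warnings module before building the index; this is purely cosmetic and does not change the computation:

import warnings

# Hide the harmless dtype warning raised inside scipy.sparse
warnings.filterwarnings('ignore', message='indices array has non-integer dtype')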