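An earlier post counted how often each character appears in Jin Yong's novel 《倚天屠龍記》 (The Heaven Sword and Dragon Saber) and sorted the characters by frequency: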
https://www.cnblogs.com/becks/p/11421214.html
After that I used pyecharts to tally the danmaku (bullet comments) of a Bilibili video and draw them as a word cloud:
https://www.cnblogs.com/becks/p/14743080.html
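This time, let's count how often each character appears in 《三國演義》 (Romance of the Three Kingdoms) and draw the result as a word cloud, to see who the real protagonist is.
First we need to list the characters that appear in the book. We only want to count characters, so anything that is not a character name should be left out.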
角色 = {'劉備','諸葛亮','關羽','張飛','劉禪',"孫權",'趙雲','司馬懿','周瑜','曹操','袁紹','馬超','魏延', '黃忠','姜維','馬岱','龐德','孟獲','劉表','董卓','孫策', '魯肅','司馬昭','夏侯淵','王平','劉璋','袁術','呂蒙','甘寧','鄧艾','曹仁', '陸遜','許褚','龐統','曹洪','李典','曹丕','廖化','曹真','呂布'}
Next, read the txt file of 《三國演義》 prepared beforehand and segment its contents with the jieba library.
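# -*- coding: utf-8 -*-
import os
import sys
from collections import Counter  # word-frequency counting (imported, but not used below)

import jieba  # Chinese word segmentation
from pyecharts.charts import WordCloud  # word cloud chart

# Directory that contains this script
path = os.path.abspath(os.path.dirname(sys.argv[0]))
# Read the text of 《三國演義》
with open(os.path.join(path, '171182.txt'), "r", encoding='utf-8') as f:
    txt = f.read()
words = jieba.lcut(txt)  # segment the text into a list of tokens
counts = {}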
Then count how many times each of the listed character names appears.
for word in words:
    if len(word) <= 1:  # skip single-character tokens
        continue
    elif word in 角色:
        counts[word] = counts.get(word, 0) + 1
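The script imports `Counter` without using it, so here is a minimal sketch of the same tally written with `Counter`. The token list and the shortened name set are made up for illustration, and `count_names` is a hypothetical helper, not part of the original script.

```python
from collections import Counter

# Hypothetical stand-in for the output of jieba.lcut(txt)
sample_words = ["曹操", "曰", "劉備", "引兵", "曹操", "大喜", "關羽", "曹操"]
# Shortened, illustrative version of the 角色 set
names = {"曹操", "劉備", "關羽"}


def count_names(words, names):
    """Count tokens that are longer than one character and appear in names."""
    return Counter(w for w in words if len(w) > 1 and w in names)


name_counts = count_names(sample_words, names)
print(name_counts)
```

A `Counter` behaves like the `counts` dict above, so `list(name_counts.items())` yields the same kind of (name, count) pairs.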
Draw the word cloud
items = list(counts.items())  # dict to a list of (name, count) pairs
wordcloud = WordCloud()
wordcloud.add("", items, word_size_range=[15, 80], rotate_step=30, shape='cardioid')
wordcloud.render(os.path.join(path, 'wordcloud.html'))
Run the script, then open the generated file.
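To read off the ranking without opening the HTML file, the counts can also be sorted and printed as plain text. This is only a sketch: `rank_counts` is a hypothetical helper and the numbers in `demo_counts` are made up, not results from the book.

```python
def rank_counts(counts, top_n=None):
    """Return (name, count) pairs sorted by count, highest first."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Made-up numbers for illustration only
demo_counts = {"劉備": 5, "曹操": 9, "關羽": 3}

top_two = rank_counts(demo_counts, top_n=2)
for name, n in rank_counts(demo_counts):
    print(name, n)
```

In the real script you would pass the `counts` dict instead of `demo_counts`.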
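The name 曹操 (Cao Cao) is rendered largest, which means it appears more often than any other name in the book. That can't be right: 羅貫中 (Luo Guanzhong) was clearly a 劉備 (Liu Bei) fan.
Thinking it over, in the Three Kingdoms era calling someone by their bare name was an insult, a put-down. The so-called righteous characters all go by courtesy names or honorifics, such as 臥龍 or 諸葛.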
So I changed the code to match these alternative names as well.
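# Alternative names, grouped by the character they refer to.
# Note: as written, only the 曹操 set contains the character's own name.
# Add '劉備', '諸葛亮', '關羽' and '張飛' to their sets if the bare names should be counted too.
劉備 = {"玄德", "玄德曰", "先主", "劉豫州", "劉皇叔", '劉玄德', '劉使君'}
諸葛亮 = {"孔明", "孔明曰", "臥龍", "臥龍先生", "諸葛先生", '孔明先生', '諸葛丞相', '諸葛'}
關羽 = {"關公", "雲長", "漢壽亭侯", "關雲長"}
曹操 = {"孟德", '曹孟德', '曹操'}
張飛 = {"張翼德", '翼德'}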
The counting step was changed to go with it.
for word in words:  # filter the segmented tokens
    # A name has more than one character, so skip single-character tokens
    if len(word) <= 1:
        continue
    elif word in 劉備:
        word = "劉備"
        counts[word] = counts.get(word, 0) + 1
    elif word in 諸葛亮:
        word = "諸葛亮"
        counts[word] = counts.get(word, 0) + 1
    elif word in 曹操:
        word = "曹操"
        counts[word] = counts.get(word, 0) + 1
    elif word in 關羽:
        word = "關羽"
        counts[word] = counts.get(word, 0) + 1
    elif word in 張飛:
        word = "張飛"
        counts[word] = counts.get(word, 0) + 1
    elif word in 其他:  # remaining characters; the set is defined in the complete code
        counts[word] = counts.get(word, 0) + 1
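The long elif chain can also be written as one lookup table from every alias to a single canonical name. What follows is only a sketch under my own assumptions: `build_lookup`, `count_characters`, the shortened alias groups and the sample tokens are all illustrative, not part of the original script. Unlike the sets above, where only 曹操 lists its own name, this version always counts a character's own name too, so its totals need not match the figures quoted in this post.

```python
def build_lookup(alias_groups, plain_names):
    """Map every alias, canonical name and plain name to the name it is counted under."""
    lookup = {name: name for name in plain_names}
    for canonical, aliases in alias_groups.items():
        lookup[canonical] = canonical  # assumption: the bare name counts too
        for alias in aliases:
            lookup[alias] = canonical
    return lookup


def count_characters(words, lookup):
    """Tally tokens by canonical name, ignoring tokens that are not in the lookup."""
    counts = {}
    for word in words:
        canonical = lookup.get(word)
        if canonical is not None:
            counts[canonical] = counts.get(canonical, 0) + 1
    return counts


# Shortened alias groups, for illustration only
alias_groups = {
    "劉備": {"玄德", "玄德曰", "劉皇叔"},
    "諸葛亮": {"孔明", "孔明曰", "臥龍"},
    "曹操": {"孟德", "曹孟德"},
}
plain_names = {"趙雲", "周瑜"}

# Hypothetical tokens standing in for jieba.lcut(txt)
sample_words = ["玄德曰", "孔明", "劉備", "曹操", "孟德", "趙雲", "之", "臥龍"]
lookup = build_lookup(alias_groups, plain_names)
result = count_characters(sample_words, lookup)
print(result)
```

Adding another character then means adding one entry to `alias_groups` rather than another elif branch.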
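Running it again: well, 諸葛亮 (Zhuge Liang) comes out on top, with 1350 occurrences in total, while 劉備 has 1271.
The complete code is attached below.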
# -*- coding: utf-8 -*-
import os
import sys
from collections import Counter  # word-frequency counting (imported, but not used below)

import jieba  # Chinese word segmentation
from pyecharts.charts import WordCloud  # word cloud chart

# Directory that contains this script
path = os.path.abspath(os.path.dirname(sys.argv[0]))
# Read the text of the novel
with open(os.path.join(path, '三國演義.txt'), "r", encoding='utf-8') as f:
    txt = f.read()
words = jieba.lcut(txt)  # segment the text into a list of tokens
counts = {}

# Alternative names, grouped by the character they refer to
劉備 = {"玄德", "玄德曰", "先主", "劉豫州", "劉皇叔", '劉玄德', '劉使君'}
諸葛亮 = {"孔明", "孔明曰", "臥龍", "臥龍先生", "諸葛先生", '孔明先生', '諸葛丞相', '諸葛'}
關羽 = {"關公", "雲長", "漢壽亭侯", "關雲長"}
劉禪 = {"後主"}  # defined, but not used in the loop below
曹操 = {"孟德", '曹孟德', '曹操'}
張飛 = {"張翼德", '翼德'}
# Characters counted under their plain names only
其他 = {"孫權", '趙雲', '司馬懿', '周瑜', '劉禪', '袁紹', '馬超', '魏延', '黃忠', '姜維',
      '馬岱', '龐德', '孟獲', '劉表', '董卓', '孫策', '魯肅', '司馬昭', '夏侯淵', '王平',
      '劉璋', '袁術', '呂蒙', '甘寧', '鄧艾', '曹仁', '陸遜', '許褚', '龐統', '曹洪',
      '李典', '曹丕', '廖化', '曹真', '呂布'}

for word in words:  # filter the segmented tokens
    # A name has more than one character, so skip single-character tokens
    if len(word) <= 1:
        continue
    elif word in 劉備:
        word = "劉備"
        counts[word] = counts.get(word, 0) + 1
    elif word in 諸葛亮:
        word = "諸葛亮"
        counts[word] = counts.get(word, 0) + 1
    elif word in 曹操:
        word = "曹操"
        counts[word] = counts.get(word, 0) + 1
    elif word in 關羽:
        word = "關羽"
        counts[word] = counts.get(word, 0) + 1
    elif word in 張飛:
        word = "張飛"
        counts[word] = counts.get(word, 0) + 1
    elif word in 其他:
        counts[word] = counts.get(word, 0) + 1

items = list(counts.items())  # dict to a list of (name, count) pairs
wordcloud = WordCloud()
wordcloud.add("", items, word_size_range=[15, 80], rotate_step=30, shape='cardioid')
wordcloud.render(os.path.join(path, 'wordcloud.html'))