Exploring Python Data Analysis (1): The NLTK Library and Text Processing

Published 2016-01-17

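The search keywords users type are often vague and may contain many misspellings. We cannot feed such data directly into follow-up analysis such as classification or clustering; the first step is to stem the words.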

What is stemming?

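In linguistic morphology and information retrieval, stemming is the process of removing affixes to obtain a word's stem, its most general written form. The stem does not need to be identical to the word's morphological root; it is usually enough that related words map to the same stem, even if that stem is not itself a valid root. Stemming algorithms have been studied in computer science since 1968. Many search engines treat words that share a stem as synonyms, as a form of query expansion; this process is called conflation.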

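A stemmer for English, for example, should recognize that the strings “cats”, “catlike”, and “catty” are based on the root “cat”, and that “stemmer”, “stemming”, and “stemmed” are based on the root “stem”. A stemming algorithm can reduce the words “fishing”, “fished”, “fish”, and “fisher” to the same root, “fish”.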
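To make the idea concrete, here is a minimal sketch of suffix stripping in plain Python. It is a toy, not the Porter algorithm used later in this post: the `SUFFIXES` rule set and the `toy_stem` function are hypothetical names invented for illustration, and a real stemmer applies many more rules and conditions.

```python
# Toy suffix-stripping stemmer: illustrative only, NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "er", "s")  # hypothetical, deliberately tiny rule set


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# All four forms from the example above collapse to the same stem
stems = {w: toy_stem(w) for w in ["fishing", "fished", "fish", "fisher"]}
print(stems)
```

Because every variant maps to the same key, counts for “fishing” and “fished” can be added together, which is exactly what we want for noisy search keywords.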

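Choosing the technical approach

Python and R are the two main languages for data analysis. Compared with R, Python is a better fit for newcomers to data analysis who already have a solid programming background, especially programmers who already know Python. We therefore chose Python and the NLTK library (Natural Language Toolkit) as the basic framework for text processing. We also need a tool for presenting data. For a data analyst, the tedious work of installing a database, connecting to it, and creating tables is poorly suited to quick analysis, so we use Pandas as our tool for structured data and analysis.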

Environment setup

We are using Mac OS X, which ships with Python 2.7.

Install NLTK

sudo pip install nltk

Install Pandas

sudo pip install pandas

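In data analysis, what matters most is the result of the analysis. iPython notebook is an essential tool here: it saves the output of the code you run, such as data tables, so you can view it the next time you open the notebook without re-running anything.

Install iPython notebook

sudo pip install ipython

Create a working directory and start iPython notebook inside it. The server opens the page http://127.0.0.1:8080 and saves the notebooks you create under the working directory.

mkdir Codes
cd Codes
ipython notebook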

Text processing

Creating the data table

Using the sample data we collected, we build a DataFrame with Pandas: a 2D data structure with rows and columns.

from pandas import DataFrame
import pandas as pd

d = ['pets insurance', 'pets insure', 'pet insurance', 'pet insur', 'pet insurance"', 'pet insu']
df = DataFrame(d)
df.columns = ['Words']

Result:

	Words
0	pets insurance
1	pets insure
2	pet insurance
3	pet insur
4	pet insurance"
5	pet insu

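Introducing the NLTK tokenizer and stemmer

RegexpTokenizer: a regular-expression tokenizer that splits text using a regular expression; we will not cover it in detail.
PorterStemmer: a stemmer that implements the Porter stemming algorithm; for how it works, see http://snowball.tartarus.org/algorithms/english/stemmer.html
First, we create a regular-expression tokenizer that strips punctuation and other special characters:

import nltk
tokenizer = nltk.RegexpTokenizer(r'\w+')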
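To see what the pattern `\w+` (note the backslash) keeps and drops, here is a small sketch using only the standard `re` module. As far as I know, RegexpTokenizer with its default settings returns the matches of the pattern in much the same way as `re.findall`, but treat that equivalence as an assumption rather than a guarantee.

```python
import re

phrase = 'pet insurance"'

# \w+ matches runs of letters, digits and underscores,
# so the space and the stray quote are left out of the tokens.
tokens = re.findall(r"\w+", phrase)
print(tokens)  # ['pet', 'insurance']
```

This is why the stray quotation mark in row 4 of the sample data no longer matters after tokenization.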

Next, we prepare the data table by adding the column the stems will be written to and a count column that defaults to 1:

df["Stemming Words"] = ""
df["Count"] = 1

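Read the Words column and use the Porter stemmer to obtain the stems:

stemmer = nltk.PorterStemmer()

# Stem every token of each phrase and join the stems with single spaces.
# stem() replaces the stem_word() method found in old NLTK releases.
df["Stemming Words"] = df["Words"].apply(
    lambda phrase: " ".join(stemmer.stem(word) for word in tokenizer.tokenize(phrase)))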

Good! At this point the basic text processing is done. The result looks like this:

	Words	Stemming Words	Count
0	pets insurance	pet insur	1
1	pets insure	pet insur	1
2	pet insurance	pet insur	1
3	pet insur	pet insur	1
4	pet insurance"	pet insur	1
5	pet insu	pet insu	1

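Grouping and counting

Group and count in Pandas, storing the summary table in a new DataFrame named uniqueWords:

# sort_values() replaces the sort() method found in old Pandas releases
uniqueWords = df.groupby('Stemming Words', as_index=False)['Count'].sum().sort_values('Count')
uniqueWords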

	Stemming Words	Count
0	pet insu	1
1	pet insur	5

Did you notice? One entry, pet insu, still has not been handled.
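For readers who want to check the tally without Pandas, the same counts can be reproduced with the standard library. This is only a sketch: the `stems` list is typed in by hand from the table above rather than computed.

```python
from collections import Counter

# Stems copied from the table above, one entry per search phrase
stems = ["pet insur", "pet insur", "pet insur", "pet insur", "pet insur", "pet insu"]

counts = Counter(stems)

# Sort ascending by count, mirroring the sorted summary table
for stem, count in sorted(counts.items(), key=lambda item: item[1]):
    print(stem, count)
```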

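Spell checking

For words the user misspelled, spell checking is the first thing that comes to mind. In Python we can use enchant (the PyEnchant package):

sudo pip install pyenchant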

Use enchant to check for spelling errors and get a suggested word:

import enchant
from nltk.metrics import edit_distance

class SpellingReplacer(object):
    def __init__(self, dict_name='en', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        # Honor the argument instead of hard-coding 2
        self.max_dist = max_dist

    def replace(self, word):
        # Correctly spelled words are returned unchanged
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)
        # Accept the top suggestion only if it is close enough to the input
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        return word

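# Only needed if the class above is saved in a module named replacers.py;
# skip this import when the class is defined in the same notebook.
from replacers import SpellingReplacer

replacer = SpellingReplacer()
replacer.replace('insu')

'insu'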

However, the result is still not the “insur” we expected. Can we approach the problem differently?
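For reference, the replacer accepts the dictionary's first suggestion only when its edit distance from the input is at most max_dist. The following is a rough pure-Python sketch of that distance, written for illustration only; the code above uses NLTK's edit_distance, and this stand-alone function is not taken from NLTK.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


print(edit_distance("insu", "insur"))      # one insertion
print(edit_distance("insu", "insurance"))  # five insertions
```

So “insur” is only one edit away from “insu”, well within the limit of 2. The distance check is not the obstacle; “insur” is a stem rather than an English word, so a general-purpose dictionary is unlikely to offer it as a suggestion in the first place.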

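Domain-specific considerations

What makes user input special comes largely from the industry and the usage scenario. Spell checking against a general-purpose English dictionary clearly will not work here; moreover, some words are spelled correctly yet were meant to be a different word. But how do we connect this background knowledge to the data analysis?

After some thought, I believe the most important reference is the existing analysis result itself. Let's look at it again: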

	Stemming Words	Count
0	pet insu	1
1	pet insur	5

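The five existing “pet insur” entries already give us reference data: we can cluster this data and remove more of the noise.

Similarity calculation

Compute the similarity between the existing results and fold any entry that clears the similarity threshold into its similar set: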

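import Levenshtein

# Levenshtein.ratio() returns a similarity between 0 and 1,
# so this threshold is a minimum similarity, not a distance.
minSimilarity = 0.8

# uniqueWords is sorted by Count in ascending order: compare each row with the
# next, more frequent one and adopt its stem when the two are similar enough.
for j in range(len(uniqueWords) - 1):
    current = uniqueWords.loc[j, "Stemming Words"]
    nextWord = uniqueWords.loc[j + 1, "Stemming Words"]
    if Levenshtein.ratio(current, nextWord) > minSimilarity:
        uniqueWords.loc[j, "Stemming Words"] = nextWord

uniqueWords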

Check the result: the match succeeded!

	Stemming Words	Count
0	pet insur	1
1	pet insur	5

As the last step, group and count the results again:

uniqueWords = uniqueWords.groupby(['Stemming Words'], as_index=False).sum()
uniqueWords

With that, we have completed the preliminary text processing.

	Stemming Words	Count
0	pet insur	6
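The merge-and-recount idea extends beyond two rows. Here is a minimal sketch that folds each rare stem into the most frequent stem it closely resembles. It swaps in `difflib.SequenceMatcher` from the standard library as a stand-in for `Levenshtein.ratio`; the two scores agree on this example but are not guaranteed to match in general. The function name `merge_similar` and its `min_similarity` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def merge_similar(counts, min_similarity=0.8):
    """Fold each stem into the most frequent stem it closely resembles."""
    merged = {}
    # Visit frequent stems first so they become the canonical spellings
    for stem, count in sorted(counts.items(), key=lambda item: -item[1]):
        target = stem
        for canonical in merged:
            if SequenceMatcher(None, stem, canonical).ratio() > min_similarity:
                target = canonical
                break
        merged[target] = merged.get(target, 0) + count
    return merged


result = merge_similar({"pet insu": 1, "pet insur": 5})
print(result)  # {'pet insur': 6}
```

Treating the most frequent spelling as canonical reflects the assumption made in this post: within one industry, the spelling most users type is the best available reference. The 0.8 threshold is carried over from the code above and would need tuning on real data.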
