[Natural Language Processing] NLP Principles and Fundamentals, Explained with NLTK

Posted by LHBlog on 2018-07-08

Original URL: https://www.cnblogs.com/LHWorldBlog/p/9279051.html

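I. Introduction

NLTK is a well-known natural language processing library for Python. It ships with its own corpora and part-of-speech resources, provides built-in classification, tokenization, and other functions, has strong community support, and has many simplified wrappers built on top of it.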

II. Text Preprocessing

1. Installing NLTK

pip install -U nltk

Install the corpora (a collection of dialogues and a set of models):

import nltk
nltk.download()

2. Feature overview

3. Text processing workflow

4. Tokenize: splitting long sentences into "meaningful" small pieces

import jieba

seg_list = jieba.cut("我來到北京清華大學", cut_all=True)
print("Full Mode:", "/ ".join(seg_list))  # full mode
seg_list = jieba.cut("我來到北京清華大學", cut_all=False)
print("Default Mode:", "/ ".join(seg_list))  # accurate mode
seg_list = jieba.cut("他來到了網易杭研大廈")  # accurate mode is the default
print(", ".join(seg_list))
# search-engine mode
seg_list = jieba.cut_for_search("小明碩士畢業於中國科學院計算所，後在日本京都大學深造")
print(", ".join(seg_list))

Output:

[Full mode]: 我/ 來到/ 北京/ 清華/ 清華大學/ 華大/ 大學
[Accurate mode]: 我/ 來到/ 北京/ 清華大學
[New-word recognition]: 他, 來到, 了, 網易, 杭研, 大廈
(Here "杭研" is not in the dictionary, but the Viterbi algorithm still recognizes it.)
[Search-engine mode]: 小明, 碩士, 畢業, 於, 中國, 科學, 學院, 科學院, 中國科學院, 計算, 計算所, 後, 在, 日本, 京都, 大學, 日本京都大學, 深造

Tokenizing social-media text:

import re

emoticons_str = r"""
    (?:
        [:=;]  # eyes
        [oO\-]?  # nose
        [D\)\]\(\]/\\OpP]  # mouth
    )"""
regex_str = [
    emoticons_str,
    r'<[^>]+>',  # HTML tags
    r'(?:@[\w_]+)',  # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hashtags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])",  # words containing - and '
    r'(?:[\w_]+)',  # other words
    r'(?:\S)'  # anything else
]

Regular expression reference:
http://www.regexlab.com/zh/regref.htm

These patterns make it possible to handle emoticons and other symbols in social-media text:

tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^' + emoticons_str + '$', re.VERBOSE | re.IGNORECASE)


def tokenize(s):
    return tokens_re.findall(s)


def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        # lowercase everything except emoticons, where case matters (e.g. :D)
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens


tweet = 'RT @angelababy: love you baby! :D http://ah.love #168cm'
print(preprocess(tweet))
# ['RT', '@angelababy', ':', 'love', 'you', 'baby',
#  '!', ':D', 'http://ah.love', '#168cm']

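5. Word-form normalization

Stemming: generally, chopping off the small inflectional endings that do not affect the part of speech.
walking, drop "ing" = walk
walked, drop "ed" = walk
Lemmatization: mapping all the different variants of a word to a single base form.
went, lemmatized = go
are, lemmatized = be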
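
The difference between the two ideas can be illustrated with a toy sketch. The suffix list and the lookup table below are hypothetical and deliberately tiny; real stemmers such as NLTK's PorterStemmer apply many more rules, and real lemmatizers rely on a full dictionary.

```python
# Toy illustration only: a rule-based "stemmer" versus a lookup-based "lemmatizer".
SUFFIXES = ("ing", "ed", "s")                 # hypothetical, tiny rule set
LEMMA_TABLE = {"went": "go", "are": "be"}     # hypothetical table of irregular forms


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up among the irregular forms, else fall back to stemming."""
    return LEMMA_TABLE.get(word, toy_stem(word))


for w in ["walking", "walked", "went", "are"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

Suffix stripping handles "walking" and "walked" but leaves "went" and "are" unchanged, which is why lemmatization needs dictionary knowledge rather than rules alone.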

>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('maximum')
'maximum'
>>> porter_stemmer.stem('presumably')
'presum'
>>> porter_stemmer.stem('multiply')
'multipli'
>>> porter_stemmer.stem('provision')
'provis'
>>> from nltk.stem import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem('maximum')
'maximum'
>>> snowball_stemmer.stem('presumably')
'presum'
>>> from nltk.stem.lancaster import LancasterStemmer
>>> lancaster_stemmer = LancasterStemmer()
>>> lancaster_stemmer.stem('maximum')
'maxim'
>>> lancaster_stemmer.stem('presumably')
'presum'
>>> from nltk.stem.porter import PorterStemmer
>>> p = PorterStemmer()
>>> p.stem('went')
'went'
>>> p.stem('wenting')
'went'

6. Part-of-speech (POS) tagging

>>> import nltk
>>> text = nltk.word_tokenize('what does the fox say')
>>> text
['what', 'does', 'the', 'fox', 'say']
>>> nltk.pos_tag(text)
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]

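7. Stopwords

First, remember to download the stopword list from the console, or call nltk.download('stopwords').

from nltk.corpus import stopwords

# First tokenize the text to get a word_list
# ...
# Then filter out the stopwords
english_stopwords = set(stopwords.words('english'))  # build the set once for fast lookups
filtered_words = [word for word in word_list if word not in english_stopwords]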

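8. A text preprocessing pipeline

III. Applications of NLP

In effect, preprocessing turns raw text into a Word_List, and NLP then converts that list into a representation the computer can work with.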
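
As a rough sketch of how the preprocessing steps chain together into a Word_List, the example below uses only the standard library. The stopword set and the suffix stripper are hypothetical stand-ins for `nltk.corpus.stopwords` and an NLTK stemmer, so the output is illustrative rather than what NLTK would produce.

```python
import re

# Hypothetical stand-ins: a tiny stopword set and a crude suffix stripper.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "and", "to", "in"}


def tokenize(text):
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def crude_stem(word):
    """Strip a few common English suffixes; not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Raw text -> tokens -> stopword filtering -> stemming -> word list."""
    tokens = tokenize(text)
    kept = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in kept]


word_list = preprocess("The fox is jumping over the lazy dogs")
print(word_list)  # ['fox', 'jump', 'over', 'lazy', 'dog']
```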

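NLP has several applications, including sentiment analysis, text similarity, and text classification.

1. Sentiment analysis

The simplest approach is a sentiment dictionary, which works like a keyword scoring scheme: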

like 1
good 2
bad -2
terrible -3

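sentiment_dictionary = {}
with open('data/AFINN-111.txt') as f:
    for line in f:
        word, score = line.split('\t')
        sentiment_dictionary[word] = int(score)
# With the score table stored in a dict,
# walk through the sentence and add up the matching values.
# `words` is the tokenized sentence (the word list produced by preprocessing).
total_score = sum(sentiment_dictionary.get(word, 0) for word in words)
# A word in the dict contributes its score; any other word contributes 0.
# The result is a sentiment score for the sentence.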

Clearly this method is too naive. What about new words? What about special vocabulary? What about anything deeper than surface keywords?
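
To make both the dictionary approach and its blind spot concrete, here is a minimal self-contained sketch. It uses only the four scores from the small table in this section instead of the AFINN-111 file, and the function name `sentiment_score` is made up for illustration.

```python
# Scores copied from the four-word table in this section; a real list
# such as AFINN-111 has far more entries.
sentiment_dictionary = {"like": 1, "good": 2, "bad": -2, "terrible": -3}


def sentiment_score(sentence: str) -> int:
    """Sum the dictionary scores of the words; unknown words count as 0."""
    words = sentence.lower().split()
    return sum(sentiment_dictionary.get(word, 0) for word in words)


print(sentiment_score("I like this good book"))    # 1 + 2 = 3
print(sentiment_score("what a terrible bad day"))  # -3 + -2 = -5
print(sentiment_score("this movie is lit"))        # no known words, so 0
```

The last call shows the weakness: slang or any word missing from the dictionary is silently scored as neutral.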

Sentiment analysis with machine learning

from nltk.classify import NaiveBayesClassifier

# Make up a small training set
s1 = 'this is a good book'
s2 = 'this is a awesome book'
s3 = 'this is a bad book'
s4 = 'this is a terrible book'


def preprocess(s):
    # Sentence processing.
    # Here we simply use split() to separate the words in the sentence;
    # many other processing methods could obviously be used.
    return {word: True for word in s.lower().split()}


# The return value looks like this:
# {'this': True, 'is': True, 'a': True, 'good': True, 'book': True}
# The key is called fname and is each word that appears in the text;
# the value is called fval and is the value attached to that word.
# Here we use the simplest value, True, to mean "this word appears in the current sentence".
# Later this function can be upgraded to use a richer fval, such as word2vec.

# Put the training set into the standard form
training_data = [[preprocess(s1), 'pos'],
                 [preprocess(s2), 'pos'],
                 [preprocess(s3), 'neg'],
                 [preprocess(s4), 'neg']]
# Feed it to the model
model = NaiveBayesClassifier.train(training_data)
# Print the result
print(model.classify(preprocess('this is a good book')))

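2. Text similarity

A common approach is to represent a text's features by the frequencies of its elements (words).

Cosine similarity is then used to compute how similar two texts are: similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖), where A and B are the frequency vectors of the two texts.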
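
A minimal sketch of this computation, using only the standard library: each text becomes a bag-of-words frequency vector via `collections.Counter`, and the similarity is the cosine of the angle between the two vectors. The function name and the sample sentences are illustrative assumptions, not taken from the original post.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the shared vocabulary (all other terms contribute 0).
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


s1 = "this is my sentence".split()
s2 = "this is my life".split()
s3 = "he likes football".split()

print(cosine_similarity(s1, s2))  # 3 shared words, norms 2 and 2 -> 0.75
print(cosine_similarity(s1, s3))  # no shared words -> 0.0
```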

Frequency counting:

import nltk
from nltk import FreqDist

# Build a small corpus first
corpus = 'this is my sentence ' \
         'this is my life ' \
         'this is the day'
# Tokenize it quickly.
# As mentioned above, any preprocessing can be applied here as needed:
# stopwords, lemma, stemming, etc.
tokens = nltk.word_tokenize(corpus)
print(tokens)
# The tokenized word list:
# ['this', 'is', 'my', 'sentence',
#  'this', 'is', 'my', 'life', 'this',
#  'is', 'the', 'day']
# Use NLTK's FreqDist to count how often each word occurs
fdist = FreqDist(tokens)
# It behaves much like a dict:
# index it with a word to see how many times that word occurs in the whole text
print(fdist['is'])
# 3

# Now we can pull out the 50 most common words
standard_freq_vector = fdist.most_common(50)
size = len(standard_freq_vector)
print(standard_freq_vector)
# [('is', 3), ('this', 3), ('my', 2),
# ('the', 1), ('d

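3. Text classification

TF: Term Frequency, which measures how frequently a term occurs in a document.
TF(t) = (number of times t appears in the document) / (total number of terms in the document).
IDF: Inverse Document Frequency, which measures how important a term is.
Some words occur very often but are clearly not very informative, such as 'is', 'the', and 'and'.
To compensate, we raise the weight of rare words and lower the weight of common words.
IDF(t) = log_e(total number of documents / number of documents containing t).
TF-IDF = TF * IDF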

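An example:
A document contains 100 words, and the word "baby" appears 3 times.
Then TF(baby) = 3 / 100 = 0.03.
Now suppose we have 10M documents, and "baby" appears in 1,000 of them.
Then IDF(baby) = log_10(10,000,000 / 1,000) = 4. (This example uses a base-10 logarithm; with the natural logarithm from the definition of IDF the value would be about 9.21.)
So TF-IDF(baby) = TF(baby) * IDF(baby) = 0.03 * 4 = 0.12.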
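
The arithmetic in this example can be checked with a few lines of Python. The variable names are illustrative; the numbers are the ones from the example, and both logarithm bases are shown because the choice of base changes the scale of the result.

```python
import math

# Numbers from the worked example.
term_count = 3            # occurrences of "baby" in the document
doc_length = 100          # total terms in the document
total_docs = 10_000_000   # documents in the collection
docs_with_term = 1_000    # documents that contain "baby"

tf = term_count / doc_length
idf_base10 = math.log10(total_docs / docs_with_term)   # base-10 logarithm
idf_natural = math.log(total_docs / docs_with_term)    # natural logarithm

print(tf)                 # 0.03
print(idf_base10)         # 4.0
print(tf * idf_base10)    # 0.12
print(tf * idf_natural)   # about 0.276
```

Either base ranks terms in the same order, since the two IDF values differ only by a constant factor.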

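from nltk.text import TextCollection

# First, put all the documents into a TextCollection.
# The class handles the statistics and calculations; each text is passed
# as a list of tokens, so split the sentences into words first.
corpus = TextCollection(['this is sentence one'.split(),
                         'this is sentence two'.split(),
                         'this is sentence three'.split()])
# The tf-idf value can then be computed directly
# (term: a term in a sentence, text: that sentence)
print(corpus.tf_idf('this', 'this is sentence four'.split()))
# 'this' occurs in all three texts, so its idf is log(3/3) = 0 and this prints 0.0
# In the same way, how do we get a fixed-size vector to represent any sentence?
# For each new sentence
new_sentence = 'this is sentence five'.split()
# Loop over every word in the vocabulary
# (standard_vocab is assumed to be a list of vocabulary words built beforehand)
for word in standard_vocab:
    print(corpus.tf_idf(word, new_sentence))
# The result is one very long vector (length = size of the vocabulary)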

So far we have seen two ways to represent a sentence: term frequency and TF-IDF.
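
As a sketch of the TF-IDF representation, the example below builds a fixed-length vector for a sentence in plain Python, following the TF and IDF definitions given earlier with a natural logarithm. The helper names and the way the vocabulary is built (all words in the collection, sorted) are assumptions made for illustration, not the NLTK implementation.

```python
import math


def tf(term, tokens):
    """Term frequency: share of the document's tokens equal to `term`."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def idf(term, documents):
    """Inverse document frequency (natural log); 0.0 for unseen terms."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing) if containing else 0.0


def tfidf_vector(tokens, documents, vocabulary):
    """One TF-IDF weight per vocabulary word, always in the same order."""
    return [tf(word, tokens) * idf(word, documents) for word in vocabulary]


documents = [s.split() for s in ("this is sentence one",
                                 "this is sentence two",
                                 "this is sentence three")]
vocabulary = sorted({word for doc in documents for word in doc})
# vocabulary is ['is', 'one', 'sentence', 'this', 'three', 'two']

vector = tfidf_vector("this is sentence one".split(), documents, vocabulary)
print(vector)
```

Words that occur in every document ("this", "is", "sentence") get a weight of 0, so only "one" carries weight in this vector.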
