A Complete Record of Installing and Using NLTK (Python)

Posted by airl on 2022-03-28

Original URL: https://www.cnblogs.com/hjk-airl/p/16066851.html

Preface

I needed sentiment analysis for an earlier experiment, so I downloaded NLTK. This post records how I used it.

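Detailed steps, from download and installation to hands-on use

Downloading and installing NLTK

First install the package with pip install nltk.
Then run the two lines of code below. A GUI window pops up; check the download location, then click to download everything. The full download was about 3.5 GB.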

import nltk
nltk.download()

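Note: if access to GitHub is slow because of network problems, the GUI may fail to open and the download cannot start. In that case you can download the data from GitHub yourself.
URL: https://github.com/nltk/nltk_data/tree/gh-pages/packages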
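
If only the resources used in this post are needed, they can also be fetched by name instead of downloading everything through the GUI. The following is a minimal sketch, assuming that the three packages brown, stopwords and vader_lexicon are enough for the examples here. download_required is a hypothetical helper name, and calling it still needs network access.

```python
# Packages assumed to be sufficient for the examples in this post.
REQUIRED_PACKAGES = ["brown", "stopwords", "vader_lexicon"]


def download_required(packages=REQUIRED_PACKAGES, download_dir=None):
    """Download each named package and return {package: True/False}."""
    import nltk  # imported lazily so the helper can be defined offline

    results = {}
    for name in packages:
        # nltk.download returns True when the download succeeds
        results[name] = nltk.download(name, download_dir=download_dir)
    return results


# Usage (requires network access):
# print(download_required())
```

Passing download_dir lets you choose where the data is stored. Leaving it as None uses NLTK's default location.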

After the download succeeds, check that it works: run the code below to see whether the Brown corpus can be loaded.

from nltk.corpus import brown

print(brown.categories())  # print the categories of the Brown corpus
print(len(brown.sents()))  # print the number of sentences in the Brown corpus
print(len(brown.words()))  # print the number of words in the Brown corpus

'''
Result:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
57340
1161192
'''

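At this point you may get an error saying that nltk_data was not found in any of the folders listed in the message.
Unzip the downloaded files and copy them into one of those folders, paying attention to the file names. After that everything works normally!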
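
To see which folders NLTK actually searches, and to register an extra one, something like the following may help. It is a sketch under the assumption that the data was unzipped into a folder of your own choosing. describe_paths and register_data_dir are hypothetical helper names, while nltk.data.path is the list of directories NLTK searches.

```python
import os


def describe_paths(paths):
    """Return (path, exists_on_disk) pairs for a list of directories."""
    return [(p, os.path.isdir(p)) for p in paths]


def register_data_dir(custom_dir):
    """Add custom_dir to NLTK's search path and report all search folders."""
    import nltk  # lazy import: only needed when the helper is called

    if custom_dir not in nltk.data.path:
        nltk.data.path.append(custom_dir)
    return describe_paths(nltk.data.path)


# Usage (the folder name is a placeholder for wherever you unzipped the data):
# for path, exists in register_data_dir("D:/nltk_data"):
#     print(path, "exists" if exists else "missing")
```

Inside the nltk_data folder, corpora such as brown are expected under a corpora subfolder, for example nltk_data/corpora/brown.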

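Hands-on: working with your own data

1. Training and analysis with your own training set

My training set and code are organized as follows: the pos and neg folders contain txt files.
Link: https://pan.baidu.com/s/1GrNg3ziWJGhcQIWBCr2PMg
Extraction code: 1fb8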

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
import os
from nltk.corpus import stopwords


# Turn a list of words into a feature dict of the form {word: True}
def extract_features(word_list):
    return dict([(word, True) for word in word_list])

# Stop words: NLTK's English list plus a few punctuation and suffix tokens
stop = stopwords.words('english')
stop1 = ['!', ',' ,'.' ,'?' ,'-s' ,'-ly' ,' ', 's','...']
stop = stop1+stop
print(stop)

# Read a txt file and return its words with the stop words removed
def readtxt(f, path):
    # Open the file with utf-8 encoding; it is closed automatically
    with open(path + f, encoding="utf-8") as file:
        data = file.read().split()
    # Keep only the words that are not in the stop list
    return [word for word in data if word not in stop]


if __name__ == '__main__':

    # Load the positive and negative reviews. Some stop words are removed from them in the readtxt function.
    # Stop words such as i, am, you, a, this are very common in reviews and may affect the result, so they should be removed beforehand.
    positive_fileids = os.listdir('pos')  # positive reviews: a list of 42 entries, each one a txt file
    print(type(positive_fileids), len(positive_fileids))
    negative_fileids = os.listdir('neg')  # negative reviews: a list of 22 entries, each one a txt file; data I collected myself
    print(type(negative_fileids), len(negative_fileids))

    # Build labeled feature sets for the positive and the negative reviews
    # For each txt file, readtxt returns a list of words, in the same form as movie_reviews.words(fileids=[f]): ['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
    # features_positive is a list
    # of the form: [({'shakesp': True, 'limit': True, 'mouth': True, ..., 'such': True, 'prophetic': True}, 'Positive'), ..., ({...}, 'Positive'), ...]
    path = 'pos/'
    features_positive = [(extract_features(readtxt(f,path=path)), 'Positive') for f in positive_fileids]
    path = 'neg/'
    features_negative = [(extract_features(readtxt(f,path=path)), 'Negative') for f in negative_fileids]

    # Split into a training set (80%) and a test set (20%)
    threshold_factor = 0.8
    threshold_positive = int(threshold_factor * len(features_positive))  # 33 of the 42 positive files
    threshold_negative = int(threshold_factor * len(features_negative))  # 17 of the 22 negative files
    # 33 positive and 17 negative texts form the training set; the remaining 9 + 5 form the test set
    features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
    features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]
    print("\nNumber of training data points:", len(features_train))
    print("Number of test data points:", len(features_test))

    # Train the naive Bayes classifier
    classifier = NaiveBayesClassifier.train(features_train)
    print("\nClassifier accuracy:", nltk.classify.util.accuracy(classifier, features_test))
    print("\nTop five most informative words:")
    for item in classifier.most_informative_features()[:5]:
        print(item[0])

    # Enter some simple reviews
    input_reviews = [
        "works well with proper preparation.",
        ]

    # Run the classifier and get the predictions
    print("\nPredictions:")
    for review in input_reviews:
        print("\nReview:", review)
        probdist = classifier.prob_classify(extract_features(review.split()))
        pred_sentiment = probdist.max()
        # Print the output
        print("Predicted sentiment:", pred_sentiment)
        print("Probability:", round(probdist.prob(pred_sentiment), 2))

print("Done")
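
The feature format used by the script above can be tried in isolation, without NLTK or the training files. This is a minimal sketch: DEMO_STOP is a tiny stand-in stop list chosen for illustration (the script itself uses NLTK's English stop words), and tokens_without_stop_words is a hypothetical helper that mirrors what readtxt does to the text of one file.

```python
def extract_features(word_list):
    """Map every word to True (a bag-of-words presence feature)."""
    return {word: True for word in word_list}


# A tiny stand-in stop list; the script above uses NLTK's English stop words.
DEMO_STOP = {"i", "am", "this", "with", ".", "!"}


def tokens_without_stop_words(text, stop_words=DEMO_STOP):
    """Split on whitespace and drop tokens found in stop_words."""
    return [tok for tok in text.split() if tok not in stop_words]


tokens = tokens_without_stop_words("i am very happy with this product")
features = extract_features(tokens)
print(tokens)    # ['very', 'happy', 'product']
print(features)  # {'very': True, 'happy': True, 'product': True}

# One labeled example, in the (features, label) shape passed to the classifier
labeled_example = (features, "Positive")
```

Note that splitting on whitespace leaves punctuation attached to words: "preparation." stays one token, so the '.' entry in the stop list only removes free-standing dots.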

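Output: the accuracy here is rather high. That is because the data I selected expresses positive or negative sentiment very clearly, so the result is hard to trust.

<class 'list'> 42
<class 'list'> 22

Number of training data points: 50
Number of test data points: 14

Classifier accuracy: 1.0

Top five most informative words:
microwave
product
works
ever
service

Predictions:

Review: works well with proper preparation.
Predicted sentiment: Positive
Probability: 0.77
Done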
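
The 50 and 14 in the output above follow directly from the 80% threshold applied to 42 positive and 22 negative files. The arithmetic can be checked with a short sketch; split_sizes is a hypothetical helper that uses the same int() truncation as the script.

```python
def split_sizes(n_items, train_fraction=0.8):
    """Return (train, test) counts using int() truncation, as in the script."""
    n_train = int(train_fraction * n_items)
    return n_train, n_items - n_train


pos_train, pos_test = split_sizes(42)  # 42 positive files
neg_train, neg_test = split_sizes(22)  # 22 negative files

print(pos_train + neg_train)  # 50 training examples
print(pos_test + neg_test)    # 14 test examples

# With 14 test examples, one misclassification changes accuracy by 1/14
print(round(1 / (pos_test + neg_test), 3))  # 0.071
```

With a test set this small, an accuracy of 1.0 only means that 14 clear-cut reviews were classified correctly, which is one more reason to treat the number with caution.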

2. Analysis with the built-in library

import pandas as pd

from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Analyzing the sentiment of sentences: sentiment analysis is one of the most popular applications of NLP. It is the process of determining whether a given piece of text is positive or negative.
# In some scenarios "neutral" is added as a third option. Sentiment analysis is often used to find out how people feel about a particular topic.

if __name__ == '__main__':

    # Read the reviews from the review_body column of an Excel file
    # (comment these lines out if you do not have the file)
    #data = pd.read_excel('data3/microwave1.xlsx')
    name = 'hair_dryer1'
    data = pd.read_excel('../data3/'+name+'.xlsx')
    input_reviews = data[u'review_body']
    input_reviews = input_reviews.tolist()
    # For this demonstration the list is replaced with a few simple reviews
    input_reviews = [
        "works well with proper preparation.",
        "i hate that opening the door moves the microwave towards you and out of its place. thats my only complaint.",
        "piece of junk. got two years of use and it died. customer service says too bad. whirlpool dishwasher died a few months ago. whirlpool is dead to me.",
        "am very happy with  this"
        ]

    # Create the analyzer once, then score every sentence
    sid = SentimentIntensityAnalyzer()
    for sentence in input_reviews:
        ss = sid.polarity_scores(sentence)
        print("Sentence:"+sentence)
        for k in sorted(ss):
            print('{0}: {1}, '.format(k, ss[k]), end='')

        print()
print("Done")

Result:

Sentence:works well with proper preparation.
compound: 0.2732, neg: 0.0, neu: 0.656, pos: 0.344, 
Sentence:i hate that opening the door moves the microwave towards you and out of its place. thats my only complaint.
compound: -0.7096, neg: 0.258, neu: 0.742, pos: 0.0, 
Sentence:piece of junk. got two years of use and it died. customer service says too bad. whirlpool dishwasher died a few months ago. whirlpool is dead to me.
compound: -0.9432, neg: 0.395, neu: 0.605, pos: 0.0, 
Sentence:am very happy with  this
compound: 0.6115, neg: 0.0, neu: 0.5, pos: 0.5, 
Done

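Interpreting the results:
compound is an overall score that combines the negative and positive evidence, normalized to the range from -1 (most negative) to +1 (most positive)
neg: proportion of the text rated negative
pos: proportion of the text rated positive
neu: proportion of the text rated neutral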
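
To turn the compound score into a single label, a common convention is to use thresholds of +0.05 and -0.05. That convention comes from the VADER project's documentation, not from this post, so treat the sketch below as one reasonable choice. label_from_compound is a hypothetical helper name.

```python
def label_from_compound(compound, threshold=0.05):
    """Turn a VADER compound score into a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the results above
compound_scores = [0.2732, -0.7096, -0.9432, 0.6115]
print([label_from_compound(c) for c in compound_scores])
# ['positive', 'negative', 'negative', 'positive']
```

A wider threshold makes the neutral band larger, which may suit reviews that mix praise and complaints.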
