Converting a jieba-segmented corpus into TF-IDF vectors

Posted by d_benhua on 2020-12-09

Original URL: https://blog.csdn.net/d_benhua/article/details/110914269

Jieba word segmentation

Part 2: Segmenting a classified corpus with jieba

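Reference for this article: https://blog.csdn.net/SA14023053/article/details/52083399
jieba reference: https://github.com/fxsjy/jieba

This article continues from the previous one, "Preprocessing Chinese Text".

It preprocesses and segments the files of a classified corpus, then removes stop words.
The Chinese corpus consists of three files from the C7-History category of the Fudan University Chinese corpus (test_corpus): C7-History001.txt, C7-History002.txt, and C7-History004.txt.
The stop words come from a Chinese stop-word list.

Data files download link: https://github.com/JackDani/Preprocessing_Chinese_Text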

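text_mining
- text_corpus_small directory: the original corpus, containing the corpus files.
- text_corpus_pos directory: the corpus after preprocessing.
- text_corpus_segment directory: the corpus after segmentation.
- text_corpus_dropstopword directory: the corpus after stop-word removal.
- text_corpus_dict directory: the generated dictionary file.
- text_corpus_bow directory: the generated BoW vector file.
- text_corpus_tfidf directory: the generated TF-IDF vectors.
- Test directory: the Python scripts.
- - corpus_pos.py: preprocesses the corpus.
- - corpus_segment.py: segments the corpus.
- - corpus_dropstopword.py: removes stop words from the corpus.
- - corpus_tfidf.py: converts the segmented corpus into TF-IDF vectors.
- stopword directory: the stop-word list.
- README.txt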

1. Keep only Chinese characters

Remove every character that is not Chinese.

# -*- coding: utf-8 -*-
# Preprocessing script for the classified corpus.
# Input:  raw corpus files in the text_corpus_small directory.
# Output: preprocessed files in the text_corpus_pos directory.

import os

# Directory of the raw classified corpus
small_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_small"+"\\"

# Directory of the preprocessed classified corpus
pos_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_pos"+"\\"

# The paths above are where the files are actually stored on the author's machine.

def is_chinese(uchar):
    """Return True if uchar is a Chinese character."""
    # CJK Unified Ideographs block, U+4E00 to U+9FFF.
    # The narrower range U+4E00 to U+9FA5 is also commonly used.
    return u'\u4e00' <= uchar <= u'\u9fff'

def format_str(ustr):
    """Return ustr with every non-Chinese character removed."""
    return ''.join(ch for ch in ustr if is_chinese(ch))

# Create the output directory if it does not exist yet
if not os.path.exists(pos_path):
    os.makedirs(pos_path)

for file_path in os.listdir(small_path):  # every file under small_path
    file_name = small_path + file_path  # full path of the file
    # The corpus files are GBK-encoded (GB2312 < GBK < GB18030), not UTF-8.
    # errors='ignore' skips bytes that cannot be decoded. Without it,
    # open(file_name, 'r', encoding='gbk') and open(file_name, 'r') both fail with:
    # UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 23677: illegal multibyte sequence
    # Fix taken from: https://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c
    # Python Unicode reference: https://docs.python.org/3/howto/unicode.html#the-unicode-type
    with open(file_name, 'r', encoding='gbk', errors='ignore') as file_read:
        raw_corpus = file_read.read()  # read the raw corpus text

    # format_str is applied to the whole file content at once
    pos_corpus = format_str(raw_corpus)

    # The output file has the same name as the input file
    with open(pos_path + file_path, 'w') as file_write:
        file_write.write(pos_corpus)  # write the preprocessed text

print("Preprocessing succeeded.")
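As an aside, the same filtering can be written with a regular expression instead of a per-character loop. This is only a sketch, not the original author's method: the helper name keep_chinese and the sample string are made up for illustration, and the character range is the same U+4E00 to U+9FFF range tested by is_chinese() above.

```python
import re

# Matches any run of characters outside the CJK Unified Ideographs block
# (U+4E00 to U+9FFF), i.e. everything is_chinese() would reject.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")

def keep_chinese(text):
    """Return text with every non-Chinese character removed (hypothetical helper)."""
    return NON_CHINESE.sub("", text)

sample = "Hello, 世界! 2020年12月"
print(keep_chinese(sample))
```

Punctuation, digits, Latin letters and whitespace are all dropped, so the result is one unbroken run of Chinese characters, which is what the segmentation step expects as input.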

Key points

1. The os module

The os module provides a rich set of functions for working with files and directories.

References:
https://www.runoob.com/python/os-file-methods.html
http://kuanghy.github.io/python-os/
http://python.usyiyi.cn/python_278/library/os.html

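2. os.listdir(path)

Returns a list of the names of the files and directories contained in the directory path.
The list does not include the special entries '.' and '..', even though they are present in the directory.
Available on Unix and Windows.

Syntax

os.listdir(path)

Parameters

path – the directory whose entries are to be listed.

Return value

A list of the files and directories under the given path.

Reference: https://www.runoob.com/python3/python3-os-listdir.html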

# e.g.
# -*- coding: UTF-8 -*-

import os

# Directory to list
path = "C:\\Users\\Wu\\Desktop\\Now Go it"
dirs = os.listdir(path)

# Print every file and directory name
for file in dirs:
    print(file)
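Because os.listdir() returns directories as well as files, a loop like the ones in the scripts here would try to open a subdirectory if one were present in the corpus folder. The following sketch shows one way to tell the two apart; it works in a temporary directory so that it runs anywhere, and the names a.txt and subdir are invented for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Create one subdirectory and one file inside the temporary directory
    os.mkdir(os.path.join(root, "subdir"))
    with open(os.path.join(root, "a.txt"), "w", encoding="utf-8") as f:
        f.write("demo")

    # listdir() returns bare names in arbitrary order, so sort them
    entries = sorted(os.listdir(root))
    # Names must be joined with the directory before they can be tested
    files = [n for n in entries if os.path.isfile(os.path.join(root, n))]
    dirs = [n for n in entries if os.path.isdir(os.path.join(root, n))]

print(entries)
print(files)
print(dirs)
```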

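3. os.path

The os.path module is mainly used to query the attributes of paths and files.

A commonly used function:

os.path.exists(path) # returns True if the path exists; returns False if it does not exist or is a broken symbolic link

Reference: https://www.runoob.com/python3/python3-os-path.html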

# e.g.

import os

path = "C:\\Users\\Wu\\Desktop\\Now Go it\\my"  # a path that exists
print(os.path.exists(path))

path1 = "C:\\Users\\Wu\\Desktop\\Now Go it\\m1"  # a path that does not exist
print(os.path.exists(path1))
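The example above depends on folders on the author's machine. A version that can be run anywhere might look like the sketch below, which checks the same path before it is created, after it is created, and after it is deleted. The file name demo.txt is arbitrary.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, "demo.txt")

before = os.path.exists(file_path)  # the file has not been created yet
with open(file_path, "w", encoding="utf-8") as f:
    f.write("demo")
after = os.path.exists(file_path)   # the file now exists

# Clean up, then check the directory itself
os.remove(file_path)
os.rmdir(tmp_dir)
gone = os.path.exists(tmp_dir)

print(before, after, gone)
```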

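4. os.makedirs(path, mode = 0o777)

Creates directories (folders) recursively. It works like mkdir(), but also creates every intermediate-level directory needed to contain the leaf directory.
If the directory cannot be created or already exists, an OSError is raised; on Windows, Error 183 is the error for a directory that already exists.
If path has only one level, it behaves the same as mkdir().

Syntax

os.makedirs(path, mode = 0o777)

Parameters

path – the directory to create recursively; it may be a relative or an absolute path.
mode – the permission mode.

Return value

This function has no return value.

Reference: https://www.runoob.com/python/os-makedirs.html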

# e.g. example

# -*- coding: UTF-8 -*-

import os

# Directory to create (a folder, on Windows)
path = "C:\\Users\\Wu\\Desktop\\Now Go it\\my"
for i in range(0,4):
    os.makedirs(path + "\\" + str(i))
print("Directories created.")
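The "already exists" error described above is why the scripts in this article check os.path.exists() before calling os.makedirs(). In Python 3 the function also accepts an exist_ok argument that makes the check unnecessary. A small sketch, run in a temporary directory with made-up folder names:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "my", "0")

    os.makedirs(target)          # creates both "my" and "my/0"
    try:
        os.makedirs(target)      # second call: the directory already exists
        raised = False
    except FileExistsError:      # FileExistsError is a subclass of OSError
        raised = True

    # With exist_ok=True an existing directory is silently accepted
    os.makedirs(target, exist_ok=True)
    created = os.path.isdir(target)

print(raised, created)
```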

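5. open()

Opens a file and returns a file object. If the file cannot be opened, an OSError is raised.
Note: a file opened with open() must be closed with close() once you are done with it.

Syntax

open(file, mode = 'r')
Full signature: open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

Parameters

file: required; the file path (relative or absolute).
mode: optional; the mode in which the file is opened.
buffering: sets the buffering policy.
encoding: the text encoding; usually utf-8.
errors: how encoding and decoding errors are handled.
newline: controls how line endings are handled.
closefd: depends on the type of the file argument; it only matters when file is a file descriptor rather than a path.
opener: a custom opener; its return value must be an open file descriptor.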
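To show how the encoding and errors parameters interact, here is a sketch that writes a few GBK bytes followed by one deliberately invalid byte, then reads the file back with errors='ignore', as the preprocessing script does. It also uses a with statement, which closes the file automatically. The file name and the sample bytes are invented for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "sample.txt")

    # Write GBK-encoded text followed by a stray byte that GBK cannot decode
    with open(path, "wb") as f:
        f.write("中文".encode("gbk") + b"\xff")

    # errors="ignore" drops the undecodable byte instead of raising
    with open(path, "r", encoding="gbk", errors="ignore") as f:
        text = f.read()

    # Leaving the with block has already closed the file
    closed = f.closed

print(text, closed)
```

Note that errors='ignore' silently discards data, so it is best kept for corpora like this one where a few damaged bytes do not matter.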

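6. read()

Reads the specified amount of data from the file; if size is omitted or negative, the whole file is read.
In text mode size counts characters; in binary mode it counts bytes.

Syntax

fileobject.read([size])

Parameters

size – optional; the amount to read from the file. The default is -1, which reads the whole file.

Return value

The data read from the file: a str in text mode, bytes in binary mode.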

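7. write()

Writes the given string to the file.
Until the file is closed or the buffer is flushed, the string is held in the buffer, so the written content is not yet visible in the file.
If the file was opened in a mode containing b, the str argument must first be converted to bytes with encode(); otherwise the call fails with TypeError: a bytes-like object is required, not 'str'.

Syntax

fileobject.write(str)

Parameters

str – the string to write to the file.

Return value

The number of characters written.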
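The difference between text mode and binary mode can be checked with a short sketch. It writes the same three-character string once as text and once as UTF-8 bytes, and also triggers the TypeError mentioned above. The file names are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Text mode: write() takes a str and returns the number of characters
    with open(os.path.join(root, "out.txt"), "w", encoding="utf-8") as f:
        n_chars = f.write("歷史學")

    # Binary mode: write() takes bytes and returns the number of bytes
    with open(os.path.join(root, "out.bin"), "wb") as f:
        try:
            f.write("歷史學")  # passing a str in binary mode fails
            error = ""
        except TypeError as exc:
            error = str(exc)
        n_bytes = f.write("歷史學".encode("utf-8"))

print(n_chars, n_bytes)
print(error)
```

Each of these three characters takes three bytes in UTF-8, which is why the two counts differ.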

8. close()

Closes an open file.
After opening a file and finishing with it, always close it.

Syntax

fileobject.close()

Parameters

None.

Return value

None.

9. readlines()

Reads all lines (until EOF) and returns them as a list.

Syntax

fileobject.readlines()

Parameters

None required.

Return value

A list containing all the lines.

References:
https://www.runoob.com/python3/python3-file-methods.html
https://blog.csdn.net/weixin_39643135/article/details/91348983
https://blog.csdn.net/weixin_40449300/article/details/79143971
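The three reading methods are easy to mix up, so here is a side-by-side sketch on a small file with one token per line, the same layout the segmentation step produces. The file name and tokens are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "tokens.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("歷史\n中國\n危機")

    with open(path, "r", encoding="utf-8") as f:
        whole = f.read()        # the entire file as one string
    with open(path, "r", encoding="utf-8") as f:
        first = f.readline()    # only the first line, newline included
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()   # every line, each keeping its newline

# splitlines() gives the lines without their trailing newlines
tokens = whole.splitlines()

print(repr(first))
print(lines)
print(tokens)
```

Since readlines() keeps the trailing newline on each line, the later scripts strip it with replace('\n', ''); read().splitlines() is a shorter way to get the same list.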

# e.g. example
# A .doc file is a binary format, so reading it as UTF-8 text will generally fail
path = "C:\\Users\\Wu\\Desktop\\Now Go it\\1.doc"
f = open(path,'r',encoding = 'utf-8')
f_read = f.read()
print(f_read)
f.close()

# e.g. example
path = "C:\\Users\\Wu\\Desktop\\Now Go it\\1.doc"
f = open(path,'rb') # open the .doc file in binary mode
f_read = f.read()
print(f_read)
f.close()

!pip install python-docx # install the python-docx package (Jupyter shell command)
# %pip install python-docx # install the package into the current kernel

# e.g. example
import docx
path = "C:\\Users\\Wu\\Desktop\\Now Go it\\my\\test.docx"
f = docx.Document(path)
for item in f.paragraphs:
    print(item.text)
# This approach successfully prints the contents of the .docx file

# e.g. example
path = "C:\\Users\\Wu\\Desktop\\Now Go it\\my\\test.txt"
f = open(path,'r',encoding = 'utf-8') # open the .txt file in text mode
f_read = f.read()
print(f_read)
f.close()

2. Segment with jieba

# -*- coding: utf-8 -*-
# Segmentation script for the classified corpus.
# Input:  preprocessed files in the text_corpus_pos directory.
# Output: segmented files in the text_corpus_segment directory.

import os
import jieba

# Directory of the preprocessed classified corpus
corpus_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_pos"+"\\"

# Directory of the segmented classified corpus
seg_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_segment"+"\\"

# Create the output directory if it does not exist yet
if not os.path.exists(seg_path):
    os.makedirs(seg_path)

for file_path in os.listdir(corpus_path):  # every file under corpus_path
    file_name = corpus_path + file_path  # full path of the file
    # Read raw bytes; jieba accepts UTF-8 or GBK byte strings and decodes them itself
    with open(file_name, 'rb') as file_read:
        raw_corpus = file_read.read()

    seg_corpus = jieba.cut(raw_corpus)  # segment; returns a generator of tokens

    # The output file has the same name as the input file;
    # tokens are written one per line
    with open(seg_path + file_path, 'w') as file_write:
        file_write.write("\n".join(seg_corpus))

print("Chinese corpus segmentation succeeded.")

Key points

1. join()

Joins the elements of a sequence with the given separator string to produce a new string.

Syntax

str.join(sequence)

Parameters

sequence – the sequence of elements to join.

Return value

The new, joined string.

References:
https://www.runoob.com/python3/python3-string-join.html
https://www.runoob.com/python3/python3-string.html

# e.g. example

s1 = "-"
s2 = ""
seq = ("r", "u", "n", "o", "o", "b") # a sequence of strings
print (s1.join( seq ))
print (s2.join( seq ))
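Two details of join() matter for the scripts in this article. First, it accepts any iterable of strings, including a generator, which is why the result of jieba.cut() can be passed to it directly. Second, every element must already be a str. The sketch below uses a stand-in generator named tokens (not a jieba function) to show both points.

```python
def tokens():
    """Stand-in generator that yields words one at a time (hypothetical)."""
    for word in ("中文", "語料", "分詞"):
        yield word

# join() consumes the generator; no intermediate list is needed
joined = "\n".join(tokens())

numbers = [1, 2, 3]
try:
    "-".join(numbers)   # elements are ints, not strings
    failed = False
except TypeError:
    failed = True

# Convert each element to str first
fixed = "-".join(str(n) for n in numbers)

print(joined)
print(failed, fixed)
```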

3. Remove stop words

# -*- coding: utf-8 -*-
# Stop-word removal script.

import os

# Directory of the segmented classified corpus
seg_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_segment"+"\\"

# Directory of the classified corpus after stop-word removal
dropstopword_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_dropstopword"+"\\"

# Path of the stop-word file
stopword_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\stopword\\中文停用詞庫.txt"

# Load the local stop words, one per line, stripping the newline from each
with open(stopword_path, 'r', encoding='UTF-8') as fi:
    stopwords = [w.replace('\n', '') for w in fi.readlines()]


def drop_stopwords(contents, stopwords):
    """Return the tokens in contents that are not stop words."""
    contents_clean = []
    for word in contents:
        if word in stopwords:
            continue
        contents_clean.append(word)
    return contents_clean


# Create the output directory if it does not exist yet
if not os.path.exists(dropstopword_path):
    os.makedirs(dropstopword_path)

for file_path in os.listdir(seg_path):  # every file under seg_path
    file_name = seg_path + file_path  # full path of the file

    # Read the segmented corpus as a list of tokens
    # (built the same way as the stop-word list)
    with open(file_name, 'r') as file_read:
        raw_corpus = [s.replace('\n', '') for s in file_read.readlines()]

    drop_corpus = drop_stopwords(raw_corpus, stopwords)  # remove stop words

    # Write the remaining tokens, one per line, to a file of the same name
    with open(dropstopword_path + file_path, 'w') as file_write:
        file_write.write("\n".join(drop_corpus))

print("Stop-word removal succeeded.")
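With a stop-word list of a few thousand entries, testing "word in stopwords" against a list scans the list for every token. Storing the stop words in a set makes each lookup constant time on average. The sketch below is a variation on the script above, not the original code: load_stopwords is a hypothetical helper, the sample data is invented, and it uses strip(), which also removes surrounding spaces and carriage returns.

```python
def load_stopwords(lines):
    """Build a set of stop words from an iterable of lines, skipping blank ones."""
    return {line.strip() for line in lines if line.strip()}

def drop_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stop-word set, keeping their order."""
    return [t for t in tokens if t not in stopwords]

# In-memory sample data standing in for the stop-word file and a segmented file
stopword_lines = ["的\n", "了\n", "\n", "是\n"]
tokens = ["歷史", "的", "危機", "是", "古老", "了"]

stopwords = load_stopwords(stopword_lines)
cleaned = drop_stopwords(tokens, stopwords)
print(cleaned)
```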

Key points

See the "File chinese text preprocessing" notes in the Jupyter notebook.

4. Convert to TF-IDF vectors

Convert the segmented text files into vectors.

# -*- coding: UTF-8 -*-
# Convert the segmented, stop-word-free corpus into TF-IDF vectors.

import os

from gensim import corpora
from gensim import models

# Directory of the classified corpus after stop-word removal
dropstopword_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_dropstopword"+"\\"

# Directory for the BoW vectors
bow_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_bow"+"\\"

# Directory for the TF-IDF vectors
tfidf_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_tfidf"+"\\"

# Directory for the dictionary (token: ID)
dict_path = "C:\\Users\\Wu\\Desktop\\Now Go it\\text_mining\\text_corpus_dict"+"\\"


def read_tokens(file_name):
    """Read a segmented file (one token per line) into a list of tokens."""
    with open(file_name, 'r') as file_read:
        return [s.replace('\n', '') for s in file_read.readlines()]


# Build the corpus as a list of token lists: [[...], [...], [...], ...]
file_list = os.listdir(dropstopword_path)  # every file under dropstopword_path
original_corpus = [read_tokens(dropstopword_path + file_path) for file_path in file_list]

# Build the dictionary and the BoW vector of every document
dictionary = corpora.Dictionary(original_corpus)
bow_vec = [dictionary.doc2bow(text) for text in original_corpus]

# Save the dictionary (token: ID) to a file
if not os.path.exists(dict_path):  # create the directory if it does not exist
    os.makedirs(dict_path)
with open(dict_path + "dict", 'w') as file_write_dict:
    file_write_dict.write(str(dictionary.token2id))

# Save the BoW vectors to a file
if not os.path.exists(bow_path):  # create the directory if it does not exist
    os.makedirs(bow_path)
with open(bow_path + "bow_vec", 'w') as file_write_bow:
    file_write_bow.write(str(bow_vec))

# Train the TF-IDF model on the BoW vectors
tfidf = models.TfidfModel(bow_vec)

# Convert every document to a TF-IDF vector and save it under the same file name
if not os.path.exists(tfidf_path):  # create the directory if it does not exist
    os.makedirs(tfidf_path)
for file_path, bow in zip(file_list, bow_vec):
    tfidf_vec = tfidf[bow]
    with open(tfidf_path + file_path, 'w') as file_write_tfidf:
        file_write_tfidf.write(str(tfidf_vec))

print("\nTF-IDF conversion succeeded!")

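5. Summary: converting a test corpus into TF-IDF vectors in the dictionary space of the training corpus

1. Obtain the segmented corpus. Segment the training corpus train_corpus and remove stop words to get the segmented corpus participle_corpus.
2. Obtain the dictionary. Use gensim.corpora.Dictionary(participle_corpus) to build the dictionary from the segmented corpus. This defines the vector space: the number of tokens in the dictionary is the dimensionality of the space. (Inspect dictionary.token2id to see the one-to-one mapping between tokens and IDs.)
3. Obtain the BoW vectors. Use the dictionary's doc2bow() method to convert each document of the segmented corpus into a bag-of-words vector bow_vector. (doc2bow() takes a single document, that is, a one-level list of tokens.)
4. Train the TF-IDF model. Use gensim.models.TfidfModel(bow_vector) to train the model on the BoW vectors. (Here bow_vector is a two-level list: a list of BoW vectors.)
5. Obtain the TF-IDF vectors. Using the same dictionary, convert the test corpus test_corpus into BoW vectors (note: these are the BoW vectors of the test corpus), then apply the trained model with tfidf[bow_vector] to turn each BoW vector into a TF-IDF vector tfidf_vector. (Here bow_vector is a one-level list: a single BoW vector.)
6. Further processing. Run other algorithms on the TF-IDF vectors, such as document similarity computation or algorithms from the sklearn library.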
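To make steps 1 to 5 concrete without installing gensim, the sketch below re-implements them in plain Python. It is a simplified illustration, not gensim's code: it assumes the weighting that, as far as I know, gensim's TfidfModel uses by default (raw term count times log2(N / document frequency), then scaling the vector to unit length), and the token IDs it assigns will not match those of gensim's Dictionary. All function names and the sample documents are made up.

```python
import math
from collections import Counter

def build_dictionary(train_docs):
    """Step 2: map every training token to an integer ID."""
    token2id = {}
    for doc in train_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(doc, token2id):
    """Step 3: count known tokens; tokens missing from the dictionary are dropped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

def fit_idf(bow_corpus):
    """Step 4: idf = log2(number of documents / document frequency)."""
    n_docs = len(bow_corpus)
    doc_freq = Counter(tid for bow in bow_corpus for tid, _ in bow)
    return {tid: math.log2(n_docs / df) for tid, df in doc_freq.items()}

def to_tfidf(bow, idf):
    """Step 5: weight each count by idf, drop zero weights, normalise to unit length."""
    weights = [(tid, tf * idf[tid]) for tid, tf in bow if idf.get(tid, 0.0) > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(tid, w / norm) for tid, w in weights] if norm else []

# Step 1: an already segmented training corpus (one token list per document)
train_docs = [["歷史", "中國", "危機"], ["歷史", "古老"], ["歷史", "中國", "王者"]]
token2id = build_dictionary(train_docs)
bow_corpus = [doc2bow(d, token2id) for d in train_docs]
idf = fit_idf(bow_corpus)

# A test document is converted with the training dictionary and idf values
test_doc = ["歷史", "古老", "古老", "風衣"]
test_vec = to_tfidf(doc2bow(test_doc, token2id), idf)
print(token2id)
print(test_vec)
```

In this example "風衣" never occurs in the training corpus, so it has no ID and disappears from the test vector, while "歷史" occurs in every training document and therefore gets an idf of zero. Only "古老" contributes to the result.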

Further notes

# Example: load a dataset through the gensim.downloader API

import gensim.downloader as api
from gensim.models import TfidfModel
from gensim.corpora import Dictionary

dataset = api.load("text8")
dct = Dictionary(dataset)  # fit dictionary
corpus = [dct.doc2bow(line) for line in dataset]  # convert corpus to BoW format

model = TfidfModel(corpus)  # fit model
vector = model[corpus[0]]  # apply model to the first corpus document
print(vector)

# Load a saved model when it is needed.
# This assumes a model was saved earlier, e.g. with tfidf.save("my_model.tfidf"),
# and reuses the dictionary built in section 4.
import pprint
from gensim import models

tfidf = models.TfidfModel.load("my_model.tfidf")

words = "歷史學 中國 古老 二十世紀 危機 王者 風衣".lower().split()
pprint.pprint(tfidf[dictionary.doc2bow(words)])
