[NLP Study Notes] (3) Using gensim: Similarity Queries

Posted by alexbyy on 2018-12-12

Original URL: https://flycode.co/archives/166114

Similarity Queries

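This article is mainly translated from https://radimrehurek.com/gensim/tut3.html
In the previous tutorials, "Corpora and Vector Spaces" and "Topics and Transformations", we learned how to represent a corpus in the vector space model and how to transform it between different vector spaces. In practice, one of the most common reasons for doing this is to compare the similarity of two documents, or to compare one document against a set of others (for example, matching a user query against documents that have already been indexed).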

Loading the dictionary and corpus

As in the previous chapter, first load the dictionary and corpus that were saved in Chapter 1.

from gensim import corpora, models, similarities
import os

# Load the dictionary and corpus saved in Chapter 1, if they exist.
if os.path.exists('./gensim_out/deerwester.dict'):
    dictionary = corpora.Dictionary.load('./gensim_out/deerwester.dict')
    corpus = corpora.MmCorpus('./gensim_out/deerwester.mm')
    print("Using the previously saved dictionary and corpus vectors")
else:
    print("Please follow Chapter 1 first to generate deerwester.dict and deerwester.mm")

Step 1

Define the LSI model, then transform the corpus into LSI space and build an index from it.

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

index = similarities.MatrixSimilarity(lsi[corpus])  # transform the corpus to LSI space and index it
index.save('./gensim_out/deerwester.index')  # save the built index
index = similarities.MatrixSimilarity.load('./gensim_out/deerwester.index')  # load the index from the saved file

Step 2

Suppose we want to query a new text, "human computer interaction". We expect to get back the three documents most similar to it.

doc = 'human computer interaction'
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
print(vec_lsi)
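
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of what a conversion like doc2bow does: look up each token's integer id, count occurrences, and drop tokens that are missing from the dictionary. The token2id mapping and the helper name to_bow are hypothetical and chosen only for illustration; the real ids come from the saved deerwester.dict.

```python
from collections import Counter

# Hypothetical token -> id mapping; real ids come from the saved dictionary.
token2id = {"computer": 0, "human": 1, "interface": 2}

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

doc = "human computer interaction"
bow = to_bow(doc.lower().split(), token2id)
print(bow)  # [(0, 1), (1, 1)]; "interaction" is not in the mapping
```

With this mapping, the word "interaction" contributes nothing to the query vector because it is not a known token.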

Step 3

Compare the new text's vector, vec_lsi, against the corpus to get their similarities.

sims = index[vec_lsi]
print(list(enumerate(sims)))  # print the result as (document_number, document_similarity) 2-tuples

The output of the code above is:
[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.1063926), (7, -0.098794639), (8, 0.05004178)]

(0, 0.99809301) means that the similarity between document 0 and the new document is 0.99809301.
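
The scores returned by MatrixSimilarity are cosine similarities, which range from -1 to 1; that is why some documents get negative values. The sketch below computes the measure in plain Python. The function name cosine_similarity and the 2-dimensional vectors are hypothetical illustrations, not values from the LSI model above, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical 2-topic vectors, not taken from the model above.
query = [1.0, 0.0]
print(cosine_similarity(query, [2.0, 0.0]))   # same direction: 1.0
print(cosine_similarity(query, [0.0, 3.0]))   # orthogonal: 0.0
print(cosine_similarity(query, [-1.0, 0.0]))  # opposite direction: -1.0
```

Only the direction of the vectors matters, not their length, so a long document and a short one on the same topic can still score close to 1.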

Sort the results above in descending order of similarity:

sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)

Result:

[(2, 0.99844527), # The EPS user interface management system
(0, 0.99809301), # Human machine interface for lab abc computer applications
(3, 0.9865886), # System and human system engineering testing of EPS
(1, 0.93748635), # A survey of user opinion of computer system response time
(4, 0.90755945), # Relation of user perceived response time to error measurement
(8, 0.050041795), # Graph minors A survey
(7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering
(6, -0.1063926), # The intersection graph of paths in trees
(5, -0.12416792)] # The generation of random binary unordered trees

We can see that the three documents most similar to the query "human computer interaction" are documents 2, 0, and 3.
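
When only the top few matches are needed, a full sort is not required. As a small sketch, the standard-library heapq.nlargest can pick the three best (document_number, similarity) pairs directly. The scores below are copied from the output listed in this article; the variable name top3 is just illustrative.

```python
import heapq

# Similarity scores copied from the output listed in this article.
sims = [0.99809301, 0.93748635, 0.99844527, 0.9865886, 0.90755945,
        -0.12416792, -0.1063926, -0.098794639, 0.05004178]

# Keep only the three highest-scoring (document_number, similarity) pairs.
top3 = heapq.nlargest(3, enumerate(sims), key=lambda item: item[1])
print(top3)  # [(2, 0.99844527), (0, 0.99809301), (3, 0.9865886)]
```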
