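Information Retrieval (6) -- Text Analysis and Automatic Indexing (Part 3)
Thesaurus and automatic term association
- The WordNet and HowNet resources discussed in the previous post were both built from expert knowledge and a great deal of manual curation. So, can similar words be generated automatically?
Definition: Two words are similar if they co-occur with similar words (an idea close to word2vec).
- Clustering (already covered in the data mining notes, so not discussed in depth here)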
- Partitional clustering
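The most typical example is k-means:
This is a typical instance of the EM (expectation-maximization) idea. One step fixes the parameters θ (the positions of the cluster centroids) and determines the latent variables (which cluster each point belongs to); the other step fixes those assignments and updates θ so that the centroids best fit their assigned points.
Stopping conditions for K-means: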
- A fixed number of iterations.
- Term partition unchanged.
- Centroid positions don’t change.
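To make the loop and its three stopping conditions concrete, here is a minimal pure-Python sketch. It is only an illustration under assumptions not in the original notes: points are tuples of numbers, distance is squared Euclidean, the initial centroids are k randomly sampled input points, and the names `kmeans`, `max_iters`, and `tol` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))

def kmeans(points, k, max_iters=100, tol=1e-9, seed=0):
    """Return (assignment, centroids) for `points` split into k clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # assumption: start from k input points
    assignment = None
    for _ in range(max_iters):             # stop 1: a fixed number of iterations
        # Assignment step: centroids fixed, each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:   # stop 2: partition unchanged
            break
        assignment = new_assignment
        # Update step: assignments fixed, each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            new_centroids.append(mean(members) if members else centroids[c])
        shift = max(sq_dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:                   # stop 3: centroid positions don't change
            break
    return assignment, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

An empty cluster simply keeps its previous centroid here; real implementations often re-seed it instead.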
Advantages and disadvantages of K-means:
- Hierarchical clustering
- Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
- Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
Ways to measure inter-cluster similarity in hierarchical clustering:
- Single linkage (nearest neighbor): the distance between two clusters is determined by the distance between the two closest objects in the different clusters.
- Complete linkage (furthest neighbor): the distance between two clusters is determined by the greatest distance between any two objects in the different clusters (i.e., by the “furthest neighbors”).
- Group average linkage: the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
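The three linkage criteria differ only in how they aggregate the pairwise distances, which a short sketch makes visible. This is an illustration rather than a reference implementation: Euclidean distance and the function names (`single_linkage`, `complete_linkage`, `average_linkage`, `agglomerate`) are assumptions, and the brute-force search over all cluster pairs is meant only for tiny inputs.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(c1, c2, dist=euclidean):
    """Distance of the two closest objects across the clusters."""
    return min(dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2, dist=euclidean):
    """Distance of the two furthest objects across the clusters."""
    return max(dist(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2, dist=euclidean):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(items, linkage, num_clusters=1):
    """Bottom-up clustering: repeatedly merge the closest pair of clusters."""
    clusters = [[x] for x in items]        # every item starts in its own cluster
    while len(clusters) > num_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

a, b = [(0,), (1,)], [(3,), (5,)]
print(single_linkage(a, b), complete_linkage(a, b), average_linkage(a, b))
# 2.0 5.0 3.5
print(agglomerate([(0,), (1,), (10,), (11,)], single_linkage, num_clusters=2))
```

Stopping at `num_clusters` is a convenience added here; running down to a single cluster and recording each merge would give the full hierarchy.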
Advantages and disadvantages of hierarchical clustering:
- No need to specify the number of clusters in advance.
- The hierarchical nature maps nicely onto human intuition for some domains.
- It does not scale well: the time complexity is at least O(n^2), where n is the total number of objects.
- As with any heuristic search algorithm, local optima are a problem.
- Interpretation of the results is (very) subjective.
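Clustering terms means grouping them by term-term similarity, so as to obtain high intra-cluster similarity and low inter-cluster similarity.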
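One possible way to get a term-term similarity from the definition given earlier (two words are similar if they co-occur with similar words) is to compare context-count vectors. The sketch below is an assumption-laden illustration: the window size, the use of cosine similarity, the function names, and the toy corpus are all made up for this example and do not come from the original notes.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy corpus, invented for illustration.
corpus = [
    ["doctor", "treats", "patient"],
    ["nurse", "treats", "patient"],
    ["chef", "cooks", "dinner"],
]
vecs = context_vectors(corpus)
print(cosine(vecs["doctor"], vecs["nurse"]))  # close to 1: identical contexts
print(cosine(vecs["doctor"], vecs["chef"]))   # 0.0: no shared context words
```

A matrix of such similarities (or 1 minus the similarity as a distance) could then be fed to either clustering family described above.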
Construction of term phrases
Here, "重症流感藥物" ("severe influenza medication") is treated as a single phrase.
Determining phrases with a statistical method:
$$MI_{kh} = \frac{Pair_{kh}}{TF_k \, TF_h}$$
where k and h are terms, Pair_kh is the number of times k and h occur next to each other, and TF is the total frequency of a term.
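The score above is easy to compute from token counts. The sketch below is a minimal illustration with two assumptions that the notes do not state: the text is already tokenized, and "adjacent" means k is immediately followed by h. The name `phrase_scores` and the sample sentence are hypothetical.

```python
from collections import Counter

def phrase_scores(tokens):
    """Return {(k, h): Pair_kh / (TF_k * TF_h)} for every adjacent pair k h."""
    tf = Counter(tokens)                       # TF: total frequency of each term
    pair = Counter(zip(tokens, tokens[1:]))    # Pair_kh: times k is directly followed by h
    return {(k, h): n / (tf[k] * tf[h]) for (k, h), n in pair.items()}

tokens = "new york is a new city in new york".split()
scores = phrase_scores(tokens)
print(scores[("new", "york")])   # 2 / (3 * 2)
print(scores[("is", "a")])       # 1 / (1 * 1)
```

Note that a pair of words that each occur only once gets the maximum score of 1, so in practice a minimum pair count is usually required before a pair is accepted as a phrase.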
Summary
Automatic indexing process capable of producing high-performance retrieval results:
- (1) Terms in the medium-frequency ranges with positive discrimination values are used as index terms directly without further transformation.
- (2) The broad high-frequency terms with negative discrimination values are either discarded or incorporated into phrases with low-frequency terms.
- (3) The narrow low-frequency terms with discrimination values close to zero are broadened by inclusion into thesaurus categories.
Expand query:
Term ambiguity may introduce irrelevant statistically correlated terms.
e.g. “Apple computer” -> “Apple red fruit computer”
So, only expand query with terms that are similar to all terms in the query.
e.g. “fruit” is not added to “Apple computer” since it is far from “computer.” “fruit” is added to “apple pie” since “fruit” is close to both “apple” and “pie.”
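The rule above (only add a term that is similar to all query terms) can be sketched as a simple filter. Everything specific here is an assumption for illustration: the function name `expand_query`, the threshold of 0.5, and the similarity scores in `TOY_SIM`, which are invented numbers rather than measured values.

```python
def expand_query(query_terms, candidates, sim, threshold=0.5):
    """Add a candidate only if it is similar to every term in the query."""
    expanded = list(query_terms)
    for cand in candidates:
        if cand in expanded:
            continue
        if all(sim(cand, q) >= threshold for q in query_terms):
            expanded.append(cand)
    return expanded

# Hypothetical similarity scores, invented for illustration only.
TOY_SIM = {
    ("fruit", "apple"): 0.8,
    ("fruit", "pie"): 0.6,
    ("fruit", "computer"): 0.1,
}

def toy_sim(a, b):
    """Symmetric lookup in the toy table; unknown pairs count as dissimilar."""
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))

print(expand_query(["apple", "computer"], ["fruit"], toy_sim))  # ['apple', 'computer']
print(expand_query(["apple", "pie"], ["fruit"], toy_sim))       # ['apple', 'pie', 'fruit']
```

Requiring similarity to every query term is what blocks “fruit” in the first query: it is close to “apple” but not to “computer”.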