Information Retrieval (6) -- Text Analysis and Automatic Indexing (Part 3)

Posted by Chilia Waterhouse on 2020-12-21

Thesaurus and automatic term association

  1. The WordNet and HowNet resources discussed above are built from expert knowledge and a great deal of manual curation. Can similar terms be generated automatically instead?
    Definition: Two words are similar if they co-occur with similar words (much the same idea as word2vec; see the term-clustering sketch below).
  2. Clustering (covered in the data mining course, so not discussed in depth here)
  • Partitional clustering
    The most typical example is k-means:
    [Figure: the k-means algorithm]
    This is the classic E-M algorithm idea: in the E-step the parameters θ (the cluster centroid positions) are held fixed and the latent variable for each point (which cluster it belongs to) is inferred; in the M-step the centroids are re-estimated to maximise the likelihood given those assignments.

    Stopping conditions for k-means:

    • A fixed number of iterations.
    • Term partition unchanged.
    • Centroid positions don’t change.

    Advantages and disadvantages of k-means:
    [Figure: advantages and disadvantages of k-means]

  • Hierarchical clustering

    • Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
    • Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.

      Methods for measuring inter-cluster similarity in hierarchical clustering (a short sketch follows this list):
    • Single linkage (nearest neighbor): the distance between two clusters is determined by the distance of the two closest objects in the different clusters.
    • Complete linkage (furthest neighbor): the distance between two clusters is determined by the greatest distance between any two objects in the different clusters (i.e., by the “furthest neighbors”).
    • Group average linkage: the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.

      Advantages and disadvantages of hierarchical clustering:
      • No need to specify the number of clusters in advance.
      • Hierarchical nature maps nicely onto human intuition for some domains.
      • They do not scale well: time complexity of at least O(n²), where n is the number of total objects.
      • Like any heuristic search algorithm, local optima are a problem.
      • Interpretation of results is (very) subjective.
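
The three linkage criteria above differ only in how inter-cluster distance is computed. As a minimal sketch (not from the original notes), scipy's agglomerative clustering can be run with each criterion; the terms and vectors below are invented for illustration.

```python
# Minimal sketch (not from the original notes): agglomerative clustering of a few
# invented term vectors under the three linkage criteria described above.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

terms = ["apple", "pear", "fruit", "cpu", "gpu", "computer"]
vectors = np.array([
    [1.0, 0.9, 0.1],   # apple
    [0.9, 1.0, 0.1],   # pear
    [0.8, 0.8, 0.2],   # fruit
    [0.1, 0.1, 1.0],   # cpu
    [0.2, 0.1, 0.9],   # gpu
    [0.1, 0.2, 0.8],   # computer
])

for method in ("single", "complete", "average"):
    # linkage() builds the merge tree bottom-up (agglomerative clustering).
    Z = linkage(vectors, method=method, metric="euclidean")
    # Cut the tree into two flat clusters just to inspect the result.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, dict(zip(terms, labels)))
```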

Term clustering groups terms according to the term-term similarity shown below, aiming for high intra-cluster similarity and low inter-cluster similarity (see the sketch after the figure).
[Figure: term-term similarity]
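
To make the link between co-occurrence similarity (the word2vec-like definition above) and term clustering concrete, here is a minimal sketch, not from the original notes: it builds co-occurrence vectors for terms from a tiny invented corpus and clusters them with a bare-bones k-means, whose loop is exactly the E-step/M-step alternation described earlier.

```python
# Minimal sketch (not from the notes): cluster terms by the similarity of their
# co-occurrence vectors using a bare-bones k-means. The corpus is invented.
import numpy as np

docs = [
    ["apple", "pie", "fruit"],
    ["apple", "fruit", "juice"],
    ["pear", "fruit", "juice"],
    ["apple", "computer", "cpu"],
    ["gpu", "computer", "cpu"],
    ["computer", "gpu", "driver"],
]

vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

# Term-term co-occurrence counts within a document.
cooc = np.zeros((len(vocab), len(vocab)))
for d in docs:
    for w in d:
        for v in d:
            if w != v:
                cooc[idx[w], idx[v]] += 1

# Row-normalise so that terms with similar co-occurrence profiles get close vectors.
vecs = cooc / np.maximum(cooc.sum(axis=1, keepdims=True), 1)

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):  # stop after a fixed number of iterations
        # "E-step": centroids fixed, assign each point to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # "M-step": assignments fixed, move each centroid to the mean of its points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(vecs, k=2)
for w in vocab:
    print(w, labels[idx[w]])
```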

Construction of term phrases

"重症流感藥物"在這裡看成了一個短語。
Identifying phrases with a statistical method (a small sketch follows the formula):
$MI_{kh} = \frac{Pair_{kh}}{TF_k \cdot TF_h}$
where k and h are terms, Pair_{kh} is the number of times k and h occur adjacent to each other, and TF denotes the total term frequency.
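
A minimal sketch of this statistic (the corpus and cut-off are invented for illustration): count adjacent pairs and total term frequencies, then keep high-MI pairs as candidate phrases.

```python
# Minimal sketch: score adjacent word pairs with MI_kh = Pair_kh / (TF_k * TF_h).
# The corpus and threshold are invented for illustration.
from collections import Counter

docs = [
    ["severe", "influenza", "drug", "trial"],
    ["new", "severe", "influenza", "drug"],
    ["drug", "for", "severe", "influenza"],
]

tf = Counter(w for d in docs for w in d)                                    # TF_k: total term frequencies
pair = Counter((d[i], d[i + 1]) for d in docs for i in range(len(d) - 1))   # Pair_kh: adjacency counts

def mi(k, h):
    return pair[(k, h)] / (tf[k] * tf[h])

# Keep adjacent pairs whose MI exceeds a (hypothetical) threshold as phrases.
threshold = 0.1
phrases = sorted(((mi(k, h), k, h) for (k, h) in pair if mi(k, h) >= threshold), reverse=True)
for score, k, h in phrases:
    print(f"{k} {h}: {score:.3f}")
```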

Summary

Automatic indexing process capable of producing high-performance retrieval results (a small routing sketch follows this list):

  • (1) Terms in the medium-frequency ranges with positive discrimination values are used as index terms directly without further transformation.
  • (2) The broad high-frequency terms with negative discrimination values are either discarded or incorporated into phrases with low-frequency terms.
  • (3) The narrow low-frequency terms with discrimination values close to zero are broadened by inclusion into thesaurus categories.
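
These three rules amount to routing each term by its frequency and discrimination value. A minimal sketch, with made-up thresholds (low_tf, high_tf) and a hypothetical discrimination_value input, since neither is specified numerically in the notes:

```python
# Minimal sketch of the three indexing rules above. The thresholds and the
# discrimination values are placeholders, not from the original notes.
def route_term(term, tf, discrimination_value, low_tf=5, high_tf=100):
    """Decide how a term enters the index, following rules (1)-(3)."""
    if low_tf <= tf <= high_tf and discrimination_value > 0:
        return "use directly as index term"          # rule (1): medium frequency, positive DV
    if tf > high_tf and discrimination_value < 0:
        return "discard or combine into phrases"     # rule (2): broad high-frequency term
    if tf < low_tf and abs(discrimination_value) < 1e-3:
        return "broaden via thesaurus class"         # rule (3): narrow low-frequency term
    return "no rule applies"

print(route_term("retrieval", tf=40, discrimination_value=0.2))
print(route_term("the", tf=5000, discrimination_value=-0.5))
print(route_term("zymurgy", tf=2, discrimination_value=0.0))
```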

Expand query:
Term ambiguity may introduce irrelevant statistically correlated terms.
e.g. “Apple computer” → “Apple red fruit computer”
So, only expand the query with terms that are similar to all terms in the query.
e.g. “fruit” is not added to “Apple computer” since it is far from “computer”; “fruit” is added to “apple pie” since “fruit” is close to both “apple” and “pie”.
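
A minimal sketch of this “similar to all query terms” rule; the similarity table and threshold are invented for illustration.

```python
# Minimal sketch: expand a query only with terms similar to ALL query terms.
# The similarity scores and threshold are invented for illustration.
sim = {
    ("fruit", "apple"): 0.8, ("fruit", "computer"): 0.1, ("fruit", "pie"): 0.7,
    ("macbook", "apple"): 0.6, ("macbook", "computer"): 0.9, ("macbook", "pie"): 0.1,
}

def similarity(a, b):
    return sim.get((a, b), sim.get((b, a), 0.0))

def expand(query_terms, candidates, threshold=0.5):
    expanded = list(query_terms)
    for c in candidates:
        # Add c only if it is close to every term already in the query.
        if min(similarity(c, q) for q in query_terms) >= threshold:
            expanded.append(c)
    return expanded

print(expand(["apple", "computer"], ["fruit", "macbook"]))  # "fruit" rejected, "macbook" added
print(expand(["apple", "pie"], ["fruit", "macbook"]))       # "fruit" added, "macbook" rejected
```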
