Elasticsearch Algorithms: Term Similarity Algorithms (Part 1)

Posted by 無風聽海 on 2022-01-20

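1. Term Similarity

elasticsearch supports spelling correction, and producing its suggestions requires computing the similarity between terms. In this article we study term similarity through several distance measures.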

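2. Data Preparation

To compute the similarity between terms, each term must first be turned into a vector. There are two ways to do this; the one used here is character vectorization, which maps each character to a unique number. The character's code point can be used directly for this: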

import numpy as np

def vectorize_words(words):
    # Lowercase each word so that the encoding is case-insensitive.
    lower_words = [word.lower() for word in words]
    # Map every character to its Unicode code point.
    return [np.array([ord(c) for c in word]) for word in lower_words]

valn = 'valn'
vlna = 'vlna'
vlan233 = 'vlan233'
http = 'http'

valn_vector, vlna_vector, vlan233_vector, http_vector = vectorize_words([valn, vlna, vlan233, http])
print(f'valn_vector: {valn_vector}')
print(f'vlna_vector: {vlna_vector}')
print(f'vlan233_vector: {vlan233_vector}')
print(f'http_vector: {http_vector}')

Output:

valn_vector: [118  97 108 110]
vlna_vector: [118 108 110  97]
vlan233_vector: [118 108  97 110  50  51  51]
http_vector: [104 116 116 112]
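
As a small sanity check of this encoding, the sketch below shows that lowercasing makes the mapping case-insensitive, that `chr` inverts `ord`, and that words of different lengths produce vectors of different shapes. The helper name `to_vector` is illustrative and not part of the original code.

```python
import numpy as np

def to_vector(word):
    # Hypothetical helper: lowercase the word, then map each character to its code point.
    return np.array([ord(c) for c in word.lower()])

# Upper- and lowercase spellings map to the same vector.
same_encoding = bool(np.array_equal(to_vector('VLAN'), to_vector('vlan')))

# chr() inverts ord(), so the (lowercased) word can be recovered from its vector.
decoded = ''.join(chr(code) for code in to_vector('vlan'))

# Words of different lengths give vectors of different shapes.
shapes = (to_vector('vlan').shape, to_vector('vlan233').shape)

print(same_encoding, decoded, shapes)
```

The differing shapes matter for the position-by-position distances that follow, which only compare vectors of equal length.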

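3. Hamming Distance

Hamming distance is a very popular distance measure that is widely used in information theory and communications. It is the number of positions at which two strings of equal length hold different characters or symbols. For two terms u and v of length n, where each term of the sum counts as 1 when the characters differ and 0 otherwise, the Hamming distance is: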

\[hd(u,v)=\sum_{i=1}^{n}(u_{i}\ne v_{i} ) \]

Dividing by the length of the terms gives the normalized Hamming distance:

\[norm\_hd(u,v) = \frac {\sum_{i=1}^{n}(u_{i}\ne v_{i} )} {n} \]
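
As a worked instance of these two formulas, take u = valn and v = vlan (n = 4). The first and last characters match, while the second and third do not:

\[hd(\text{valn},\text{vlan}) = 0 + 1 + 1 + 0 = 2, \qquad norm\_hd(\text{valn},\text{vlan}) = \frac{2}{4} = 0.5 \]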

We use the following hamming_distance function to compute the Hamming distance; the norm parameter controls whether the result is normalized:

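def hamming_distance(u, v, norm=True):
    # The distance is only defined for vectors of equal length.
    if u.shape != v.shape:
        raise ValueError('the vectors must have equal lengths.')

    # (u != v) is a boolean array: mean() gives the normalized distance, sum() the raw count.
    return (u!=v).mean() if norm else (u!=v).sum()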
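
The shape check means that words of different lengths, such as vlan and vlan233, cannot be compared with this function. The sketch below repeats the two definitions so that it runs on its own, and shows the resulting ValueError; the variable `raised` is only there for illustration.

```python
import numpy as np

def vectorize_words(words):
    # Same encoding as above: lowercase, then one code point per character.
    return [np.array([ord(c) for c in word.lower()]) for word in words]

def hamming_distance(u, v, norm=True):
    if u.shape != v.shape:
        raise ValueError('the vectors must have equal lengths.')
    return (u != v).mean() if norm else (u != v).sum()

vlan_vector, vlan233_vector = vectorize_words(['vlan', 'vlan233'])

try:
    hamming_distance(vlan_vector, vlan233_vector)
    raised = False
except ValueError as err:
    # Shapes (4,) and (7,) differ, so the function refuses to compare them.
    raised = True
    print(f'cannot compare: {err}')
```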

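The following code computes the Hamming distance between valn and three other words.

The output shows that the smallest Hamming distance is 2 (between valn and vlan), which becomes 0.5 after normalization.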

vlan = 'vlan'
vlna = 'vlna'
http='http'
words = [vlan, vlna, http]


input_word = 'valn'
input_vector = vectorize_words([input_word]).pop()

word_vectors = vectorize_words(words)
for word, vector in zip(words, word_vectors):
    print(f'{input_word} and {word} hamming distance is {hamming_distance(input_vector, vector, norm=False)}')

print()
for word, vector in zip(words, word_vectors):
    print(f'{input_word} and {word} hamming distance is {hamming_distance(input_vector, vector)}')

Output:

valn and vlan hamming distance is 2
valn and vlna hamming distance is 3
valn and http hamming distance is 4

valn and vlan hamming distance is 0.5
valn and vlna hamming distance is 0.75
valn and http hamming distance is 1.0
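
To connect these numbers back to spelling suggestions, here is a minimal sketch that picks the candidate with the smallest normalized Hamming distance. It only illustrates the idea and is not how elasticsearch itself ranks suggestions; `closest_word` is a hypothetical helper, and skipping candidates of a different length is an assumption made here because Hamming distance is undefined for them.

```python
import numpy as np

def vectorize(word):
    # One code point per lowercased character.
    return np.array([ord(c) for c in word.lower()])

def normalized_hamming(u, v):
    if u.shape != v.shape:
        raise ValueError('the vectors must have equal lengths.')
    return float((u != v).mean())

def closest_word(input_word, candidates):
    # Hypothetical helper: score only the candidates that have the same length
    # as the input, then return the (distance, word) pair with the smallest distance.
    input_vector = vectorize(input_word)
    scored = [
        (normalized_hamming(input_vector, vectorize(candidate)), candidate)
        for candidate in candidates
        if len(candidate) == len(input_word)
    ]
    return min(scored) if scored else None

best = closest_word('valn', ['vlan', 'vlna', 'http', 'vlan233'])
print(best)  # (0.5, 'vlan')
```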

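4. Manhattan Distance

Manhattan distance is the sum of the absolute differences between the characters at each position of two strings. It is also known as city block distance, the L1 norm (of the difference vector), or the taxicab metric.

For two words u and v, again of length n, the Manhattan distance is: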

\[md(u,v)=\|u - v\|_{1} = \sum_{i=1}^{n}|u_{i} - v_{i}| \]

As before, dividing by the length of the words gives the normalized Manhattan distance:

\[norm\_md(u,v)=\frac {\|u - v\|_{1}} {n} = \frac {\sum_{i=1}^{n}|u_{i} - v_{i}|} {n} \]

The following function computes the Manhattan distance; again the norm parameter controls normalization:

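def manhattan_distance(u, v, norm=True):
    # The distance is only defined for vectors of equal length.
    if u.shape != v.shape:
        raise ValueError('the vectors must have equal lengths.')

    # mean() of the absolute differences is the normalized distance, sum() the raw one.
    return abs(u-v).mean() if norm else abs(u-v).sum()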

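With the same words, the following code computes the Manhattan distance.

Again the closest pair is valn and vlan, with a Manhattan distance of 22, or 5.5 after normalization.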

vlan = 'vlan'
vlna = 'vlna'
http='http'
words = [vlan, vlna, http]


input_word = 'valn'
input_vector = vectorize_words([input_word]).pop()
word_vectors = vectorize_words(words)
for word, vector in zip(words, word_vectors):
    print(f'{input_word} and {word} manhattan distance is {manhattan_distance(input_vector, vector, norm=False)}')

print()
for word, vector in zip(words, word_vectors):
    print(f'{input_word} and {word} manhattan distance is {manhattan_distance(input_vector, vector)}')

Output:

valn and vlan manhattan distance is 22
valn and vlna manhattan distance is 26
valn and http manhattan distance is 43

valn and vlan manhattan distance is 5.5
valn and vlna manhattan distance is 6.5
valn and http manhattan distance is 10.75
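
Since the Manhattan distance is the L1 norm of the difference vector, the result can be cross-checked against NumPy's `np.linalg.norm` with `ord=1`. The sketch below does this for valn and vlan; the variable names are illustrative.

```python
import numpy as np

u = np.array([ord(c) for c in 'valn'])  # [118,  97, 108, 110]
v = np.array([ord(c) for c in 'vlan'])  # [118, 108,  97, 110]

# Sum of absolute differences: 0 + 11 + 11 + 0.
manual = int(np.abs(u - v).sum())

# The same quantity expressed as the L1 norm of the difference vector.
via_norm = float(np.linalg.norm(u - v, ord=1))

# Normalized by the word length n = 4.
normalized = manual / len(u)

print(manual, via_norm, normalized)  # 22 22.0 5.5
```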

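5. Euclidean Distance

Euclidean distance is the length of the shortest straight line between two points. It is the L2 norm (Euclidean norm) of the difference vector and is also called the L2 distance.

For two words u and v, again of length n, the Euclidean distance is: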

\[ed(u,v)=\|u - v\|_{2} = \sqrt{\sum_{i=1}^{n}(u_{i} - v_{i})^2} \]
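
As a worked instance of this formula, take u = valn and v = vlan, whose character codes are [118, 97, 108, 110] and [118, 108, 97, 110]:

\[ed(\text{valn},\text{vlan}) = \sqrt{0^2 + (-11)^2 + 11^2 + 0^2} = \sqrt{242} \approx 15.556 \]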

We use the following function to compute the Euclidean distance:

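def euclidean_distance(u, v):
    # The distance is only defined for vectors of equal length.
    if u.shape != v.shape:
        raise ValueError('the vectors must have equal lengths.')

    # Square root of the sum of squared per-position differences.
    return np.sqrt(np.sum(np.square(u - v)))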

With the same words, the following code computes the Euclidean distance:

vlan = 'vlan'
vlna = 'vlna'
http='http'
words = [vlan, vlna, http]


input_word = 'valn'
input_vector = vectorize_words([input_word]).pop()
word_vectors = vectorize_words(words)
for word, vector in zip(words, word_vectors):
    print(f'{input_word} and {word} euclidean distance is {euclidean_distance(input_vector, vector)}')

Output:

valn and vlan euclidean distance is 15.556349186104045
valn and vlna euclidean distance is 17.146428199482248
valn and http euclidean distance is 25.0
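
To compare the three measures side by side, the sketch below ranks the same candidates for the input valn under each of them. The `metrics` dictionary and the `vectorize` helper are illustrative names, and `np.linalg.norm` (whose default for a 1-D vector is the L2 norm) is used here in place of the explicit square-root formula.

```python
import numpy as np

def vectorize(word):
    # One code point per lowercased character.
    return np.array([ord(c) for c in word.lower()])

# Raw (non-normalized) versions of the three distances discussed in this article.
metrics = {
    'hamming': lambda u, v: int((u != v).sum()),
    'manhattan': lambda u, v: int(np.abs(u - v).sum()),
    'euclidean': lambda u, v: float(np.linalg.norm(u - v)),
}

input_vector = vectorize('valn')
words = ['http', 'vlna', 'vlan']

rankings = {}
for name, metric in metrics.items():
    # Sort the candidates from closest to farthest under this metric.
    rankings[name] = sorted(words, key=lambda w: metric(input_vector, vectorize(w)))
    print(name, rankings[name])
```

For this particular input all three measures produce the same order (vlan, then vlna, then http), which matches the outputs listed in the sections above.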
