Text similarity in Java with the Levenshtein distance (edit distance) algorithm: code and explanation

Posted by 張小凡vip on 2014-07-04

Original article: https://blog.csdn.net/q383965374/article/details/36901149

Algorithm implementation:

package com.util;

public class SimFeatureUtil {

	private static int min(int one, int two, int three) {
		int min = one;
		if (two < min) {
			min = two;
		}
		if (three < min) {
			min = three;
		}
		return min;
	}

	public static int ld(String str1, String str2) {
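		int d[][]; // distance matrix
		int n = str1.length();
		int m = str2.length();
		int i; // index into str1
		int j; // index into str2
		char ch1; // current character of str1
		char ch2; // current character of str2
		int temp; // substitution cost for the current cell: 0 if the characters match, otherwise 1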
		if (n == 0) {
			return m;
		}
		if (m == 0) {
			return n;
		}
		d = new int[n + 1][m + 1];
		for (i = 0; i <= n; i++) { // initialize the first column
			d[i][0] = i;
		}
		for (j = 0; j <= m; j++) { // initialize the first row
			d[0][j] = j;
		}
		for (i = 1; i <= n; i++) { // iterate over str1
			ch1 = str1.charAt(i - 1);
			// compare against each character of str2
			for (j = 1; j <= m; j++) {
				ch2 = str2.charAt(j - 1);
				if (ch1 == ch2) {
					temp = 0;
				} else {
					temp = 1;
				}
				// take the minimum of: cell above + 1, cell to the left + 1, upper-left diagonal + temp
				d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + temp);
			}
		}
		return d[n][m];
	}
	public static double sim(String str1, String str2) {
		int maxLen = Math.max(str1.length(), str2.length());
		if (maxLen == 0) {
			// two empty strings are identical; this also avoids 0/0, which would yield NaN
			return 1.0;
		}
		return 1.0 - (double) ld(str1, str2) / maxLen;
	}

	public static void main(String[] args) {
		String str1 = "測試12";
		String str2 = "測試123";
		System.out.println("ld=" + ld(str1, str2));
		System.out.println("sim=" + sim(str1, str2));
	}
}

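About the algorithm:

The edit distance, also called the Levenshtein distance, between two strings is the minimum number of edit operations needed to turn one into the other. The permitted operations are substituting one character for another, inserting a character, and deleting a character.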
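As a small illustration of the three permitted operations, the sketch below applies them with `StringBuilder`. The class name `EditOperationsDemo` and both helper methods are made up for this example. It only shows that two edits are enough to turn "GUMBO" into "GAMBOL"; it does not prove that two is the minimum.

```java
final class EditOperationsDemo {

    // Turns "GUMBO" into "GAMBOL" with two single-character edits.
    static String gumboToGambol() {
        StringBuilder sb = new StringBuilder("GUMBO");
        sb.setCharAt(1, 'A'); // substitution: 'U' becomes 'A', giving "GAMBO"
        sb.append('L');       // insertion: 'L' is added at the end, giving "GAMBOL"
        return sb.toString();
    }

    // Deletion, the third operation: drops the last character of a non-empty string.
    static String deleteLast(String s) {
        return new StringBuilder(s).deleteCharAt(s.length() - 1).toString();
    }

    public static void main(String[] args) {
        System.out.println(gumboToGambol());
        System.out.println(deleteLast("GAMBOL"));
    }
}
```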

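How the algorithm works:

Let d[i, j] (the values can be stored in a two-dimensional array) be the minimum number of steps needed to convert the string s[1…i] into the string t[1…j]. In the base cases, when i equals 0 the string s is empty, so d[0, j] = j: j characters must be inserted to turn s into t. When j equals 0 the string t is empty, so d[i, 0] = i: i characters must be deleted to turn s into t.

Now consider the general case, using a dynamic programming idea. To convert s[1..i] into t[1..j] with the fewest insertions, deletions, or substitutions, we must first have reached, also with the fewest operations, a state from which at most one more operation completes the conversion. That earlier state is one of the following three cases:

1) We can convert s[1…i] into t[1…j-1] in k operations.

2) We can convert s[1..i-1] into t[1..j] in k operations.

3) We can convert s[1…i-1] into t[1…j-1] in k operations.

In case 1, we only need to append t[j] at the end to complete the match, so k+1 operations are needed in total.

In case 2, we only need to delete s[i] and then perform those k operations, so k+1 operations are needed in total.

In case 3, we only need to replace s[i] with t[j] so that s[1..i] == t[1..j], which also takes k+1 operations in total. If s[i] already equals t[j], no replacement is needed and k operations are enough.

Finally, to guarantee that the operation count is always minimal, we take the cheapest of the three cases as the minimum number of operations needed to convert s[1..i] into t[1..j].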
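The three cases can be written almost literally as a recursive function. The following is a minimal sketch, not the article's implementation: the class name `EditDistanceMemo` and the memo table are assumptions added here so that each d[i, j] is computed only once. The recursion depth grows with the combined length of the two strings, so it is meant for short inputs.

```java
import java.util.Arrays;

final class EditDistanceMemo {

    static int distance(String s, String t) {
        int[][] memo = new int[s.length() + 1][t.length() + 1];
        for (int[] row : memo) {
            Arrays.fill(row, -1); // -1 marks "not computed yet"
        }
        return solve(s, t, s.length(), t.length(), memo);
    }

    // Minimum number of edits that turn s[1..i] into t[1..j].
    private static int solve(String s, String t, int i, int j, int[][] memo) {
        if (i == 0) return j; // s is empty: insert j characters
        if (j == 0) return i; // t is empty: delete i characters
        if (memo[i][j] >= 0) return memo[i][j];

        int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
        int insert = solve(s, t, i, j - 1, memo) + 1;         // case 1: append t[j]
        int delete = solve(s, t, i - 1, j, memo) + 1;         // case 2: remove s[i]
        int replace = solve(s, t, i - 1, j - 1, memo) + cost; // case 3: replace s[i] if it differs
        memo[i][j] = Math.min(insert, Math.min(delete, replace));
        return memo[i][j];
    }

    public static void main(String[] args) {
        System.out.println(distance("GUMBO", "GAMBOL"));
    }
}
```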

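Implementation steps:

Step	Description
1	Set n to the length of string s ("GUMBO") and m to the length of string t ("GAMBOL"). If n equals 0, return m and exit. If m equals 0, return n and exit. Construct two vectors, v0[m+1] and v1[m+1], each holding elements 0..m.
2	Initialize v0 to 0..m.
3	Examine each character of s (i from 1 to n). For each i, first set v1[0] to i; once steps 4 to 6 have finished for that i, copy v1 into v0 before moving to the next i.
4	Examine each character of t (j from 1 to m).
5	If s[i] equals t[j], the edit cost is 0; if s[i] does not equal t[j], the edit cost is 1.
6	Set cell v1[j] to the minimum of: a. the cell immediately above plus 1: v1[j-1] + 1; b. the cell immediately to the left plus 1: v0[j] + 1; c. the cell diagonally above and to the left plus cost: v0[j-1] + cost.
7	After the iterations (3, 4, 5, 6) are complete, v1[m] holds the edit distance.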
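The steps above use two vectors instead of a full matrix. A minimal Java sketch of that variant follows; the class name `TwoRowLevenshtein` is hypothetical. It sets v1[0] = i at the start of each pass and swaps the two arrays at the end of the pass, which the two-vector scheme needs in order to reuse the storage. Because of the final swap, the result is read from v0[m] here rather than v1[m].

```java
final class TwoRowLevenshtein {

    static int distance(String s, String t) {
        int n = s.length();
        int m = t.length();
        if (n == 0) return m; // step 1: trivial cases
        if (m == 0) return n;

        int[] v0 = new int[m + 1]; // values from the previous pass
        int[] v1 = new int[m + 1]; // values for the current pass
        for (int j = 0; j <= m; j++) {
            v0[j] = j; // step 2
        }

        for (int i = 1; i <= n; i++) {         // step 3
            v1[0] = i;
            for (int j = 1; j <= m; j++) {     // step 4
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1; // step 5
                int above = v1[j - 1] + 1;
                int left = v0[j] + 1;
                int diag = v0[j - 1] + cost;
                v1[j] = Math.min(above, Math.min(left, diag));           // step 6
            }
            int[] tmp = v0; // swap so the pass just computed becomes "previous"
            v0 = v1;
            v1 = tmp;
        }
        return v0[m]; // step 7: after the last swap the final values live in v0
    }

    public static void main(String[] args) {
        System.out.println(distance("GUMBO", "GAMBOL"));
    }
}
```

Compared with the full matrix, this keeps memory proportional to the length of t only, at the price of losing the intermediate values needed to reconstruct the actual edit sequence.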

Step-by-step walkthrough:

This section shows how to compute the Levenshtein distance between the two strings "GUMBO" and "GAMBOL".

Steps 1 and 2

	v0	v1
		G	U	M	B	O
	0	1	2	3	4	5
G	1
A	2
M	3
B	4
O	5
L	6

Once initialization is done, the key is to understand step 6.

Steps 3-6, when i = 1

	v0	v1
		G	U	M	B	O
	0	1	2	3	4	5
G	1	0
A	2	1
M	3	2
B	4	3
O	5	4
L	6	5

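Let us compute the values in v1, taking the cell that contains 0 (row G, column G) as an example.

According to step 5:

If s[i] equals t[j], the edit cost is 0;
if s[i] does not equal t[j], the edit cost is 1.

and

step 6:

Set cell v1[j] to the minimum of:
a. the cell immediately above plus 1: v1[j-1] + 1
b. the cell immediately to the left plus 1: v0[j] + 1
c. the cell diagonally above and to the left plus cost: v0[j-1] + cost

we get:

a: the cell above holds 1; adding 1 gives 2

b: the cell to the left holds 1; adding 1 gives 2

c: the cell diagonally above and to the left holds 0; adding cost gives 0 (cost is the edit cost from step 5; here G equals G, so cost is 0)

The smallest of the three values is 0, so the value of this cell is 0.

The other cells are filled in the same way.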

Steps 3-6, when i = 2

		v0	v1
		G	U	M	B	O
	0	1	2	3	4	5
G	1	0	1
A	2	1	1
M	3	2	2
B	4	3	3
O	5	4	4
L	6	5	5

Steps 3-6, when i = 3

			v0	v1
		G	U	M	B	O
	0	1	2	3	4	5
G	1	0	1	2
A	2	1	1	2
M	3	2	2	1
B	4	3	3	2
O	5	4	4	3
L	6	5	5	4

Steps 3-6, when i = 4

				v0	v1
		G	U	M	B	O
	0	1	2	3	4	5
G	1	0	1	2	3
A	2	1	1	2	3
M	3	2	2	1	2
B	4	3	3	2	1
O	5	4	4	3	2
L	6	5	5	4	3

Steps 3-6, when i = 5

					v0	v1
		G	U	M	B	O
	0	1	2	3	4	5
G	1	0	1	2	3	4
A	2	1	1	2	3	4
M	3	2	2	1	2	3
B	4	3	3	2	1	2
O	5	4	4	3	2	1
L	6	5	5	4	3	2
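To check the tables above by machine, one option is to fill the whole matrix and print it in the same orientation, with s across the top and t down the side. This is only a sketch: the class `MatrixTrace` and its methods `fill` and `render` are names invented for this example, and the tab-separated layout is an assumption modelled on the tables in this walkthrough.

```java
final class MatrixTrace {

    // Fills the (n+1) x (m+1) matrix; d[i][j] is the distance between s[1..i] and t[1..j].
    static int[][] fill(String s, String t) {
        int n = s.length();
        int m = t.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d;
    }

    // Renders the matrix as tab-separated text: one column per prefix of s, one row per prefix of t.
    static String render(String s, String t) {
        int[][] d = fill(s, t);
        StringBuilder out = new StringBuilder("\t");
        for (int i = 0; i < s.length(); i++) {
            out.append('\t').append(s.charAt(i));
        }
        out.append('\n');
        for (int j = 0; j <= t.length(); j++) {
            out.append(j == 0 ? "" : String.valueOf(t.charAt(j - 1)));
            for (int i = 0; i <= s.length(); i++) {
                out.append('\t').append(d[i][j]);
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(render("GUMBO", "GAMBOL"));
    }
}
```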

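Step 7

The edit distance is the value in the bottom-right corner of the matrix, v1[m] == 2. The transformation from "GUMBO" to "GAMBOL" is quite intuitive: replace "U" with "A" and append "L" at the end (a substitution can be viewed as a deletion followed by an insertion, but this algorithm counts it as a single operation).

So the minimum edit distance is 2.

The similarity is then 1 - ld / max(str1.length(), str2.length()):

1 - 2/6 = 0.6666666666666667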
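The same arithmetic can be checked in code. The sketch below is self-contained and uses hypothetical names (`SimilarityDemo`, `distance`, `similarity`). Treating two empty strings as fully similar (1.0) is an assumption made here to avoid dividing zero by zero; the article does not define that case.

```java
final class SimilarityDemo {

    // Plain full-matrix Levenshtein distance, as described in this article.
    static int distance(String s, String t) {
        int n = s.length();
        int m = t.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[n][m];
    }

    // similarity = 1 - distance / length of the longer string
    static double similarity(String s, String t) {
        int maxLen = Math.max(s.length(), t.length());
        if (maxLen == 0) {
            return 1.0; // assumption: two empty strings count as identical
        }
        return 1.0 - (double) distance(s, t) / maxLen;
    }

    public static void main(String[] args) {
        // For "GUMBO" and "GAMBOL" this evaluates 1 - 2/6.
        System.out.println(similarity("GUMBO", "GAMBOL"));
    }
}
```

Dividing by the longer length keeps the result between 0 and 1, since the distance can never exceed the length of the longer string.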

Reference links:

http://www.cnblogs.com/ymind/archive/2012/03/27/fast-memory-efficient-Levenshtein-algorithm.html

http://teiraisan.blog.163.com/blog/static/12278141420098685835372/

Implementations in other languages:

C++

In C++, the size of an array must be a constant, and this code fragment causes an error at compile time:   
  
int sz = 5;   
int arr[sz];   
  
This limitation makes the following C++ code slightly more complicated than it would be if the matrix could simply be declared as a two-dimensional array, with a size determined at run-time.   
  
In C++ it's more idiomatic to use the Standard Template Library's vector class, as Anders Sewerin Johansen has done in an alternative C++ implementation.   
  
Here is the definition of the class (distance.h):   
  
class Distance   
{   
  public:   
    int LD (char const *s, char const *t);   
  private:   
    int Minimum (int a, int b, int c);   
    int *GetCellPointer (int *pOrigin, int col, int row, int nCols);   
    int GetAt (int *pOrigin, int col, int row, int nCols);   
    void PutAt (int *pOrigin, int col, int row, int nCols, int x);   
};    
  
Here is the implementation of the class (distance.cpp):   
  
#include "distance.h"  
#include <string.h>   
#include <stdlib.h>   
  
//****************************   
// Get minimum of three values   
//****************************   
  
int Distance::Minimum (int a, int b, int c)   
{   
int mi;   
  
  mi = a;   
  if (b < mi) {   
    mi = b;   
  }   
  if (c < mi) {   
    mi = c;   
  }   
  return mi;   
  
}   
  
//**************************************************   
// Get a pointer to the specified cell of the matrix   
//**************************************************    
  
int *Distance::GetCellPointer (int *pOrigin, int col, int row, int nCols)   
{   
  return pOrigin + col + (row * (nCols + 1));   
}   
  
//*****************************************************   
// Get the contents of the specified cell in the matrix    
//*****************************************************   
  
int Distance::GetAt (int *pOrigin, int col, int row, int nCols)   
{   
int *pCell;   
  
  pCell = GetCellPointer (pOrigin, col, row, nCols);   
  return *pCell;   
  
}   
  
//*******************************************************   
// Fill the specified cell in the matrix with the value x   
//*******************************************************   
  
void Distance::PutAt (int *pOrigin, int col, int row, int nCols, int x)   
{   
int *pCell;   
  
  pCell = GetCellPointer (pOrigin, col, row, nCols);   
  *pCell = x;   
  
}   
  
//*****************************   
// Compute Levenshtein distance   
//*****************************   
  
int Distance::LD (char const *s, char const *t)   
{   
int *d; // pointer to matrix   
int n; // length of s   
int m; // length of t   
int i; // iterates through s   
int j; // iterates through t   
char s_i; // ith character of s   
char t_j; // jth character of t   
int cost; // cost   
int result; // result   
int cell; // contents of target cell   
int above; // contents of cell immediately above   
int left; // contents of cell immediately to left   
int diag; // contents of cell immediately above and to left   
int sz; // number of cells in matrix   
  
  // Step 1    
  
  n = strlen (s);   
  m = strlen (t);   
  if (n == 0) {   
    return m;   
  }   
  if (m == 0) {   
    return n;   
  }   
  sz = (n+1) * (m+1) * sizeof (int);   
  d = (int *) malloc (sz);   
  
  // Step 2   
  
  for (i = 0; i <= n; i++) {   
    PutAt (d, i, 0, n, i);   
  }   
  
  for (j = 0; j <= m; j++) {   
    PutAt (d, 0, j, n, j);   
  }   
  
  // Step 3   
  
  for (i = 1; i <= n; i++) {   
  
    s_i = s[i-1];   
  
    // Step 4   
  
    for (j = 1; j <= m; j++) {   
  
      t_j = t[j-1];   
  
      // Step 5   
  
      if (s_i == t_j) {   
        cost = 0;   
      }   
      else {   
        cost = 1;   
      }   
  
      // Step 6    
  
      above = GetAt (d,i-1,j, n);   
      left = GetAt (d,i, j-1, n);   
      diag = GetAt (d, i-1,j-1, n);   
      cell = Minimum (above + 1, left + 1, diag + cost);   
      PutAt (d, i, j, n, cell);   
    }   
  }   
  
  // Step 7   
  
  result = GetAt (d, n, m, n);   
  free (d);   
  return result;   
       
}

Visual Basic

'*******************************   
'*** Get minimum of three values   
'*******************************   
  
Private Function Minimum(ByVal a As Integer, _   
                         ByVal b As Integer, _   
                         ByVal c As Integer) As Integer   
Dim mi As Integer   
                             
  mi = a   
  If b < mi Then   
    mi = b   
  End If   
  If c < mi Then   
    mi = c   
  End If   
     
  Minimum = mi   
                             
End Function   
  
'********************************   
'*** Compute Levenshtein Distance   
'********************************   
  
Public Function LD(ByVal s As String, ByVal t As String) As Integer   
Dim d() As Integer ' matrix   
Dim m As Integer ' length of t   
Dim n As Integer ' length of s   
Dim i As Integer ' iterates through s   
Dim j As Integer ' iterates through t   
Dim s_i As String ' ith character of s   
Dim t_j As String ' jth character of t   
Dim cost As Integer ' cost   
     
  ' Step 1  
     
  n = Len(s)   
  m = Len(t)   
  If n = 0 Then   
    LD = m   
    Exit Function   
  End If    
  If m = 0 Then   
    LD = n   
    Exit Function   
  End If    
  ReDim d(0 To n, 0 To m) As Integer   
     
  ' Step 2  
     
  For i = 0 To n   
    d(i, 0) = i   
  Next i   
     
  For j = 0 To m   
    d(0, j) = j   
  Next j   
  
  ' Step 3  
  
  For i = 1 To n   
       
    s_i = Mid$(s, i, 1)   
       
    ' Step 4  
       
    For j = 1 To m   
         
      t_j = Mid$(t, j, 1)   
         
      ' Step 5  
         
      If s_i = t_j Then   
        cost = 0  
      Else   
        cost = 1  
      End If   
         
      ' Step 6  
         
      d(i, j) = Minimum(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)   
       
    Next j   
       
  Next i   
     
  ' Step 7  
     
  LD = d(n, m)   
  Erase d   
  
End Function

Python code

#!/usr/bin/env python
# -*- coding: utf-8 -*-

class arithmetic():

    def __init__(self):
        pass

    def levenshtein(self, first, second):
        '''Edit distance (Levenshtein distance), a string similarity measure.'''
        if len(first) > len(second):
            first, second = second, first
        if len(first) == 0:
            return len(second)
        if len(second) == 0:
            return len(first)
        first_length = len(first) + 1
        second_length = len(second) + 1
        # Each row starts as 0..len(second); list() makes the rows mutable on Python 3.
        distance_matrix = [list(range(second_length)) for x in range(first_length)]
        # The first column must hold 0..len(first); leaving it at 0 gives wrong distances.
        for i in range(first_length):
            distance_matrix[i][0] = i
        for i in range(1, first_length):
            for j in range(1, second_length):
                deletion = distance_matrix[i-1][j] + 1
                insertion = distance_matrix[i][j-1] + 1
                substitution = distance_matrix[i-1][j-1]
                if first[i-1] != second[j-1]:
                    substitution += 1
                distance_matrix[i][j] = min(insertion, deletion, substitution)
        print(distance_matrix)
        return distance_matrix[first_length-1][second_length-1]

if __name__ == "__main__":
    arith = arithmetic()
    print(arith.levenshtein('GUMBOsdafsadfdsafsafsadfasfadsfasdfasdfs','GAMBOL00000000000dfasfasfdafsafasfasdfdsa'))
