C#: SimHash Similarity Matching Algorithm

Posted by 曾祥展 on 2013-09-09

Use case: Google's SimHash algorithm

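 // In extensive testing, SimHash works well for comparing larger texts, for example 500 characters or more. Pairs with a Hamming distance below 3 are almost always similar, and the false-positive rate is fairly low.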

 // In my experience, if N is the size of each chunk and M is the number of overlapping characters, N = 4 and M = 3 is the best choice.
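
To make the N = 4, M = 3 choice concrete, here is a minimal sketch of overlapping chunking. The class name `ChunkingSketch` and the method `Chunk` are hypothetical names used only for illustration; they are not part of the code in this post. The sketch assumes that every full-width chunk, including the last one, should be emitted.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
public static class ChunkingSketch
{
    // Splits input into chunks of n characters, where consecutive chunks share m characters.
    public static List<string> Chunk(string input, int n, int m)
    {
        if (n <= m)
        {
            throw new ArgumentException("n must be greater than m");
        }

        var chunks = new List<string>();
        int step = n - m; // with n = 4 and m = 3 the window advances one character at a time
        for (int pos = 0; pos + n <= input.Length; pos += step)
        {
            chunks.Add(input.Substring(pos, n));
        }
        return chunks;
    }
}
```

Calling `ChunkingSketch.Chunk("simhash", 4, 3)` yields the four chunks `simh`, `imha`, `mhas`, `hash`. Because neighbouring chunks differ by a single character, a small edit to the text changes only a few chunks.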

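    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Computes a 32-bit SimHash fingerprint per string and scores similarity by Hamming distance.
    public class SimHashAnalyser : IAnalyser
    {
        // Number of bits in the fingerprint (the width of an int).
        private const int HashSize = 32;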

        public float GetLikenessValue(string needle, string haystack)
        {
            var needleSimHash = this.DoCalculateSimHash(needle);
            var hayStackSimHash = this.DoCalculateSimHash(haystack);
            return (HashSize - GetHammingDistance(needleSimHash, hayStackSimHash)) / (float)HashSize;
        }

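        // Hashes each token to a 32-bit value. Note: string.GetHashCode() is not guaranteed
        // to be stable across processes or .NET versions, so compare fingerprints only
        // within the same run, or substitute a deterministic hash.
        private static IEnumerable<int> DoHashTokens(IEnumerable<string> tokens)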
        {
            var hashedTokens = new List<int>();
            foreach (string token in tokens)
            {
                hashedTokens.Add(token.GetHashCode());
            }
            return hashedTokens;
        }

        private static int GetHammingDistance(int firstValue, int secondValue)
        {
            var hammingBits = firstValue ^ secondValue;
            var hammingValue = 0;
            for (int i = 0; i < HashSize; i++)
            {
                if (IsBitSet(hammingBits, i))
                {
                    hammingValue += 1;
                }
            }
            return hammingValue;
        }

        private static bool IsBitSet(int b, int pos)
        {
            return (b & (1 << pos)) != 0;
        }

        private int DoCalculateSimHash(string input)
        {
            ITokeniser tokeniser = new OverlappingStringTokeniser(4, 3);
            var hashedtokens = DoHashTokens(tokeniser.Tokenise(input));
            // One vote counter per fingerprint bit; new int[] is zero-initialised in C#.
            var vector = new int[HashSize];

            foreach (var value in hashedtokens)
            {
                for (var j = 0; j < HashSize; j++)
                {
                    if (IsBitSet(value, j))
                    {
                        vector[j] += 1;
                    }
                    else
                    {
                        vector[j] -= 1;
                    }
                }
            }

            var fingerprint = 0;
            for (var i = 0; i < HashSize; i++)
            {
                if (vector[i] > 0)
                {
                    fingerprint |= 1 << i; // set bit i when the positive votes win
                }
            }
            return fingerprint;
        }


    }



    public interface IAnalyser
    {
        float GetLikenessValue(string needle, string haystack);
    }

    public interface ITokeniser
    {
        IEnumerable<string> Tokenise(string input);
    }

    public class FixedSizeStringTokeniser : ITokeniser
    {
        private readonly ushort tokensize = 5;
        public FixedSizeStringTokeniser(ushort tokenSize)
        {
            if (tokenSize < 2 || tokenSize > 127)
            {
                throw new ArgumentException("Token size is out of range");
            }
            this.tokensize = tokenSize;
        }

        public IEnumerable<string> Tokenise(string input)
        {
            var chunks = new List<string>();
            int offset = 0;
            while (offset < input.Length)
            {
                // Take up to tokensize characters; the last chunk may be shorter.
                chunks.Add(input.Substring(offset, Math.Min(this.tokensize, input.Length - offset)));
                offset += this.tokensize;
            }
            return chunks;
        }

    }


    public class OverlappingStringTokeniser : ITokeniser
    {
              
        private readonly ushort chunkSize = 4;
        private readonly ushort overlapSize = 3;

        public OverlappingStringTokeniser(ushort chunkSize, ushort overlapSize)
        {
            if (chunkSize <= overlapSize)
            {
                throw new ArgumentException("Chunk size must be greater than overlap size");
            }
            this.overlapSize = overlapSize;
            this.chunkSize = chunkSize;
        }

        public IEnumerable<string> Tokenise(string input)
        {
            var result = new List<string>();
            int position = 0;
            // Use <= so the final full-width chunk is not dropped.
            while (position <= input.Length - this.chunkSize)
            {
                result.Add(input.Substring(position, this.chunkSize));
                position += this.chunkSize - this.overlapSize;
            }
            return result;
        }


    }
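
The core of `DoCalculateSimHash` is a per-bit majority vote over the token hashes. The following is a small sketch of that step in isolation, with the bit width passed as a parameter so it can be traced by hand. The names `VoteSketch` and `Fingerprint` are hypothetical and exist only for this illustration.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class VoteSketch
{
    // Builds a fingerprint of the given width from token hashes by per-bit majority vote.
    public static int Fingerprint(int[] tokenHashes, int bits)
    {
        var votes = new int[bits];
        foreach (int hash in tokenHashes)
        {
            for (int j = 0; j < bits; j++)
            {
                // A set bit votes +1 for that position, a clear bit votes -1.
                votes[j] += ((hash >> j) & 1) == 1 ? 1 : -1;
            }
        }

        int fingerprint = 0;
        for (int j = 0; j < bits; j++)
        {
            if (votes[j] > 0)
            {
                fingerprint |= 1 << j;
            }
        }
        return fingerprint;
    }
}
```

For the three 4-bit hashes 11 (1011), 9 (1001) and 3 (0011), the vote totals for bits 0 to 3 are +3, +1, -3 and +1, so `VoteSketch.Fingerprint(new[] { 11, 9, 3 }, 4)` returns 11 (1011). Changing one token shifts each total by at most 2, which is why similar inputs tend to produce fingerprints that differ in only a few bits.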

Usage:

    const string HayStack = "中國香港………………";
    const string Needle = "中國香港 2013………………";

    IAnalyser analyser = new SimHashAnalyser();
    var likeness = analyser.GetLikenessValue(Needle, HayStack);

    Console.Clear();
    Console.WriteLine("Likeness: {0}%", likeness * 100);
    Console.ReadKey();
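
The likeness value is simply the share of fingerprint bits on which the two strings agree. As a rough sketch of how a Hamming distance relates to that percentage, assuming the same 32-bit fingerprint width, the snippet below computes both and applies a distance threshold. `LikenessSketch`, `IsNearDuplicate` and the default `maxDistance = 2` are hypothetical choices for illustration; the right threshold depends on your data.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class LikenessSketch
{
    public const int HashSize = 32;

    // Counts the bits that differ between two fingerprints.
    public static int HammingDistance(int a, int b)
    {
        uint diff = unchecked((uint)(a ^ b));
        int count = 0;
        while (diff != 0)
        {
            diff &= diff - 1; // clears the lowest set bit
            count++;
        }
        return count;
    }

    // Fraction of bits on which the two fingerprints agree, from 0 to 1.
    public static float Likeness(int a, int b)
    {
        return (HashSize - HammingDistance(a, b)) / (float)HashSize;
    }

    // Assumed threshold: treat fingerprints within maxDistance bits as near-duplicates.
    public static bool IsNearDuplicate(int a, int b, int maxDistance = 2)
    {
        return HammingDistance(a, b) <= maxDistance;
    }
}
```

For example, fingerprints 11 and 8 differ in two bits, so `LikenessSketch.Likeness(11, 8)` is 30 / 32 = 0.9375, which the usage code above would print as 93.75%.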

SimHash for c#
