[Study Notes] String Matching

Brute-Force (BF) Algorithm

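The brute-force (BF) algorithm is the most basic pattern-matching algorithm. Pattern matching means locating a pattern string $P$ within a main string $T$.
The idea of BF is to compare the first character of $P$ with the first character of $T$. If they are equal, compare the second character of $P$ with the second character of $T$, and so on. On a mismatch, shift $P$ one position to the right and start over, comparing the first character of $P$ with the second character of $T$. Repeat until $P$ is fully matched or $T$ is exhausted.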
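The sliding procedure described above can be sketched as a loop over every alignment of the pattern. This is a minimal illustration, not the implementation used later in these notes; the class name `NaiveSearchSketch` and the sample strings are made up for the example.

```java
final class NaiveSearchSketch {
    /** Returns the first index at which pattern occurs in text, or -1 if it never does. */
    static int indexOf(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        // Try every alignment of the pattern against the text, left to right.
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            // Compare character by character until a mismatch or a full match.
            while (j < m && text.charAt(start + j) == pattern.charAt(j)) {
                j++;
            }
            if (j == m) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("ABCABCDABD", "ABCDABD")); // 3
        System.out.println(indexOf("ABCABC", "ABD"));         // -1
    }
}
```

In the worst case each of the roughly $|T|$ alignments costs up to $|P|$ comparisons, which is the inefficiency the other algorithms try to avoid.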

Rabin-Karp (RK) Algorithm

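The RK algorithm introduces hashing. If two strings have different hash values, they are definitely different; if their hash values are equal, the strings are not necessarily equal, because hash collisions are possible.
The idea of RK is to compare the hash of the pattern $P$ (of length $k$) with the hash of every length-$k$ substring of the main string $T$, and to run a character-by-character comparison only on the substrings whose hash equals that of $P$.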
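RK is usually paired with a rolling hash, so that the hash of the next window is derived from the previous one in constant time instead of being recomputed. The sketch below shows one way this could look. The class name `RollingHashSketch` and the constants `BASE` and `MOD` are assumptions chosen for illustration, not values taken from these notes.

```java
final class RollingHashSketch {
    // Illustrative constants (assumptions): a polynomial base and a prime modulus.
    static final int BASE = 256;
    static final int MOD = 1_000_003;

    /** Returns the first index at which pattern occurs in text, or -1. */
    static int indexOf(String text, String pattern) {
        int n = text.length(), k = pattern.length();
        if (k == 0 || n < k) {
            return -1;
        }
        // high = BASE^(k-1) mod MOD, the weight of the first character in a window.
        long high = 1;
        for (int i = 1; i < k; i++) {
            high = high * BASE % MOD;
        }
        // Hash the pattern and the first window of the text.
        long hp = 0, ht = 0;
        for (int i = 0; i < k; i++) {
            hp = (hp * BASE + pattern.charAt(i)) % MOD;
            ht = (ht * BASE + text.charAt(i)) % MOD;
        }
        for (int s = 0; ; s++) {
            // Equal hashes only suggest a match, so verify the characters.
            if (hp == ht && text.regionMatches(s, pattern, 0, k)) {
                return s;
            }
            if (s + k >= n) {
                return -1;
            }
            // Slide the window: drop text[s], then append text[s + k].
            ht = (ht - text.charAt(s) * high % MOD + MOD) % MOD;
            ht = (ht * BASE + text.charAt(s + k)) % MOD;
        }
    }

    public static void main(String[] args) {
        System.out.println(indexOf("BBC ABCDAB ABCDABCDABDE", "ABCDABD")); // 15
    }
}
```

The explicit verification step is what keeps the result correct when two different substrings happen to share a hash value.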

Knuth-Morris-Pratt (KMP) Algorithm

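KMP resembles BF, but when a character mismatches, the pattern $P$ does not go back to its beginning and the main string $T$ never backtracks. Instead, the position in $T$ stays where it is and $P$ jumps to the position $next[j] = n$ stored in the next array, which avoids re-comparing the first $n$ characters of $P$.
Take the pattern “ABCDABD”. For each of its prefixes, list the proper prefixes and proper suffixes and take the length of the longest string the two lists share: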

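“A”: there are no proper prefixes or suffixes, so the shared length is 0;
“AB”: prefixes [A], suffixes [B], shared length 0;
“ABC”: prefixes [A, AB], suffixes [BC, C], shared length 0;
“ABCD”: prefixes [A, AB, ABC], suffixes [BCD, CD, D], shared length 0;
“ABCDA”: prefixes [A, AB, ABC, ABCD], suffixes [BCDA, CDA, DA, A], shared length 1 (“A”);
“ABCDAB”: prefixes [A, AB, ABC, ABCD, ABCDA], suffixes [BCDAB, CDAB, DAB, AB, B], shared length 2 (“AB”);
“ABCDABD”: prefixes [A, AB, ABC, ABCD, ABCDA, ABCDAB], suffixes [BCDABD, CDABD, DABD, ABD, BD, D], shared length 0.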

This gives the following partial match values:

Character:            A  B  C  D  A  B  D
Partial match value:  0  0  0  0  1  2  0
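The table can be reproduced directly from the prefix/suffix definition. The sketch below does exactly that, checking candidate lengths from longest to shortest. It is deliberately naive (cubic time in the worst case) and the class and method names are hypothetical; the `kmpNext` method in the code section computes the same values more efficiently.

```java
final class PartialMatchSketch {
    /** Length of the longest proper prefix of s that is also a suffix of s. */
    static int sharedLength(String s) {
        for (int len = s.length() - 1; len > 0; len--) {
            // The suffix of length len must equal the prefix of length len.
            if (s.startsWith(s.substring(s.length() - len))) {
                return len;
            }
        }
        return 0;
    }

    /** Partial match value for every prefix of the pattern. */
    static int[] partialMatchTable(String pattern) {
        int[] table = new int[pattern.length()];
        for (int i = 0; i < pattern.length(); i++) {
            table[i] = sharedLength(pattern.substring(0, i + 1));
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(partialMatchTable("ABCDABD")));
        // [0, 0, 0, 0, 1, 2, 0]
    }
}
```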
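When the D fails to match the space, the preceding six characters “ABCDAB” have already matched. From the table, the partial match value of the last matched character B is 2. The shift is then given by:
$\qquad\qquad$ shift = number of matched characters - partial match value of the last matched character
Here the shift is 6 - 2 = 4 positions.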
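The shift formula can be turned into a search loop that moves the alignment forward by the computed amount and keeps the already matched prefix. The following is a sketch under stated assumptions: the class name `KmpShiftSketch` is made up, and the text "BBC ABCDAB ABCDABCDABDE" is an illustrative input chosen so that a D meets a space after six matched characters.

```java
final class KmpShiftSketch {
    /** Partial match value for every prefix of p. */
    static int[] partialMatchTable(String p) {
        int[] t = new int[p.length()];
        for (int i = 1, k = 0; i < p.length(); i++) {
            while (k > 0 && p.charAt(i) != p.charAt(k)) {
                k = t[k - 1];
            }
            if (p.charAt(i) == p.charAt(k)) {
                k++;
            }
            t[i] = k;
        }
        return t;
    }

    /** shift = matched characters - partial match value of the last matched character. */
    static int shift(int[] table, int matched) {
        return matched - table[matched - 1];
    }

    static int indexOf(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0 || n < m) {
            return -1;
        }
        int[] table = partialMatchTable(pattern);
        int start = 0;   // current alignment of the pattern within the text
        int matched = 0; // characters already known to match at this alignment
        while (start + m <= n) {
            while (matched < m && text.charAt(start + matched) == pattern.charAt(matched)) {
                matched++;
            }
            if (matched == m) {
                return start;
            }
            if (matched == 0) {
                start++; // nothing matched, so move one position
            } else {
                start += shift(table, matched);
                matched = table[matched - 1]; // this prefix is still aligned, no re-comparison
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(shift(partialMatchTable("ABCDABD"), 6));        // 4
        System.out.println(indexOf("BBC ABCDAB ABCDABCDABDE", "ABCDABD")); // 15
    }
}
```

Because `matched` is carried over after each shift, the position in the text never moves backwards, which is the property the notes describe.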

Boyer-Moore (BM) Algorithm

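BM compares the pattern $P$ against the main string starting from the last character of $P$ and moving backwards. The algorithm defines two rules: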

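Bad character rule: when a character of the main string $T$ fails to match a character of the pattern $P$, that character of $T$ is called the bad character, and $P$ must be shifted to the right. The shift = index in the pattern at which the mismatch occurred - index of the rightmost occurrence of the bad character in the pattern. If the bad character does not occur in the pattern, its rightmost index is taken to be -1.
Good suffix rule: the characters that matched before the mismatch form the good suffix. The shift = position of the good suffix in the pattern - position of its previous occurrence in the pattern, where positions are measured at the last character of the suffix. If the good suffix does not occur again in the pattern, the previous position is taken to be -1.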

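At each mismatch, shift by the larger of the two values. The shift tables for both rules depend only on the pattern, not on the main string, so they can be precomputed before the search starts.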

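In the example below, positions are 0-based and the pattern is taken to be “EXAMPLE”, which is consistent with the positions and suffixes quoted. At this point there is no good suffix, so the bad character rule applies. The bad character is “P”; the pattern character compared against it, “E”, is at index 6 of the pattern, and the rightmost occurrence of “P” in the pattern is at index 4. The shift is therefore 6 - 4 = 2, which aligns the bad character in the main string with its rightmost occurrence in the pattern.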
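The bad character arithmetic is small enough to sketch in a few lines. The class name `BadCharacterSketch` is hypothetical, and the pattern “EXAMPLE” is an assumption inferred from the positions quoted in these notes.

```java
final class BadCharacterSketch {
    /** Shift when the text character `bad` mismatches at pattern index j (0-based). */
    static int shift(String pattern, int j, char bad) {
        // lastIndexOf returns -1 when the character is absent,
        // which matches the rule's convention for a missing bad character.
        return j - pattern.lastIndexOf(bad);
    }

    public static void main(String[] args) {
        System.out.println(shift("EXAMPLE", 6, 'P')); // 2
        System.out.println(shift("EXAMPLE", 2, 'I')); // 3
        System.out.println(shift("EXAMPLE", 6, 'S')); // 7
    }
}
```

Note that this value can be zero or negative when the rightmost occurrence lies to the right of the mismatch (for instance `shift("EXAMPLE", 2, 'L')` is -3), which is one reason the larger of the two rules is taken.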
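When “I” fails to match “A”, the good suffixes are [“MPLE”, “PLE”, “LE”, “E”]. Of these, only “E” (at index 6) occurs again in the pattern, at index 0, so the good suffix rule gives a shift of 6 - 0 = 6. The bad character rule gives 2 - (-1) = 3. The larger shift, 6, is used.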
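The good suffix rule can be sketched naively as follows. This is one possible reading of the rule as stated in these notes: first look for an earlier occurrence of the whole matched suffix, then fall back to the longest shorter suffix that is also a prefix of the pattern. The class name `GoodSuffixSketch` is made up and the pattern “EXAMPLE” is an assumption; the table-based `goodTable` in the code section is the efficient counterpart.

```java
final class GoodSuffixSketch {
    /**
     * Shift when the last suffixLen characters of the pattern matched
     * and the character before them did not (1 <= suffixLen < pattern length).
     */
    static int shift(String pattern, int suffixLen) {
        int m = pattern.length();
        int suffixStart = m - suffixLen;
        String good = pattern.substring(suffixStart);

        // Case 1: the whole good suffix occurs again further to the left.
        int prev = pattern.lastIndexOf(good, suffixStart - 1);
        if (prev >= 0) {
            return suffixStart - prev;
        }
        // Case 2: a shorter good suffix coincides with a prefix of the pattern.
        for (int len = suffixLen - 1; len > 0; len--) {
            if (pattern.startsWith(pattern.substring(m - len))) {
                return m - len;
            }
        }
        // Case 3: no reuse is possible, so move past the whole pattern.
        return m;
    }

    public static void main(String[] args) {
        int goodShift = shift("EXAMPLE", 4);           // good suffix "MPLE"
        int badShift = 2 - "EXAMPLE".lastIndexOf('I'); // bad character rule
        System.out.println(goodShift);                     // 6
        System.out.println(Math.max(goodShift, badShift)); // 6
    }
}
```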

Code Implementation

import java.util.Arrays;

public class StringMatch {
    public static int bruteForce(String target, String pattern) {
        if (target == null || pattern == null)
            return -1;
        char[] st = target.toCharArray();
        char[] sp = pattern.toCharArray();
        int tLen = st.length, pLen = sp.length;
        if (tLen == 0 || pLen == 0 || tLen < pLen)
            return -1;

        int i = 0, j = 0;
        while (i < tLen && j < pLen) {
            if (st[i] == sp[j]) {
                i++;
                j++;
            } else {
                i -= j - 1;
                j = 0;
            }
        }
        if (j == pLen)
            return i - j;
        return -1;
    }

    private static int hash(String str) {
        int prime = 11113; // modulus, usually chosen to be a prime
        int hashCode = 0;
        for (int i = 0; i < str.length(); i++)
            hashCode = ((hashCode << 5) + str.charAt(i)) % prime;
        return hashCode;
    }

    public static int RK(String target, String pattern) {
        if (target == null || pattern == null)
            return -1;
        int tLen = target.length(), pLen = pattern.length();
        if (tLen == 0 || pLen == 0 || tLen < pLen)
            return -1;

        int hashCode = hash(pattern);
        String subStr;
        for (int i = 0; i <= tLen - pLen; i++) {
            // Note: the window hash is recomputed from scratch, costing O(pLen) per window.
            subStr = target.substring(i, i + pLen);
            if (hashCode == hash(subStr) && bruteForce(subStr, pattern) == 0)
                return i;
        }
        return -1;
    }

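    private static int[] badTable(char[] sp, int len) {
        // Shift distance for each of the 256 extended-ASCII character codes;
        // characters with a code of 256 or above are not supported by this table.
        int[] bad_table = new int[256];
        // A character absent from the pattern allows a shift of the full pattern length.
        Arrays.fill(bad_table, len);
        // Distance from the rightmost occurrence (last position excluded) to the pattern's end.
        for (int i = 0; i < len - 1; i++)
            bad_table[sp[i]] = len - 1 - i;
        return bad_table;
    }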

    private static int[] goodTable(char[] sp, int len) {
        // suffix[i] = length of the longest substring ending at i that is also a suffix of the pattern.
        int[] suffix = new int[len];
        suffix[len-1] = len;
        for (int i = len - 2, j = len - 2; i >= 0; i--) {
            j = i;
            while (j >= 0 && sp[j] == sp[len - 1 - i + j])
                j--;
            suffix[i] = i - j;
        }

        // good_table[j] = shift to apply when the mismatch occurs at pattern index j.
        int[] good_table = new int[len];
        Arrays.fill(good_table, len);
        for (int i = len - 1, j = 0; i >= 0; i--) {
            if (suffix[i] == i + 1) {
                for (; j < len - 1 - i; j++)
                    if (good_table[j] == len)
                        good_table[j] = len - 1 - i;
            }
        }
        for (int i = 0; i <= len - 2; i++) {
            good_table[len - 1 - suffix[i]] = len - 1 - i;
        }
        return good_table;
    }

    public static int BM(String target, String pattern) {
        if (target == null || pattern == null)
            return -1;
        int tLen = target.length(), pLen = pattern.length();
        if (tLen == 0 || pLen == 0 || tLen < pLen)
            return -1;

        char[] st = target.toCharArray();
        char[] sp = pattern.toCharArray();
        int[] bad_table = badTable(sp, pLen);
        int[] good_table = goodTable(sp, pLen);

        int j, i = 0;
        while (i <= tLen - pLen) {
            j = pLen - 1;
            while (j >= 0 && st[i+j] == sp[j])
                j--;
            if (j < 0)
                return i;
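            // Shift by the larger of the good-suffix and bad-character distances.
            // The bad-character term is relative to the mismatch index j, not the text offset i.
            i += Math.max(good_table[j], bad_table[st[i+j]] - (pLen - 1 - j));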
        }
        return -1;
    }

    private static int[] kmpNext(String pattern, int len) {
        int[] next = new int[len];
        next[0] = 0; // a single matched character has a partial match value of 0
        for (int i = 1, j = 0; i < len; i++) {
            while (j > 0 && pattern.charAt(i) != pattern.charAt(j))
                j = next[j-1];
            if (pattern.charAt(i) == pattern.charAt(j))
                j++;
            next[i] = j;
        }
        return next;
    }

    public static int KMP(String target, String pattern) {
        if (target == null || pattern == null)
            return -1;
        int tLen = target.length(), pLen = pattern.length();
        if (tLen == 0 || pLen == 0 || tLen < pLen)
            return -1;

        int[] next = kmpNext(pattern, pLen);
        for (int i = 0, j = 0; i < tLen; i++) {
            while (j > 0 && target.charAt(i) != pattern.charAt(j))
                j = next[j-1];
            if (target.charAt(i) == pattern.charAt(j))
                j++;
            if (j == pLen)
                return i - j + 1;
        }
        return -1;
    }
}