From Simple to Complex: Trie Trees in Practice

Posted by 弒曉風 on 2019-03-22

Original URL: https://juejin.im/post/5c93355651882556dc4be6c9

You only get one shot, do not miss your chance to blow.
(Literally: you have only one bullet, so do not miss the chance to bring the house down.)

Changelog

2019-03-24: Trie in practice (1): counting how often a given word occurs in a string

2019-03-25: Trie in practice (2): a sensitive-word filter based on the AC automaton

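Introduction

Learning should not stop at getting an implementation to work. What matters more is learning to think for yourself and to generalize from one case to others. What you are really learning is the idea, and how to turn it into something of your own.

A trie is also known as a "dictionary tree". Keyword suggestion is something we use every day: usually you only need to type a prefix and matching suggestions appear. So how is that actually implemented? This article walks through a simple trie-based search suggestion feature and some common applications of tries. All source code has been uploaded to GitHub: link

PS: In essence, a trie exploits the common prefixes shared between strings and merges the repeated prefixes into a single path.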
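
To make the prefix sharing concrete, here is a minimal sketch that inserts the six example words used later in this article and compares the number of characters with the number of trie nodes actually created. The class and method names (PrefixSharingSketch, countNodes, totalChars) are made up for this illustration and are not part of the article's source code.

```java
import java.util.HashMap;
import java.util.Map;

class PrefixSharingSketch {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
    }

    /** Inserts every word and returns how many nodes were created (root excluded). */
    static int countNodes(String... words) {
        Node root = new Node();
        int created = 0;
        for (String word : words) {
            Node p = root;
            for (char c : word.toCharArray()) {
                Node next = p.children.get(c);
                if (next == null) {
                    // Only a character that is not already on the path costs a new node
                    next = new Node();
                    p.children.put(c, next);
                    created++;
                }
                p = next;
            }
        }
        return created;
    }

    /** Total number of characters over all words, i.e. the cost without sharing. */
    static int totalChars(String... words) {
        int sum = 0;
        for (String w : words) sum += w.length();
        return sum;
    }

    public static void main(String[] args) {
        String[] words = {"how", "hi", "her", "hello", "so", "see"};
        System.out.println("characters: " + totalChars(words)); // 18
        System.out.println("trie nodes: " + countNodes(words)); // 13
    }
}
```

The five characters saved are the shared prefixes: h is stored once for four words, he once for two words, and s once for two words.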

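Simulating search keyword suggestions

This implementation could also be adapted to store user habits (what users actually type) in a trie.

The example words are how, hi, her, hello, so and see.

Declaring a trie class

I took a small shortcut here and made the node an inner class.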

    public class TrieNode {
        /**
         * The character stored in this node
         */
        public char data;
        /**
         * Child nodes, one slot per letter 'a'..'z'
         */
        TrieNode[] children;
        /**
         * Marks the end of a complete word
         */
        boolean isEndingChar;

        TrieNode(char data) {
            children = new TrieNode[26];
            isEndingChar = false;
            this.data = data;
        }
    }

Initialization

The root node usually stores no information; it only acts as a placeholder.

    /**
     * Root node
     */
    private TrieNode root;

    /**
     * Number of preset words
     */
    private int count;

    /**
     * List of suggested words
     */
    private List<String> list;

    /**
     * The input value (the prefix typed by the user)
     */
    private String pattern;

    /**
     * The root stores a meaningless placeholder character
     */
    private TrieTree() {
        root = new TrieNode('/');
        count = 0;
        list = new ArrayList<>();
    }

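Insertion

The child slot is derived from the character's ASCII code (c - 'a'), so a node needs only a fixed 26-slot array and no per-character keys, which is relatively light on memory. This assumes the input consists of lowercase a-z only.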

    private void insert(char[] txt) {
        TrieNode p = root;
        for (char c : txt) {
            // ASCII code of the current character minus the ASCII code of 'a'
            int index = c - 'a';
            if (null == p.children[index]) {
                TrieNode node = new TrieNode(c);
                p.children[index] = node;
            }
            p = p.children[index];
        }
        ++count;
        p.isEndingChar = true;
    }
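
The expression c - 'a' only lands inside the 26-slot array for lowercase a-z; any other character yields an index outside 0..25 and the array access would throw. As a hedged sketch, a caller could validate words before inserting them. The helper names childIndex and isInsertable are hypothetical and do not appear in the article's code.

```java
class ChildIndexSketch {
    /** Maps 'a'..'z' to 0..25 and returns -1 for any other character. */
    static int childIndex(char c) {
        int index = c - 'a';
        return (index >= 0 && index < 26) ? index : -1;
    }

    /** True if every character of the word fits the 26-slot children array. */
    static boolean isInsertable(String word) {
        for (char c : word.toCharArray()) {
            if (childIndex(c) < 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(childIndex('a'));       // 0
        System.out.println(childIndex('z'));       // 25
        System.out.println(childIndex('H'));       // -1, since 'H' - 'a' is -25
        System.out.println(isInsertable("hello")); // true
        System.out.println(isInsertable("Hello")); // false
    }
}
```

Lowercasing the input first, or switching the children to a map as the later sections do, are two other ways around the same limitation.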

Lookup

    private boolean contains(String pattern) {
        char[] patChars = pattern.toCharArray();
        TrieNode p = root;
        for (char patChar : patChars) {
            int index = patChar - 'a';
            // No child for this character: the word is not in the trie
            if (null == p.children[index])
                return false;
            p = p.children[index];
        }
        // A full match requires the last node to mark the end of a word
        return p.isEndingChar;
    }

Fuzzy (prefix) suggestion matching

    private void match() {
        char[] patChars = pattern.toCharArray();
        TrieNode p = root;
        for (char patChar : patChars) {
            int index = patChar - 'a';
            if (null == p.children[index])
                return;
            p = p.children[index];
        }
        // Walk the subtree under p and add every matching word to list
        traversal(p, "");
    }

Recursive node traversal

    private void traversal(TrieNode trieNode, String str) {
        if (null != trieNode) {
            // str starts with the last character of pattern (or '/' for an empty pattern)
            str += trieNode.data;
            if (trieNode.isEndingChar) {
                // Drop that duplicated first character before prepending the pattern
                String curStr = pattern + str.substring(1);
                if (!list.contains(curStr))
                    list.add(curStr);
                // No return here: longer words may continue below this node
            }
            for (int i = 0; i < trieNode.children.length; i++) {
                traversal(trieNode.children[i], str);
            }
        }
    }
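
As a self-contained sketch of the same idea, the class below collects every stored word under a prefix, including words that are themselves prefixes of longer words. It is not the article's TrieTree: the name SuggestSketch, the use of a TreeMap for sorted children, and the StringBuilder-based backtracking are all choices made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SuggestSketch {
    static class Node {
        TreeMap<Character, Node> children = new TreeMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node p = root;
        for (char c : word.toCharArray()) {
            p = p.children.computeIfAbsent(c, k -> new Node());
        }
        p.isEnd = true;
    }

    /** Returns every stored word that starts with prefix, in lexicographic order. */
    List<String> suggest(String prefix) {
        List<String> out = new ArrayList<>();
        Node p = root;
        for (char c : prefix.toCharArray()) {
            p = p.children.get(c);
            if (p == null) return out; // nothing stored under this prefix
        }
        collect(p, new StringBuilder(prefix), out);
        return out;
    }

    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isEnd) out.add(path.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey().charValue());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1); // backtrack before the next sibling
        }
    }

    public static void main(String[] args) {
        SuggestSketch trie = new SuggestSketch();
        for (String w : new String[]{"how", "hi", "he", "her", "hello", "so", "see"}) {
            trie.insert(w);
        }
        System.out.println(trie.suggest("he")); // [he, hello, her]
        System.out.println(trie.suggest("x"));  // []
    }
}
```

Carrying the full path in one StringBuilder avoids having to reconstruct the word from the pattern and a partial suffix afterwards.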

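Test code

Build a trie by hand.

PS: storing one character per node makes the tree tall. For example, the chain l, l, o could be merged into a single node llo, that is, chains of single-child nodes could be compressed. This is not implemented here for now.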

    private void initTries() {
//        how, hi, her, hello, so, see
//                   /
//              h         s
//           e  i  o    o   e
//         l  r     w         e
//       l
//     o
        char[] how = "how".toCharArray();
        insert(how);
        char[] hi = "hi".toCharArray();
        insert(hi);
        char[] her = "her".toCharArray();
        insert(her);
        char[] hello = "hello".toCharArray();
        insert(hello);
        char[] so = "so".toCharArray();
        insert(so);
        char[] see = "see".toCharArray();
        insert(see);
    }
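
The compression mentioned in the note before this method can be illustrated without changing the trie itself. The sketch below builds an ordinary trie for the same six words and then lists the edge labels it would have if every chain of single-child, non-terminal nodes were merged. CompressedEdgesSketch and its methods are hypothetical names, and this only computes the labels; it is not a full compressed-trie implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CompressedEdgesSketch {
    static class Node {
        TreeMap<Character, Node> children = new TreeMap<>();
        boolean isEnd;
    }

    static Node build(String... words) {
        Node root = new Node();
        for (String w : words) {
            Node p = root;
            for (char c : w.toCharArray()) {
                p = p.children.computeIfAbsent(c, k -> new Node());
            }
            p.isEnd = true;
        }
        return root;
    }

    /** Collects, in depth-first order, the edge labels after merging single-child chains. */
    static void compressedEdges(Node node, List<String> labels) {
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            StringBuilder label = new StringBuilder().append(e.getKey().charValue());
            Node child = e.getValue();
            // Absorb the next node while the chain does not branch
            // and no word ends in the middle of it.
            while (!child.isEnd && child.children.size() == 1) {
                Map.Entry<Character, Node> only = child.children.firstEntry();
                label.append(only.getKey().charValue());
                child = only.getValue();
            }
            labels.add(label.toString());
            compressedEdges(child, labels);
        }
    }

    public static void main(String[] args) {
        Node root = build("how", "hi", "her", "hello", "so", "see");
        List<String> labels = new ArrayList<>();
        compressedEdges(root, labels);
        System.out.println(labels); // [h, e, llo, r, i, ow, s, ee, o]
    }
}
```

For this word list the 13 single-character nodes collapse into 9 labelled edges, with l, l, o becoming the single label llo.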

Test code (main method)

    public static void main(String[] args) {
        TrieTree trieTree = new TrieTree();
        trieTree.initTries();
        String str = "hello";
        boolean res = trieTree.contains(str);
        System.out.println("Does the trie contain " + str + "? Result: " + res);

        trieTree.pattern = "h";
        trieTree.match();
        System.out.println("Single-character prefix match " + trieTree.pattern + ":");
        // printAll() is not shown in this article; it is assumed to print the entries of list
        trieTree.printAll();

        trieTree.list.clear();
        trieTree.pattern = "he";
        trieTree.match();
        System.out.println("Multi-character prefix match " + trieTree.pattern + ":");
        trieTree.printAll();
    }

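Test results

Counting how often a given word occurs in a string

The 26 lowercase letters are still the working assumption. Tries are popular with search engines precisely because lookups are fast. As long as you have the space (and they really are memory-hungry), you can do whatever you like (fast).

Thoughts

The main point here is to share a way of thinking: how to take existing code and adapt it to meet a new requirement. You do not always need to reinvent the wheel, but at the critical moment you need to know how to use one.

Adapting the TrieNode class

A frequency field is added here in order to count high-frequency words. children is also changed from an array to a map, which is more convenient to store and, relatively speaking, saves space because only the children that actually exist are kept.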

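    private class TrieNode {
        /**
         * The character stored in this node
         */
        public char data;
        /**
         * How often this node was visited during insertion
         */
        int frequency;

        /**
         * Marks the end of a complete word
         */
        boolean isEndingChar;
        /**
         * Child nodes, keyed by character
         */
        Map<Character, TrieNode> children;

        TrieNode(char data) {
            this.data = data;
            children = new HashMap<>();
            isEndingChar = false;
        }
    }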
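
To put rough numbers on the array-versus-map trade-off, the sketch below builds a map-based trie for the sentence used in this section's test and compares logical child slots: 26 per node for the array layout against one map entry per existing child. ChildStorageSketch is a hypothetical name, and the comparison counts slots, not bytes; a HashMap has its own per-entry overhead, so the real memory difference depends on the JVM and on how sparse the nodes are.

```java
import java.util.HashMap;
import java.util.Map;

class ChildStorageSketch {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
    }

    static void insert(Node root, String word) {
        Node p = root;
        for (char c : word.toCharArray()) {
            p = p.children.computeIfAbsent(c, k -> new Node());
        }
    }

    /** Counts all nodes in the subtree, including the given node. */
    static int countNodes(Node node) {
        int n = 1;
        for (Node child : node.children.values()) n += countNodes(child);
        return n;
    }

    public static void main(String[] args) {
        Node root = new Node();
        for (String w : "he her hello home so see say just so so hello world".split(" ")) {
            insert(root, w);
        }
        int nodes = countNodes(root);
        System.out.println("nodes (with root): " + nodes);        // 25
        System.out.println("array slots:       " + nodes * 26);   // 650
        // Every non-root node is exactly one entry in its parent's map
        System.out.println("map entries:       " + (nodes - 1));  // 24
    }
}
```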

Initialization

    /**
     * Root node
     */
    private TrieNode root;
    /**
     * Number of inserted words
     */
    private int count;

    /**
     * No-argument constructor
     */
    private TrieTreeAlgo() {
        root = new TrieNode('/');
        count = 0;
    }

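Adapting the insert method

frequency is the counter: it records how often this character (node) is reached during insertion.
isEndingChar works as before: it marks whether the node is the end of a word.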

    private void insert(String txt) {
        TrieNode p = root;
        char[] txtChar = txt.toCharArray();
        for (Character c : txtChar) {
            // Create the child only if this character is not on the path yet
            if (!p.children.containsKey(c)) {
                TrieNode trieNode = new TrieNode(c);
                p.children.put(c, trieNode);
            }
            p = p.children.get(c);
            // One more inserted word passes through this node
            ++p.frequency;
        }
        ++count;
        p.isEndingChar = true;
    }

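Counting method

Add a counting method that computes how often a word occurs. frequency on a node counts every inserted word that passes through it, so when isEndingChar == true the number of words that end exactly at this node is the node's frequency minus the sum of its children's frequency values. If the pattern cannot be followed to the end, or does not end on a word, the method returns -1.

    private int frequency(String pattern) {
        char[] patChars = pattern.toCharArray();
        TrieNode p = root;
        for (char patChar : patChars) {
            p = p.children.get(patChar);
            // The path breaks off: the word was never inserted
            if (null == p) return -1;
        }
        if (!p.isEndingChar) return -1;
        // Words passing through p, minus the ones that continue into a child
        int continued = 0;
        for (TrieNode child : p.children.values()) {
            continued += child.frequency;
        }
        return p.frequency - continued;
    }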
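
An alternative design, which is not the one used in this article, is to keep a dedicated counter on the node where a word ends instead of deriving the count from pass-through frequencies. The sketch below assumes a hypothetical field wordCount and a hypothetical class WordCountTrieSketch; it returns 0 rather than -1 for unknown words.

```java
import java.util.HashMap;
import java.util.Map;

class WordCountTrieSketch {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        int wordCount; // hypothetical field: number of inserted words ending exactly here
    }

    private final Node root = new Node();

    void insert(String word) {
        Node p = root;
        for (char c : word.toCharArray()) {
            p = p.children.computeIfAbsent(c, k -> new Node());
        }
        p.wordCount++;
    }

    /** Returns how many times word was inserted, or 0 if it never was. */
    int frequency(String word) {
        Node p = root;
        for (char c : word.toCharArray()) {
            p = p.children.get(c);
            if (p == null) return 0; // the path breaks off before the word ends
        }
        return p.wordCount;
    }

    public static void main(String[] args) {
        WordCountTrieSketch trie = new WordCountTrieSketch();
        for (String w : "he her hello home so see say just so so hello world".split(" ")) {
            trie.insert(w);
        }
        System.out.println(trie.frequency("so"));    // 3
        System.out.println(trie.frequency("he"));    // 1
        System.out.println(trie.frequency("hello")); // 2
        System.out.println(trie.frequency("hel"));   // 0
    }
}
```

The counter also makes isEndingChar redundant, since a node ends a word exactly when its counter is greater than zero.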

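Test code

Initialize the words to insert into the trie (this could be extended a little, for example by inserting a whole article or the words a user types most often).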

    private void initTries() {
        String txt = "he her hello home so see say just so so hello world";
        // Split on single spaces and insert every word, duplicates included
        String[] strs = txt.split(" ");
        for (String str : strs) {
            insert(str);
        }
    }

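Test code (main method)

so: a high-frequency word
he: an ordinary word that is also contained in other words, such as her and hello
hel: a word that does not exist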

    public static void main(String[] args) {
        TrieTreeAlgo trieTreeAlgo = new TrieTreeAlgo();
        trieTreeAlgo.initTries();
        System.out.println("Total: " + trieTreeAlgo.count + " words.");
        String so = "so";
        int soCount = trieTreeAlgo.frequency(so);
        System.out.println("Occurrences of " + so + ": " + (soCount > 0 ? soCount : 0));
        String he = "he";
        int heCount = trieTreeAlgo.frequency(he);
        System.out.println("Occurrences of " + he + ": " + (heCount > 0 ? heCount : 0));
        String hel = "hel";
        int helCount = trieTreeAlgo.frequency(hel);
        System.out.println("Occurrences of " + hel + ": " + (helCount > 0 ? helCount : 0));
    }

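Test results

A sensitive-word filter based on the AC automaton

Since we now have keyword suggestions, it is natural to also have the counterpart: sensitive-word filtering. As the internet has grown, the manners of its users have become uneven, and some are quick to hurl abuse. Showing that on a website is clearly not good, and that is the problem AC-automaton-based sensitive-word filters were created to address.

PS: here is a little secret: this is a stripped-down, simplified version of a sensitive-word filter.

Thoughts

An AC automaton is essentially a trie with something like KMP's next array added on top (except that here the next array is built over the trie). Some adaptation is still required: a fail pointer is added to each trie node, so that when a match fails the search slides along the tree as far as possible instead of starting over. In plain words, it greatly reduces the number of traversals and improves matching efficiency.

PS: this is a suffix-based string matching algorithm: each fail pointer leads to the longest proper suffix of the current path that is also a path from the root of the trie.

Adaptation

On top of the existing node, a fail pointer is added; the AC automaton's jumps are implemented through this fail pointer.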

    private class AcNode {
        /**
         * The character stored in this node
         */
        public char data;
        /**
         * Child nodes, keyed by character
         */
        Map<Character, AcNode> children;
        /**
         * Marks the end of a complete word
         */
        boolean isEndingChar;
        /**
         * Failure pointer
         */
        AcNode fail;

        AcNode(char data) {
            this.data = data;
            children = new HashMap<>();
            isEndingChar = false;
        }
    }

Initialization

    /**
     * Root node
     */
    private AcNode root;

    private AhoCorasick() {
        root = new AcNode('/');
    }

Insertion

    private void insert(String txt) {
        AcNode p = root;
        char[] txtChar = txt.toCharArray();
        for (Character c : txtChar) {
            // Same as the plain trie insert: create missing children on the way down
            if (!p.children.containsKey(c)) {
                AcNode trieNode = new AcNode(c);
                p.children.put(c, trieNode);
            }
            p = p.children.get(c);
        }
        p.isEndingChar = true;
    }

Building the failure pointers

This method is the key part.

    private void buildFailurePointer() {
        // Breadth-first: a node's fail pointer is set before its children are visited
        Queue<AcNode> queue = new LinkedList<>();
        root.fail = null;
        queue.offer(root);
        while (!queue.isEmpty()) {
            AcNode p = queue.poll();
            for (char c : p.children.keySet()) {
                AcNode pChild = p.children.get(c);
                if (null == pChild) continue;
                if (root == p) {
                    pChild.fail = root;
                } else {
                    AcNode q = p.fail;
                    while (null != q) {
                        // Look for a child with pChild's character (c), not p.data
                        AcNode qChild = q.children.get(c);
                        if (null != qChild) {
                            pChild.fail = qChild;
                            break;
                        }
                        q = q.fail;
                    }
                    // Ran off the top of the fail chain: fall back to root
                    if (null == q) {
                        pChild.fail = root;
                    }
                }
                queue.offer(pChild);
            }
        }
    }
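
To see what the failure pointers look like for the four sensitive words used in the test further down (so, hel, oh, llo), the following sketch builds the trie, computes the fail pointers breadth-first, and records for each node the path of its fail target. FailPointerSketch, the path field and failTable are hypothetical names introduced only for this illustration; an empty string stands for the root.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

class FailPointerSketch {
    static class Node {
        final String path; // characters from the root down to this node
        final TreeMap<Character, Node> children = new TreeMap<>();
        Node fail;
        Node(String path) { this.path = path; }
    }

    static Node build(String... patterns) {
        Node root = new Node("");
        for (String pat : patterns) {
            Node p = root;
            for (char c : pat.toCharArray()) {
                final Node parent = p;
                p = p.children.computeIfAbsent(c, k -> new Node(parent.path + k));
            }
        }
        return root;
    }

    /** Breadth-first construction; returns node path -> path of its fail target. */
    static Map<String, String> failTable(Node root) {
        Map<String, String> table = new LinkedHashMap<>();
        Queue<Node> queue = new ArrayDeque<>();
        queue.offer(root);
        while (!queue.isEmpty()) {
            Node p = queue.poll();
            for (Map.Entry<Character, Node> e : p.children.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node q = p.fail; // null only when p is the root
                while (q != null && !q.children.containsKey(c)) q = q.fail;
                child.fail = (q == null) ? root : q.children.get(c);
                table.put(child.path, child.fail.path);
                queue.offer(child);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, String> table = failTable(build("so", "hel", "oh", "llo"));
        for (Map.Entry<String, String> e : table.entrySet()) {
            String target = e.getValue().isEmpty() ? "(root)" : e.getValue();
            System.out.println(e.getKey() + " -> " + target);
        }
    }
}
```

For example, hel fails to l, because l is the longest proper suffix of hel that starts some pattern (llo), and llo fails to o, which starts oh.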

Matching

    private boolean match(String txt) {
        char[] txtChars = txt.toCharArray();
        AcNode p = root;
        for (char c : txtChars) {
            // Follow fail pointers until some node can consume c, or we reach root
            while (p != root && null == p.children.get(c)) {
                p = p.fail;
            }
            p = p.children.get(c);
            // No match: start again from root
            if (null == p) p = root;
            // Check this node and every suffix reachable through fail pointers
            AcNode temp = p;
            while (temp != root) {
                if (temp.isEndingChar) {
                    return true;
                }
                temp = temp.fail;
            }
        }
        return false;
    }

Building the sensitive-word trie

    private void generate() {
        // The sensitive words used throughout the test
        String[] strs = new String[]{"so", "hel", "oh", "llo"};
        for (int i = 0; i < strs.length; i++) {
            insert(strs[i]);
        }
    }

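Test code

A Map is added here as a cache: if a word has already been matched, it can be replaced directly without running the automaton again, which improves efficiency. The value in mapCache is the number of times the key has occurred, so it also serves as a counter.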

    public static void main(String[] args) {
        AhoCorasick ac = new AhoCorasick();
        ac.generate();
        ac.buildFailurePointer();
        String txt = "he her hello home so see say just so so hello world";
        System.out.println("Main string");
        System.out.println("[" + txt + "]");
        System.out.println("Sensitive words:");
        System.out.println("so,hel,oh,llo");
        String[] strs = txt.split(" ");
        Map<String, Integer> mapCache = new HashMap<>();
        for (int i = 0; i < strs.length; i++) {
            if (mapCache.containsKey(strs[i])) {
                // Already known to be sensitive: bump the counter and mask it
                int index = mapCache.get(strs[i]);
                mapCache.put(strs[i], ++index);
                strs[i] = "****";
            } else {
                boolean res = ac.match(strs[i]);
                // On a match, replace the word with ****
                if (res) {
                    mapCache.put(strs[i], 1);
                    strs[i] = "****";
                }
            }
        }
        System.out.println("After the sensitive-word filter...");
        System.out.println(Arrays.toString(strs));
        for (String str:mapCache.keySet()){
            System.out.println("Occurrences of " + str + ": " + mapCache.get(str));
        }
    }
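
The test above splits the text into words and masks a whole word as soon as it contains a sensitive term. As a variation, and only as a sketch under assumptions, the automaton can also scan the unsplit text once and mask exactly the matched characters. MaskSketch and its wordLength field are hypothetical names that are not part of the article's AhoCorasick class.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class MaskSketch {
    static class Node {
        final Map<Character, Node> children = new HashMap<>();
        Node fail;
        int wordLength; // hypothetical field: length of the word ending here, 0 if none
    }

    private final Node root = new Node();

    MaskSketch(String... words) {
        for (String word : words) {
            Node p = root;
            for (char c : word.toCharArray()) {
                p = p.children.computeIfAbsent(c, k -> new Node());
            }
            p.wordLength = word.length();
        }
        // Breadth-first construction of the fail pointers
        Queue<Node> queue = new ArrayDeque<>();
        queue.offer(root);
        while (!queue.isEmpty()) {
            Node p = queue.poll();
            for (Map.Entry<Character, Node> e : p.children.entrySet()) {
                Node q = p.fail; // null only when p is the root
                while (q != null && !q.children.containsKey(e.getKey())) {
                    q = q.fail;
                }
                e.getValue().fail = (q == null) ? root : q.children.get(e.getKey());
                queue.offer(e.getValue());
            }
        }
    }

    /** Replaces every character covered by a matched word with '*'. */
    String mask(String text) {
        char[] out = text.toCharArray();
        Node p = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (p != root && !p.children.containsKey(c)) p = p.fail;
            p = p.children.getOrDefault(c, root);
            // Every node on the fail chain is a suffix of the text read so far
            for (Node t = p; t != root; t = t.fail) {
                if (t.wordLength > 0) {
                    for (int j = i - t.wordLength + 1; j <= i; j++) out[j] = '*';
                }
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        MaskSketch filter = new MaskSketch("so", "hel", "oh", "llo");
        System.out.println(filter.mask("he her hello home so see say just so so hello world"));
        // he her ***** home ** see say just ** ** ***** world
    }
}
```

In hello the matches hel and llo overlap, so all five characters are masked, while words without any sensitive term are left untouched.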

Test results

end

Your likes and follows are the greatest support for me. Thank you!
