【資料結構和演算法】Trie樹簡介及應用詳解

京東雲開發者發表於2023-01-30

作者:京東物流 馬瑞

1 什麼是Trie樹

1.1 Trie樹的概念

Trie樹,即字典樹,又稱單詞查詢樹或鍵樹,是一種樹形結構,典型應用是用於統計,排序和儲存大量的字串(但不僅限於字串),所以經常被搜尋引擎系統用於文字詞頻統計。它的優點是:利用字串的公共字首來減少查詢時間,最大限度地減少無謂的字串比較,查詢效率比雜湊樹高。

Trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. It is one of those data-structures that can be easily implemented.

1.2 Trie樹優點

最大限度地減少無謂的字串比較,查詢效率比較高。核心思想是空間換時間,利用字串的公共字首來降低查詢時間的開銷以達到提高效率的目的。

  1. 插入、查詢的時間複雜度均為O(N),其中N為字串長度。
  2. 空間複雜度是26^n級別的,非常龐大(可採用雙陣列實現改善)。

1.3 Trie樹的三個基本性質

  1. 根節點不包含字元,除根節點外每一個節點都只包含一個字元
  2. 從根節點到某一節點,路徑上經過的字元連線起來,為該節點對應的字串
  3. 每個節點的所有子節點包含的字元都不相同

2 Trie樹資料結構

以字串”hi”和”經海路”的資料為例:

Java的資料結構定義:

@Data
public class TrieTreeNode {
    private Character data;
    private Map<Character, TrieTreeNode> children;
    private boolean isEnd;
    //  字首,冗餘資訊,可選
    private String prefix;
    //  字尾,冗餘資訊,可選
    private String suffix;
}

如果只是處理26個英文字元,data可以透過children陣列是否為空來判斷。如果處理程式,預設children為空來判斷是否為最後一個節點,則isEnd欄位可以省略。
字首prefix和suffix可以在資料處理的時候,方便拿到當前節點字首和字尾資訊,如果不需要可以去除。

3 Trie樹在髒話過濾中的應用

3.1 髒話關鍵詞Keyword定義

public class KeyWord implements Serializable {
    /**
     * 關鍵詞內容
     */
    private String word;
//其他省略
}

3.2 關鍵詞查詢器

public class KWSeeker {

    /**
     * 所有的關鍵詞
     */
    private Set<KeyWord> sensitiveWords;

    /**
     * 關鍵詞樹
     */
    private Map<String, Map> wordsTree = new ConcurrentHashMap<String, Map>();

    /**
     * 最短的關鍵詞長度。用於對短於這個長度的文字不處理的判斷,以節省一定的效率
     */
    private int wordLeastLen = 0;

//其他處理方法,省略
}

3.3 字串構造一棵樹

/**
    * 將指定的詞構造到一棵樹中。
    * 
    * @param tree 構造出來的樹
    * @param word 指定的詞
    * @param KeyWord 對應的詞
    * @return
    */
public static Map<String, Map> makeTreeByWord(Map<String, Map> tree, String word,
        KeyWord KeyWord) {
    if (StringUtils.isEmpty(word)) {
        tree.putAll(EndTagUtil.buind(KeyWord));
        return tree;
    }
    String next = word.substring(0, 1);
    Map<String, Map> nextTree = tree.get(next);
    if (nextTree == null) {
        nextTree = new HashMap<String, Map>();
    }
    // 遞迴構造樹結構
    tree.put(next, makeTreeByWord(nextTree, word.substring(1), KeyWord));
    return tree;
}

對關鍵詞字串,逐個字元進行構建。

3.4 詞庫樹的生成

/**
    * 構造、生成詞庫樹。並返回所有敏感詞中最短的詞的長度。
    * 
    * @param sensitiveWords 詞庫
    * @param wordsTree 聚合詞庫的樹
    * @return 返回所有敏感詞中最短的詞的長度。
    */
public int generalTree(Set<KeyWord> sensitiveWords, Map<String, Map> wordsTree) {
    if (sensitiveWords == null || sensitiveWords.isEmpty() || wordsTree == null) {
        return 0;
    }

    wordsTreeTmp.clear();
    int len = 0;
    for (KeyWord kw : sensitiveWords) {
        if (len == 0) {
            len = kw.getWordLength();
        } else if (kw.getWordLength() < len) {
            len = kw.getWordLength();
        }
        AnalysisUtil
                .makeTreeByWord(wordsTreeTmp, StringUtils.trimToEmpty(kw.getWord()), kw);
    }
    wordsTree.clear();
    wordsTree.putAll(wordsTreeTmp);
    return len;
}

3.5 關鍵詞提取

/**
 * 將文字中的關鍵詞提取出來。
 */
public List<SensitiveWordResult> process(Map<String, Map> wordsTree, String text,
        AbstractFragment fragment, int minLen) {
    // 詞的前面一個字
    String pre = null;
    // 詞匹配的開始位置
    int startPosition = 0;
    // 返回結果
    List<SensitiveWordResult> rs = new ArrayList<SensitiveWordResult>();

    while (true) {
        try {
            if (wordsTree == null || wordsTree.isEmpty() || StringUtils.isEmpty(text)) {
                return rs;
            }
            if (text.length() < minLen) {
                return rs;
            }
            String chr = text.substring(0, 1);
            text = text.substring(1);
            Map<String, Map> nextWord = wordsTree.get(chr);
            // 沒有對應的下一個字,表示這不是關鍵詞的開頭,進行下一個迴圈
            if (nextWord == null) {
                pre = chr;
                continue;
            }

            List<KeyWord> keywords = Lists.newArrayList();
            KeyWord kw = AnalysisUtil.getSensitiveWord(chr, pre, nextWord, text, keywords);
            if (keywords == null || keywords.size() == 0) {
                // 沒有匹配到完整關鍵字,下一個迴圈
                pre = chr;
                continue;
            }
            for (KeyWord tmp : keywords) {
                // 同一個word多次出現記錄在一起
                SensitiveWordResult result = new SensitiveWordResult(startPosition, tmp.getWord());
                int index = rs.indexOf(result);
                if (index > -1) {
                    rs.get(index).addPosition(startPosition, tmp.getWord());
                } else {
                    rs.add(result);
                }
            }

            // 從text中去除當前已經匹配的內容,進行下一個迴圈匹配
            // 這行註釋了,避免"中國人",導致"國人"。搜尋不出來,逐個字元遍歷
            // text = text.substring(kw.getWordLength() - 1);
            pre = kw.getWord().substring(kw.getWordLength() - 1, kw.getWordLength());
            continue;
        } finally {
            if (pre != null) {
                startPosition = startPosition + pre.length();
            }
        }

    }
}

/**
 * 查詢文字開頭的詞是否在詞庫樹中,如果在,則返回對應的詞,如果不在,則返回null。return 返回找到的最長關鍵詞
 * 
 * @param append 追加的詞
 * @param pre 詞的前一個字,如果為空,則表示前面沒有內容
 * @param nextWordsTree 下一層樹
 * @param text 剩餘的文字內容
 * @param keywords 返回的keywords,可能多個
 * @return 返回找到的最長關鍵詞
 */
public static KeyWord getSensitiveWord(String append, String pre,
                Map<String, Map> nextWordsTree, String text, List<KeyWord> keywords) {
    if (nextWordsTree == null || nextWordsTree.isEmpty()) {
        return null;
    }

    Map<String, Object> endTag = nextWordsTree.get(EndTagUtil.TREE_END_TAG);
    // 原始文字已到末尾
    if (StringUtils.isEmpty(text)) {
        // 如果有結束符,則表示匹配成功,沒有,則返回null
        if (endTag != null) {
            keywords.add(checkPattern(getKeyWord(append, endTag), pre, null));
            return checkPattern(getKeyWord(append, endTag), pre, null);
        } else {
            return null;
        }
    }

    String next = text.substring(0, 1);
    String suffix = text.substring(0, 1);
    Map<String, Map> nextTree = nextWordsTree.get(next);

    // 沒有遇到endTag,繼續匹配
    if (endTag == null) {
        if (nextTree != null && nextTree.size() > 0) {
            // 沒有結束標誌,則表示關鍵詞沒有結束,繼續往下走。
            return getSensitiveWord(append + next, pre, nextTree, text.substring(1), keywords);
        }

        // 如果沒有下一個匹配的字,表示匹配結束!
        return null;
    } else { // endTag , 新增關鍵字
        KeyWord tmp = getKeyWord(append, endTag);
        keywords.add(checkPattern(tmp, pre, suffix));
    }

    // 有下一個匹配的詞則繼續匹配,一直取到最大的匹配關鍵字
    KeyWord tmp = null;
    if (nextTree != null && nextTree.size() > 0) {
        // 如果大於0,則表示還有更長的詞,繼續往下找
        tmp = getSensitiveWord(append + next, pre, nextTree, text.substring(1), keywords);
        if (tmp == null) {
            // 沒有更長的詞,則就返回這個詞。在返回之前,先判斷它是模糊的,還是精確的
            tmp = getKeyWord(append, endTag);
        }
        return checkPattern(tmp, pre, suffix);
    }

    // 沒有往下的詞了,返回該關鍵詞。
    return checkPattern(getKeyWord(append, endTag), pre, suffix);

}

思路是對某個字串text,逐個字元ch,獲取ch對應的詞庫樹的children,然後獲取匹配到的單個或多個結果,將匹配到的關鍵詞在text中的開始和結束下標進行記錄,如後續需要html標記,或者字元替換可直接使用。如果未能在詞庫樹中找到對應的ch的children,或者詞庫樹的children未能匹配到去除ch的子字串,則繼續迴圈。具體可再詳細讀一下程式碼。

4 Radix Tree的應用

4.1 RAX - Redis Tree

Redis實現了不定長壓縮字首的radix tree,用在叢集模式下儲存slot對應的的所有key資訊。

/* Representation of a radix tree as implemented in this file, that contains
 * the strings "foo", "foobar" and "footer" after the insertion of each
 * word. When the node represents a key inside the radix tree, we write it
 * between [], otherwise it is written between ().
 *
 * This is the vanilla representation:
 *
 *              (f) ""
 *                \
 *                (o) "f"
 *                  \
 *                  (o) "fo"
 *                    \
 *                  [t   b] "foo"
 *                  /     \
 *         "foot" (e)     (a) "foob"
 *                /         \
 *      "foote" (r)         (r) "fooba"
 *              /             \
 *    "footer" []             [] "foobar"
 *
 * However, this implementation implements a very common optimization where
 * successive nodes having a single child are "compressed" into the node
 * itself as a string of characters, each representing a next-level child,
 * and only the link to the node representing the last character node is
 * provided inside the representation. So the above representation is turned
 * into:
 *
 *                  ["foo"] ""
 *                     |
 *                  [t   b] "foo"
 *                  /     \
 *        "foot" ("er")    ("ar") "foob"
 *                 /          \
 *       "footer" []          [] "foobar"
 *
 * However this optimization makes the implementation a bit more complex.
 * For instance if a key "first" is added in the above radix tree, a
 * "node splitting" operation is needed, since the "foo" prefix is no longer
 * composed of nodes having a single child one after the other. This is the
 * above tree and the resulting node splitting after this event happens:
 *
 *
 *                    (f) ""
 *                    /
 *                 (i o) "f"
 *                 /   \
 *    "firs"  ("rst")  (o) "fo"
 *              /        \
 *    "first" []       [t   b] "foo"
 *                     /     \
 *           "foot" ("er")    ("ar") "foob"
 *                    /          \
 *          "footer" []          [] "foobar"
 *
 * Similarly after deletion, if a new chain of nodes having a single child
 * is created (the chain must also not include nodes that represent keys),
 * it must be compressed back into a single node.
 *
 */
#define RAX_NODE_MAX_SIZE ((1<<29)-1)
typedef struct raxNode {
    uint32_t iskey:1;     /* Does this node contain a key? */
    uint32_t isnull:1;    /* Associated value is NULL (don't store it). */
    uint32_t iscompr:1;   /* Node is compressed. */
    uint32_t size:29;     /* Number of children, or compressed string len. */
    unsigned char data[];
} raxNode;

typedef struct rax {
    raxNode *head;
    uint64_t numele;
    uint64_t numnodes;
} rax;

typedef struct raxStack {
    void **stack; /* Points to static_items or an heap allocated array. */
    size_t items, maxitems; /* Number of items contained and total space. */
    void *static_items[RAX_STACK_STATIC_ITEMS];
    int oom; /* True if pushing into this stack failed for OOM at some point. */
} raxStack;

如Redis原始碼中的註釋所寫,RAX進行了一些最佳化,並不會將一個字串直接按照每個字元進行樹的構建,而是在Insert有衝突時節點分割處理,在Delete時如果子節點和父節點都只有一個,則需要進行合併操作。
對於RAX有興趣的同學,可以看一下rax.h、rax.c的相關程式碼。

4.2 Linux核心

Linux radix樹最廣泛的用途是用於記憶體管理,結構address_space透過radix樹跟蹤繫結到地址對映上的核心頁,該radix樹允許記憶體管理程式碼快速查詢標識為dirty或writeback的頁。Linux radix樹的API函式在lib/radix-tree.c中實現。
Linux基數樹(radix tree)是將指標與long整數鍵值相關聯的機制,它儲存有效率,並且可快速查詢,用於指標與整數值的對映(如:IDR機制)、記憶體管理等。

struct radix_tree_node {
        unsigned int    path;
        unsigned int    count;
        union {
                struct {
                        struct radix_tree_node *parent;
                        void *private_data;
                };
                struct rcu_head rcu_head;
        };
        /* For tree user */
        struct list_head private_list;
        void __rcu      *slots[RADIX_TREE_MAP_SIZE];
        unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
};

關於Linux核心使用Radix Tree的具體程式碼,有興趣的同學可以繼續深入。

5 總結

Trie樹在單詞搜尋、統計、排序等領域有大量的應用。文章從基礎概念到具體的髒話過濾的應用、Redis的RAX和Linux核心的Radix Tree對Trie樹做了介紹。資料結構和演算法是程式高效能的基礎,本文拋磚引玉,希望大家對Trie樹有所瞭解,並在未來開發過程實踐和應用Trie樹解決中類似情景的問題。

相關文章