Modifying RegexEntityExtractor for Chinese when using lookup tables in Rasa

Posted by SoCalledHBY on 2020-11-17

Environment: Python 3.7.9
             Rasa 2.0.6
             Rasa SDK 2.0.0

1. The problem

  While building a Chinese Q&A bot with Rasa, I ran into a problem: I added a form whose slot is filled with a from_entity mapping, and added RegexEntityExtractor to the pipeline. The entity is city; in nlu.yml I added only 鄭州 (Zhengzhou) as a training example for it, plus a lookup table named city, as shown below.

[screenshot: nlu.yml]
  In actual conversations, however, only 鄭州 could be recognized (by DIETClassifier); lookup-table entries that do not appear in the training data were never auto-filled, as shown below.

[screenshot: rasa shell nlu]

2. Analysis

  Why does RegexEntityExtractor fail to recognize entries from the lookup table? The official documentation gave no answer, so I went straight to the source code of RegexEntityExtractor.

# regex_entity_extractor.py

import rasa.nlu.utils.pattern_utils as pattern_utils
...

class RegexEntityExtractor(EntityExtractor):
    """Searches for entities in the user's message using the lookup tables and regexes
    defined in the training data."""

    ...

    def train(
        self,
        training_data: TrainingData,
        config: Optional[RasaNLUModelConfig] = None,
        **kwargs: Any,
    ) -> None:
        self.patterns = pattern_utils.extract_patterns(
            training_data,
            use_lookup_tables=self.component_config["use_lookup_tables"],
            use_regexes=self.component_config["use_regexes"],
            use_only_entities=True,
        )

        if not self.patterns:
            rasa.shared.utils.io.raise_warning(
                "No lookup tables or regexes defined in the training data that have "
                "a name equal to any entity in the training data. In order for this "
                "component to work you need to define valid lookup tables or regexes "
                "in the training data."
            )

    ...

    def _extract_entities(self, message: Message) -> List[Dict[Text, Any]]:
        """Extract entities of the given type from the given user message."""

        ...

        for pattern in self.patterns:
            matches = re.finditer(pattern["pattern"], message.get(TEXT), flags=flags)
            matches = list(matches)

            for match in matches:
                start_index = match.start()
                end_index = match.end()
                entities.append(
                    {
                        ENTITY_ATTRIBUTE_TYPE: pattern["name"],
                        ENTITY_ATTRIBUTE_START: start_index,
                        ENTITY_ATTRIBUTE_END: end_index,
                        ENTITY_ATTRIBUTE_VALUE: message.get(TEXT)[
                            start_index:end_index
                        ],
                    }
                )

        return entities
    ...

  Only the two most important methods are shown here.
  In _extract_entities, RegexEntityExtractor matches the regular expressions stored in self.patterns against the user's input, and in train we can see that self.patterns is built by calling pattern_utils.extract_patterns, so let's follow that call.
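
  To make this concrete, here is a toy run of the matching loop above. The pattern and message are made-up stand-ins (the real pattern is built from the lookup table by pattern_utils, traced next), and the plain-string dictionary keys stand in for Rasa's ENTITY_ATTRIBUTE_* constants:

# demo: what _extract_entities does with one pattern
import re

pattern = {"name": "city", "pattern": "(zhengzhou|beijing)"}  # one entry of self.patterns
text = "i live in beijing"                                    # user message

entities = []
for match in re.finditer(pattern["pattern"], text, flags=re.IGNORECASE):
    entities.append(
        {
            "entity": pattern["name"],                 # ENTITY_ATTRIBUTE_TYPE
            "start": match.start(),                    # ENTITY_ATTRIBUTE_START
            "end": match.end(),                        # ENTITY_ATTRIBUTE_END
            "value": text[match.start():match.end()],  # ENTITY_ATTRIBUTE_VALUE
        }
    )

print(entities)
# [{'entity': 'city', 'start': 10, 'end': 17, 'value': 'beijing'}]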

# pattern_utils.py

def _convert_lookup_tables_to_regex(
    training_data: TrainingData, use_only_entities: bool = False
) -> List[Dict[Text, Text]]:
    """Convert the lookup tables from the training data to regex patterns.
    Args:
        training_data: The training data.
        use_only_entities: If True only regex features with a name equal to a entity
          are considered.

    Returns:
        A list of regex patterns.
    """
    patterns = []
    for table in training_data.lookup_tables:
        if use_only_entities and table["name"] not in training_data.entities:
            continue
        regex_pattern = _generate_lookup_regex(table)
        lookup_regex = {"name": table["name"], "pattern": regex_pattern}
        patterns.append(lookup_regex)
    return patterns


def _generate_lookup_regex(lookup_table: Dict[Text, Union[Text, List[Text]]]) -> Text:
    """Creates a regex pattern from the given lookup table.

    The lookup table is either a file or a list of entries.

    Args:
        lookup_table: The lookup table.

    Returns:
        The regex pattern.
    """
    lookup_elements = lookup_table["elements"]

    # if it's a list, it should be the elements directly
    if isinstance(lookup_elements, list):
        elements_to_regex = lookup_elements
    # otherwise it's a file path.
    else:
        elements_to_regex = read_lookup_table_file(lookup_elements)

    # sanitize the regex, escape special characters
    elements_sanitized = [re.escape(e) for e in elements_to_regex]

    # regex matching elements with word boundaries on either side
    return "(\\b" + "\\b|\\b".join(elements_sanitized) + "\\b)"

...

def extract_patterns(
    training_data: TrainingData,
    use_lookup_tables: bool = True,
    use_regexes: bool = True,
    use_only_entities: bool = False,
) -> List[Dict[Text, Text]]:
    """Extract a list of patterns from the training data.

    The patterns are constructed using the regex features and lookup tables defined
    in the training data.

    Args:
        training_data: The training data.
        use_only_entities: If True only lookup tables and regex features with a name
          equal to a entity are considered.
        use_regexes: Boolean indicating whether to use regex features or not.
        use_lookup_tables: Boolean indicating whether to use lookup tables or not.

    Returns:
        The list of regex patterns.
    """
    if not training_data.lookup_tables and not training_data.regex_features:
        return []

    patterns = []

    if use_regexes:
        patterns.extend(_collect_regex_features(training_data, use_only_entities))
    if use_lookup_tables:
        patterns.extend(
            _convert_lookup_tables_to_regex(training_data, use_only_entities)
        )

    return patterns

  Only the three most important functions are shown here.
  In extract_patterns we can see that if the use_lookup_tables option is enabled, _convert_lookup_tables_to_regex is called, i.e. the lookup table is converted into a regex. The official documentation confirms that lookup tables are matched via generated regular expressions:

Lookup tables are lists of words used to generate case-insensitive regular expression patterns.

  Tracing further, _convert_lookup_tables_to_regex builds its regex by calling _generate_lookup_regex, and that is where the truth comes out. Look at the return statement: the returned pattern is not simply the lookup-table entries joined with |; every entry is additionally wrapped in \b word-boundary anchors, and that \b is the crux of the problem. After some searching (I am not great at regular expressions, bear with me), it turns out that \b matches only at a word boundary, i.e. between a word character and a non-word character: er\b matches the er in never but not the er in verb. Chinese text has no spaces between words, and Chinese characters themselves count as word characters, so an entry embedded in a Chinese sentence has no boundary on either side and the pattern never matches.
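
  This is easy to reproduce with plain re, building the pattern the same way _generate_lookup_regex does (the city list below is a made-up stand-in for the lookup table):

# demo: why \b breaks matching in Chinese text
import re

elements = ["鄭州", "北京", "上海"]
elements_sanitized = [re.escape(e) for e in elements]

# the pattern exactly as _generate_lookup_regex builds it
pattern = "(\\b" + "\\b|\\b".join(elements_sanitized) + "\\b)"
print(pattern)  # (\b鄭州\b|\b北京\b|\b上海\b)

text = "我想去鄭州"  # "I want to go to Zhengzhou" -- no spaces anywhere

# \b only matches between a word character and a non-word character.
# Chinese characters all count as word characters, so there is no boundary
# around 鄭州 inside the sentence and the lookup pattern finds nothing.
print(re.search(pattern, text))  # None

# dropping the \b anchors makes the same entry match again
pattern_no_boundary = "(" + "|".join(elements_sanitized) + ")"
print(re.search(pattern_no_boundary, text))
# <re.Match object; span=(3, 5), match='鄭州'>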

3. Solution

  With the mystery solved, all it takes is to remove the \b anchors from the return value of the _generate_lookup_regex method in rasa/nlu/utils/pattern_utils.py, which gives a RegexEntityExtractor that works for Chinese.
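
  A sketch of the patched function, identical to the version quoted above except for the final return, which now joins the entries with | only (re, the typing names, and read_lookup_table_file are already available inside pattern_utils.py):

# pattern_utils.py (patched)
def _generate_lookup_regex(lookup_table: Dict[Text, Union[Text, List[Text]]]) -> Text:
    """Creates a regex pattern from the given lookup table (patched for Chinese)."""
    lookup_elements = lookup_table["elements"]

    # if it's a list, it should be the elements directly
    if isinstance(lookup_elements, list):
        elements_to_regex = lookup_elements
    # otherwise it's a file path.
    else:
        elements_to_regex = read_lookup_table_file(lookup_elements)

    # sanitize the regex, escape special characters
    elements_sanitized = [re.escape(e) for e in elements_to_regex]

    # join with "|" only -- no \b anchors, so entries embedded in a Chinese
    # sentence (which contains no spaces) can still be matched
    return "(" + "|".join(elements_sanitized) + ")"

  Two caveats: this edits the installed package, so the change is lost whenever Rasa is upgraded (a custom component overriding this behaviour would be more durable), and dropping \b also removes the boundary protection for non-Chinese entries (the er/verb example above), so it only makes sense when your lookup tables are Chinese.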

4. Solution?

  Just as I thought everything had been cleared up, I happened to notice that large cities such as 北京 (Beijing) and 上海 (Shanghai) never showed this problem: they are auto-filled normally via DIETClassifier, and are even recognized without being added to the lookup table at all... My guess is that this has to do with the pre-trained model I am using; it needs further verification.
