ES Notes 13: Multi-Field Mappings and Configuring a Custom Analyzer in the Mapping

Posted by CrazyZard on 2019-10-26
  • Multi-field feature
    • Exact matching on a manufacturer-name field
      • Add a keyword sub-field
    • Use a different analyzer per sub-field
      • Different languages
      • Searching on a pinyin sub-field
      • A different analyzer can also be specified for search and for indexing (a mapping sketch follows this list)
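  • Mapping sketch for the multi-field items above, which are not demonstrated later in these notes: a text field with a keyword sub-field for exact matching, a pinyin sub-field that uses a different analyzer, and a field whose search analyzer differs from its index analyzer. The index and field names here are made up for illustration, and the pinyin analyzer assumes the elasticsearch-analysis-pinyin plugin is installed.

    PUT /products    // hypothetical index name
    {
      "mappings": {
        "properties": {
          "company": {
            "type": "text",
            "fields": {
              "keyword": {                 // exact match via company.keyword
                "type": "keyword"
              },
              "pinyin": {                  // requires the pinyin analysis plugin
                "type": "text",
                "analyzer": "pinyin"
              }
            }
          },
          "title": {
            "type": "text",
            "analyzer": "english",             // analyzer used at index time
            "search_analyzer": "standard"      // a different analyzer at query time
          }
        }
      }
    }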

  • Exact values: numbers / dates / a specific string (e.g. "Apple Store")
    • keyword in Elasticsearch
  • Full text: unstructured text data
    • text in Elasticsearch (see the _analyze comparison below)
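
  • A quick way to see the difference is to run the same string through the standard analyzer (the default for text fields) and the keyword analyzer; the string here is just an example:

    GET _analyze
    {
      "analyzer": "standard",
      "text": "Apple Store"
    }
    // two lowercased terms: "apple", "store"

    GET _analyze
    {
      "analyzer": "keyword",
      "text": "Apple Store"
    }
    // one unmodified term: "Apple Store"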

  • Elasticsearch builds an inverted index for every field
    • Exact values need no special analysis at index time

  • When Elasticsearch's built-in analyzers do not meet your needs, you can define a custom analyzer by combining different components:
    • Character Filter
    • Tokenizer
    • Token Filter
  • Character Filter: processes the text before it reaches the Tokenizer, e.g. adding, removing, or replacing characters. Multiple Character Filters can be configured; they affect the position and offset information the Tokenizer records.
  • Some built-in Character Filters:
    • HTML strip - removes HTML tags
    • Mapping - string replacement
    • Pattern replace - regex-based replacement
  • Tokenizer: splits the original text into terms (tokens) according to a set of rules
  • Built-in Tokenizers in Elasticsearch:
    • whitespace | standard | uax_url_email | pattern | keyword | path_hierarchy
  • A custom Tokenizer can also be implemented as a Java plugin
  • Token Filter: adds, modifies, or removes the terms produced by the Tokenizer
  • Built-in Token Filters:
    • lowercase | stop | synonym (adds synonyms)
  • char_filter example: html_strip removes HTML tags

    POST _analyze
    {
      "tokenizer": "keyword",
      "char_filter": ["html_strip"],
      "text": "<b>hello world</b>"
    }
    // result
    {
      "tokens" : [
        {
          "token" : "hello world",
          "start_offset" : 3,
          "end_offset" : 18,
          "type" : "word",
          "position" : 0
        }
      ]
    }
  • Use a mapping char filter to replace characters

    POST _analyze
    {
      "tokenizer": "standard",
      "char_filter": [
        {
          "type" : "mapping",
          "mappings" : [ "- => _" ]
        }
      ],
      "text": "123-456, I-test! test-990 650-555-1234"
    }
    // response
    {
      "tokens" : [
        {
          "token" : "123_456",
          "start_offset" : 0,
          "end_offset" : 7,
          "type" : "<NUM>",
          "position" : 0
        },
        {
          "token" : "I_test",
          "start_offset" : 9,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "test_990",
          "start_offset" : 17,
          "end_offset" : 25,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "650_555_1234",
          "start_offset" : 26,
          "end_offset" : 38,
          "type" : "<NUM>",
          "position" : 3
        }
      ]
    }
  • Use a char filter to replace emoticons

    POST _analyze
    {
      "tokenizer": "standard",
      "char_filter": [
        {
          "type" : "mapping",
          "mappings" : [ ":) => happy", ":( => sad" ]
        }
      ],
      "text": ["I am feeling :)", "Feeling :( today"]
    }
    // response
    {
      "tokens" : [
        {
          "token" : "I",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "am",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "feeling",
          "start_offset" : 5,
          "end_offset" : 12,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "happy",
          "start_offset" : 13,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "Feeling",
          "start_offset" : 16,
          "end_offset" : 23,
          "type" : "<ALPHANUM>",
          "position" : 104
        },
        {
          "token" : "sad",
          "start_offset" : 24,
          "end_offset" : 26,
          "type" : "<ALPHANUM>",
          "position" : 105
        },
        {
          "token" : "today",
          "start_offset" : 27,
          "end_offset" : 32,
          "type" : "<ALPHANUM>",
          "position" : 106
        }
      ]
    }
  • Regular-expression replacement with pattern_replace

    GET _analyze
    {
      "tokenizer": "standard",
      "char_filter": [
        {
          "type" : "pattern_replace",
          "pattern" : "http://(.*)",
          "replacement" : "$1"
        }
      ],
      "text" : "http://www.elastic.co"
    }
    // response
    {
      "tokens" : [
        {
          "token" : "www.elastic.co",
          "start_offset" : 0,
          "end_offset" : 21,
          "type" : "<ALPHANUM>",
          "position" : 0
        }
      ]
    }
  • Split by path (path_hierarchy tokenizer)
    POST _analyze
    {
      "tokenizer": "path_hierarchy",
      "text": "/user/ymruan/a"
    }
    // response
    {
      "tokens" : [
        {
          "token" : "/user",
          "start_offset" : 0,
          "end_offset" : 5,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "/user/ymruan",
          "start_offset" : 0,
          "end_offset" : 12,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "/user/ymruan/a",
          "start_offset" : 0,
          "end_offset" : 14,
          "type" : "word",
          "position" : 0
        }
      ]
    }
  • Token filters: stop and snowball
    GET _analyze
    {
      "tokenizer": "whitespace",
      "filter": ["stop", "snowball"],   // stop removes words like "on", "the", "a"
      "text": ["The girls in China are playing this game!"]
    }
    // response
    {
      "tokens" : [
        {
          "token" : "The",   // capitalized "The" is not removed by the stop filter
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "girl",
          "start_offset" : 4,
          "end_offset" : 9,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "China",
          "start_offset" : 13,
          "end_offset" : 18,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "play",
          "start_offset" : 23,
          "end_offset" : 30,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "game!",
          "start_offset" : 36,
          "end_offset" : 41,
          "type" : "word",
          "position" : 7
        }
      ]
    }
  • After adding lowercase, "The" is treated as a stop word and removed
    GET _analyze
    {
      "tokenizer": "whitespace",
      "filter": ["lowercase", "stop", "snowball"],
      "text": ["The girls in China are playing this game!"]
    }
    // response
    {
      "tokens" : [
        {
          "token" : "girl",
          "start_offset" : 4,
          "end_offset" : 9,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "china",
          "start_offset" : 13,
          "end_offset" : 18,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "play",
          "start_offset" : 23,
          "end_offset" : 30,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "game!",
          "start_offset" : 36,
          "end_offset" : 41,
          "type" : "word",
          "position" : 7
        }
      ]
    }
  • Standard format for a custom analyzer in the official docs
    According to the official documentation, the standard format for a custom analyzer is:
    PUT /my_index
    {
      "settings": {
        "analysis": {
          "char_filter": { ... custom character filters ... },   // character filters
          "tokenizer":   { ... custom tokenizers ... },          // tokenizers
          "filter":      { ... custom token filters ... },       // token filters
          "analyzer":    { ... custom analyzers ... }
        }
      }
    }
  • Define a custom analyzer
    # Define your own analyzer
    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_custom_analyzer": {
              "type": "custom",
              "char_filter": [
                "emoticons"
              ],
              "tokenizer": "punctuation",
              "filter": [
                "lowercase",
                "english_stop"
              ]
            }
          },
          "tokenizer": {
            "punctuation": {
              "type": "pattern",
              "pattern": "[ .,!?]"
            }
          },
          "char_filter": {
            "emoticons": {
              "type": "mapping",
              "mappings": [
                ":) => happy",
                ":( => sad"
              ]
            }
          },
          "filter": {
            "english_stop": {
              "type": "stop",
              "stopwords": "_english_"
            }
          }
        }
      }
    }
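
  • To check the result, the custom analyzer can be exercised with _analyze against the new index; the sentence below is just an illustration:

    POST my_index/_analyze
    {
      "analyzer": "my_custom_analyzer",
      "text": "I'm a :) person, and you?"
    }
    // the emoticons char filter maps ":)" to "happy", the punctuation tokenizer
    // splits on " .,!?", lowercase lowercases the terms, and english_stop drops
    // stop words such as "a" and "and", leaving: i'm, happy, person, you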
