- Multi-field feature (see the mapping sketch after this list)
  - Exact matching on the vendor name
    - Add a keyword sub-field
  - Use different analyzers
    - Different languages
    - Searching on a pinyin field
    - A different analyzer can also be specified for searching and for indexing
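  As a sketch of what such a multi-field mapping might look like (the index name my_products and the sub-field names are made up for illustration; the pinyin analyzer assumes the analysis-pinyin plugin is installed):

  ```
  PUT my_products
  {
    "mappings": {
      "properties": {
        "company": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "keyword": {
              "type": "keyword"              // exact match on the vendor name
            },
            "english": {
              "type": "text",
              "analyzer": "english",         // a different analyzer per sub-field
              "search_analyzer": "english"   // search_analyzer may differ from the index-time analyzer
            },
            "pinyin": {
              "type": "text",
              "analyzer": "pinyin"           // assumes the analysis-pinyin plugin is installed
            }
          }
        }
      }
    }
  }
  ```

  A term query on company.keyword then matches the whole vendor name exactly, while match queries on company or company.english go through their respective analyzers.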
- Exact matching on the vendor name
  - Exact Values: include numbers / dates / a specific string (e.g. "Apple Store")
    - keyword in Elasticsearch
  - Full text: unstructured text data
    - text in Elasticsearch
  - Elasticsearch builds an inverted index for each field
    - Exact Values need no special analysis (tokenization) at index time, as the _analyze comparison below shows
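  A quick way to see the difference is to run the same exact value through the keyword analyzer and the standard analyzer (a rough sketch; the token lists are abbreviated):

  ```
  GET _analyze
  {
    "analyzer": "keyword",
    "text": "Apple Store"
  }
  // one token: "Apple Store" (the exact value is kept as a whole)

  GET _analyze
  {
    "analyzer": "standard",
    "text": "Apple Store"
  }
  // two tokens: "apple", "store" (full text is split and lowercased)
  ```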
- When the analyzers that ship with Elasticsearch cannot meet your needs, you can define a custom analyzer by combining different components yourself:
  - Character Filter
  - Tokenizer
  - Token Filter
- Character Filter: processes the text before it reaches the Tokenizer, e.g. adding, deleting, or replacing characters. Multiple Character Filters can be configured; they affect the position and offset information seen by the Tokenizer.
  - Some built-in Character Filters:
    - HTML strip - removes HTML tags
    - Mapping - string replacement
    - Pattern replace - regex-based replacement
- Tokenizer: splits the original text into terms (tokens) according to certain rules
  - Tokenizers built into Elasticsearch: whitespace | standard | uax_url_email | pattern | keyword | path_hierarchy (a uax_url_email sketch follows this list)
  - You can also develop a plugin in Java to implement your own Tokenizer
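  As a small sketch of one of the built-in tokenizers, uax_url_email keeps URLs and e-mail addresses as single tokens instead of splitting them (the sample text and the expected tokens are illustrative):

  ```
  POST _analyze
  {
    "tokenizer": "uax_url_email",
    "text": "Send a mail to admin@example.com or visit https://www.elastic.co"
  }
  // expected tokens (roughly):
  // "Send", "a", "mail", "to", "admin@example.com", "or", "visit", "https://www.elastic.co"
  ```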
- Token Filter: adds, modifies, or deletes the terms output by the Tokenizer
  - Built-in Token Filters: lowercase | stop | synonym (adds synonyms; a synonym sketch follows below)
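  Since the synonym filter has no example further down, here is a minimal sketch of defining it inline in the _analyze API (the synonym pair laptop/notebook and the sample text are made up; the exact token output may vary):

  ```
  GET _analyze
  {
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      {
        "type": "synonym",
        "synonyms": ["laptop, notebook"]
      }
    ],
    "text": "I bought a new laptop"
  }
  // "laptop" and "notebook" are expected to be emitted at the same position
  ```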
- char_filter: html_strip

  ```
  POST _analyze
  {
    "tokenizer": "keyword",
    "char_filter": ["html_strip"],
    "text": "<b>hello world</b>"
  }

  // Result
  {
    "tokens" : [
      {
        "token" : "hello world",
        "start_offset" : 3,
        "end_offset" : 18,
        "type" : "word",
        "position" : 0
      }
    ]
  }
  ```
- Using a char filter for string replacement

  ```
  POST _analyze
  {
    "tokenizer": "standard",
    "char_filter": [
      {
        "type": "mapping",
        "mappings": ["- => _"]
      }
    ],
    "text": "123-456, I-test! test-990 650-555-1234"
  }

  // Result
  {
    "tokens" : [
      { "token" : "123_456", "start_offset" : 0, "end_offset" : 7, "type" : "<NUM>", "position" : 0 },
      { "token" : "I_test", "start_offset" : 9, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 1 },
      { "token" : "test_990", "start_offset" : 17, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 2 },
      { "token" : "650_555_1234", "start_offset" : 26, "end_offset" : 38, "type" : "<NUM>", "position" : 3 }
    ]
  }
  ```
- Using a char filter to replace emoticons

  ```
  POST _analyze
  {
    "tokenizer": "standard",
    "char_filter": [
      {
        "type": "mapping",
        "mappings": [":) => happy", ":( => sad"]
      }
    ],
    "text": ["I am felling :)", "Feeling :( today"]
  }

  // Result
  {
    "tokens" : [
      { "token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "<ALPHANUM>", "position" : 0 },
      { "token" : "am", "start_offset" : 2, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 1 },
      { "token" : "felling", "start_offset" : 5, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 2 },
      { "token" : "happy", "start_offset" : 13, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 3 },
      { "token" : "Feeling", "start_offset" : 16, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 104 },
      { "token" : "sad", "start_offset" : 24, "end_offset" : 26, "type" : "<ALPHANUM>", "position" : 105 },
      { "token" : "today", "start_offset" : 27, "end_offset" : 32, "type" : "<ALPHANUM>", "position" : 106 }
    ]
  }
  ```
- Regular expression replacement (pattern_replace)

  ```
  GET _analyze
  {
    "tokenizer": "standard",
    "char_filter": [
      {
        "type": "pattern_replace",
        "pattern": "http://(.*)",
        "replacement": "$1"
      }
    ],
    "text": "http://www.elastic.co"
  }

  // Result
  {
    "tokens" : [
      { "token" : "www.elastic.co", "start_offset" : 0, "end_offset" : 21, "type" : "<ALPHANUM>", "position" : 0 }
    ]
  }
  ```
- Splitting by path (path_hierarchy)

  ```
  POST _analyze
  {
    "tokenizer": "path_hierarchy",
    "text": "/user/ymruan/a"
  }

  // Result
  {
    "tokens" : [
      { "token" : "/user", "start_offset" : 0, "end_offset" : 5, "type" : "word", "position" : 0 },
      { "token" : "/user/ymruan", "start_offset" : 0, "end_offset" : 12, "type" : "word", "position" : 0 },
      { "token" : "/user/ymruan/a", "start_offset" : 0, "end_offset" : 14, "type" : "word", "position" : 0 }
    ]
  }
  ```
- token_filters: stop and snowball

  ```
  GET _analyze
  {
    "tokenizer": "whitespace",
    "filter": ["stop", "snowball"],   // stop removes words such as "on", "the", "a"
    "text": ["The gilrs in China are playing this game!"]
  }

  // Result
  {
    "tokens" : [
      { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 },   // the capitalized "The" is not filtered out
      { "token" : "gilr", "start_offset" : 4, "end_offset" : 9, "type" : "word", "position" : 1 },
      { "token" : "China", "start_offset" : 13, "end_offset" : 18, "type" : "word", "position" : 3 },
      { "token" : "play", "start_offset" : 23, "end_offset" : 30, "type" : "word", "position" : 5 },
      { "token" : "game!", "start_offset" : 36, "end_offset" : 41, "type" : "word", "position" : 7 }
    ]
  }
  ```
- After adding lowercase, "The" is treated as a stopword and removed

  ```
  GET _analyze
  {
    "tokenizer": "whitespace",
    "filter": ["lowercase", "stop", "snowball"],
    "text": ["The gilrs in China are playing this game!"]
  }

  // Result
  {
    "tokens" : [
      { "token" : "gilr", "start_offset" : 4, "end_offset" : 9, "type" : "word", "position" : 1 },
      { "token" : "china", "start_offset" : 13, "end_offset" : 18, "type" : "word", "position" : 3 },
      { "token" : "play", "start_offset" : 23, "end_offset" : 30, "type" : "word", "position" : 5 },
      { "token" : "game!", "start_offset" : 36, "end_offset" : 41, "type" : "word", "position" : 7 }
    ]
  }
  ```
- The standard format for a custom analyzer in the official docs

  According to the official documentation, the standard layout of a custom analyzer is:

  ```
  PUT /my_index
  {
    "settings": {
      "analysis": {
        "char_filter": { ... custom character filters ... },   // character filters
        "tokenizer":   { ... custom tokenizers ... },          // tokenizers
        "filter":      { ... custom token filters ... },       // token filters
        "analyzer":    { ... custom analyzers ... }
      }
    }
  }
  ```
- Defining a custom analyzer

  ```
  # Define your own analyzer
  PUT my_index
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "my_custom_analyzer": {
            "type": "custom",
            "char_filter": ["emoticons"],
            "tokenizer": "punctuation",
            "filter": ["lowercase", "english_stop"]
          }
        },
        "tokenizer": {
          "punctuation": {
            "type": "pattern",
            "pattern": "[ .,!?]"
          }
        },
        "char_filter": {
          "emoticons": {
            "type": "mapping",
            "mappings": [":) => happy", ":( => sad"]
          }
        },
        "filter": {
          "english_stop": {
            "type": "stop",
            "stopwords": "_english_"
          }
        }
      }
    }
  }
  ```
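  Once the index is created, the custom analyzer can be tried out with the _analyze API; the sample text below and the expected output are only a sketch:

  ```
  POST my_index/_analyze
  {
    "analyzer": "my_custom_analyzer",
    "text": "I'm a :) person, and you?"
  }
  // roughly: "i'm", "happy", "person", "you"
  // ":)" is mapped to "happy" by the emoticons char_filter, the punctuation tokenizer
  // splits on " .,!?", lowercase lowercases the terms, and english_stop drops "a" and "and"
  ```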