1. The Default Analyzer
Analyzers were already introduced in an earlier post: ElasticSearch7.3 學習之倒排索引揭祕及初識分詞器(Analyzer). Here we only cover the default analyzer, the standard analyzer.
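As a quick refresher, you can exercise the standard analyzer without creating any index, because _analyze also works at the cluster level (runnable as-is in Kibana Dev Tools; the sample text is just an arbitrary illustration):

GET /_analyze
{
  "analyzer": "standard",
  "text": "The Quick-Brown Fox!"
}

It splits on word boundaries, lowercases every token, and discards punctuation, so this returns the tokens the, quick, brown, and fox.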
2. Modifying Analyzer Settings
First, define a custom analyzer named es_std: a standard analyzer with the predefined _english_ stop-word list enabled.
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
Response (the standard index-creation acknowledgement, identical in shape to the one shown in section 3):
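{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_index"
}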
Next, test the two different analyzers, starting with the default standard analyzer:
GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}
Response:
{ "tokens" : [ { "token" : "a", "start_offset" : 0, "end_offset" : 1, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "dog", "start_offset" : 2, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "is", "start_offset" : 6, "end_offset" : 8, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "in", "start_offset" : 9, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "the", "start_offset" : 12, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 4 }, { "token" : "house", "start_offset" : 16, "end_offset" : 21, "type" : "<ALPHANUM>", "position" : 5 } ] }
As you can see, the text is simply split into individual words.
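One thing the all-lowercase sample hides is that the standard analyzer also lowercases each token; a quick check with mixed-case input (the text here is just an illustrative variation):

GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "A DOG is in THE House"
}

All tokens come back lowercased. Next, test the custom es_std analyzer defined above: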
GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
Response:
{
  "tokens" : [
    {
      "token" : "dog",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "house",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}
This time only two tokens remain: the English stop words have all been removed. Note that the surviving tokens keep their original positions (1 and 5), so the positional gaps left by the removed words are preserved.
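To see exactly where in the analysis chain the stop words were dropped, the _analyze API also accepts an explain flag, which reports the token stream after the tokenizer and after each token filter (a minimal sketch against the index above):

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house",
  "explain": true
}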
3. Customizing Your Own Analyzer
First, delete the index created above:
DELETE /my_index
Then run the statement below. In short, the analysis chain is: strip HTML tags and map & to and (two character filters), tokenize with the standard tokenizer, then lowercase the tokens and remove the stop words a and the (two token filters). The analyzer is tested right after, so it is worth reading carefully.
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [
            "&=> and"
          ]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [
            "the",
            "a"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "&_to_and"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stopwords"
          ]
        }
      }
    }
  }
}
Response:
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_index"
}
As usual, test the analyzer:
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!"
}
Response:
{
  "tokens" : [
    {
      "token" : "tomandjerry",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "are",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "friend",
      "start_offset" : 16,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "in",
      "start_offset" : 23,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "house",
      "start_offset" : 30,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "haha",
      "start_offset" : 42,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}
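As an aside, if you want to experiment with a single building block without creating an index, _analyze also accepts inline definitions; for example, the following exercises only the & mapping character filter (a standalone sketch, not tied to any index):

GET /_analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "&=> and"
      ]
    }
  ],
  "text": "tom&jerry"
}

This returns the single token tomandjerry, matching the test above: whitespace around the mapping's key and value is trimmed, and the character filter runs before tokenization.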
Finally, when actually using the index, you can assign the custom analyzer to a specific field. The syntax is:
PUT /my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
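To confirm that the field picks up the custom analyzer, _analyze can also resolve the analyzer from a field's mapping instead of taking an analyzer name (content is the field mapped above):

GET /my_index/_analyze
{
  "field": "content",
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!"
}

This should return the same tokens as the my_analyzer test earlier.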