I. The Inverted Index
1. Building an Inverted Index
Suppose we have the following two sentences, doc1 and doc2:
doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
First the English text is tokenized; this first pass produces the initial inverted index:
term | doc1 | doc2 |
---|---|---|
I | * | * |
really | * | |
liked | * | * |
my | * | * |
small | * | |
dogs | * | * |
and | * | |
think | * | |
mom | * | * |
also | * | |
them | * | |
He | | * |
never | | * |
any | | * |
so | | * |
hope | | * |
that | | * |
will | | * |
not | | * |
expect | | * |
me | | * |
to | | * |
him | | * |
Next comes the search. Suppose the search phrase is "mother like little dog". It is tokenized into the four terms mother, like, little, dog, and the search against the index above returns nothing, which is not what we want. But mom and mother are synonyms; to a human reader they mean exactly the same thing. So the obvious question is: can we make these words carry the same meaning inside the index? That is precisely what normalizing the terms does.
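The failing search can be sketched in a few lines of Python. This is a minimal illustration, not Elasticsearch code: tokenization is a naive split on whitespace with punctuation stripped, and the search requires every query term to match verbatim.

```python
# Build a naive inverted index over the two example documents,
# then run the keyword search that the text says comes up empty.
docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

index = {}
for doc_id, text in docs.items():
    for word in text.replace(",", "").replace(".", "").split():
        index.setdefault(word, set()).add(doc_id)

def search(query):
    # A document matches only if it contains every query term verbatim.
    results = None
    for term in query.split():
        postings = index.get(term, set())
        results = postings if results is None else results & postings
    return results or set()

print(sorted(search("mother like little dog")))  # [] -- no exact-term matches
```

None of the four query terms (mother, like, little, dog) exists in the index as stored, so the result set is empty.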
2. Rebuilding the Inverted Index
Normalization: when the inverted index is built, each extracted token is processed further so that a later search is more likely to find the related documents. Examples include tense conversion, singular/plural conversion, synonym conversion, and case conversion:
mom --> mother
liked --> like
small --> little
dogs --> dog
Rebuilding the inverted index with normalization applied gives:
word | doc1 | doc2 | normalization |
---|---|---|---|
I | * | * | |
really | * | | |
like | * | * | liked --> like |
my | * | * | |
little | * | | small --> little |
dog | * | * | dogs --> dog |
and | * | | |
think | * | | |
mother | * | * | mom --> mother |
also | * | | |
them | * | | |
He | | * | |
never | | * | |
any | | * | |
so | | * | |
hope | | * | |
that | | * | |
will | | * | |
not | | * | |
expect | | * | |
me | | * | |
to | | * | |
him | | * | |
3. Searching Again
Searching again with mother liked little dog now succeeds, because the query is tokenized and normalized by the same rules as the documents:
mother --> mother
liked --> like
little --> little
dog --> dog
All four normalized terms exist in the rebuilt index, so both doc1 and doc2 are returned.
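The key point, that the same normalization is applied at index time and at query time, can be sketched as follows. The synonym/stemming table here is hand-written and purely illustrative; Elasticsearch does this with configurable token filters.

```python
# Illustrative normalization table (not Elasticsearch's real configuration).
NORMALIZE = {"mom": "mother", "liked": "like", "small": "little", "dogs": "dog"}

def analyze(text):
    # Naive analysis: strip punctuation, lowercase, split, then normalize.
    tokens = text.replace(",", "").replace(".", "").lower().split()
    return [NORMALIZE.get(tok, tok) for tok in tokens]

docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

index = {}
for doc_id, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

def search(query):
    # The query goes through the SAME analyzer as the documents.
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

print(sorted(search("mother liked little dog")))  # ['doc1', 'doc2']
```

Because both sides of the match are normalized identically, the query term liked becomes like and hits the stored term like, and likewise for the other terms.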
II. The Analyzer
1. What Is an Analyzer?
Role: in short, splitting text into words. Given a sentence, the analyzer breaks it into individual tokens and applies normalization to each token (tense conversion, singular/plural conversion, and so on).
The benefit of normalization is improved recall.
Recall: increasing the number of relevant results a search is able to find.
An analyzer has three parts:
- character filter: preprocesses the text before tokenization; the most common examples are stripping HTML tags (`<span>hello</span>` --> hello) and expanding & --> and (I&you --> I and you)
- tokenizer: splits text into tokens; hello you and me --> hello, you, and, me
- token filter: lowercase (case conversion), stop word (stop-word removal), synonym (synonym handling); for example dogs --> dog, liked --> like, Tom --> tom, a/the/an --> dropped, mother --> mom, small --> little
The analyzer matters a great deal: it runs a piece of text through all of this processing, and only the final result is used to build the inverted index.
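The three-stage pipeline above can be sketched in Python. This is a toy model of the structure only: the stop-word list and synonym table below are made-up examples, not Elasticsearch's real filter configuration.

```python
import re

def character_filter(text):
    # Preprocessing before tokenization.
    text = re.sub(r"<[^>]+>", "", text)   # strip HTML tags
    return text.replace("&", " and ")     # & --> and

def tokenizer(text):
    return text.split()                   # naive whitespace tokenization

STOP_WORDS = {"a", "an", "the"}           # illustrative stop-word list
SYNONYMS = {"mom": "mother", "small": "little"}  # illustrative synonyms

def token_filters(tokens):
    out = []
    for tok in tokens:
        tok = tok.lower()                 # lowercase filter
        if tok in STOP_WORDS:             # stop-word filter
            continue
        out.append(SYNONYMS.get(tok, tok))  # synonym filter
    return out

def analyze(text):
    # character filter -> tokenizer -> token filters, in that order.
    return token_filters(tokenizer(character_filter(text)))

print(analyze("<b>Mom</b> bought a small dog"))
# ['mother', 'bought', 'little', 'dog']
```

Note that the stages run in a fixed order: character filters see the raw text, the tokenizer sees filtered text, and token filters see individual tokens.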
2. Built-in Analyzers
Example sentence: Set the shape to semi-transparent by calling set_trans(5)
standard analyzer (the default): set, the, shape, to, semi, transparent, by, calling, set_trans, 5
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (a language-specific analyzer, e.g. english): set, shape, semi, transpar, call, set_tran, 5
Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.4/analysis-analyzers.html
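Each of these built-in analyzers can be compared directly with the `_analyze` API (the same request shape used in the next section), assuming a running Elasticsearch 7.x node:

```
GET /_analyze
{
  "analyzer": "whitespace",
  "text": "Set the shape to semi-transparent by calling set_trans(5)"
}
```

Swapping `"whitespace"` for `"standard"`, `"simple"`, or `"english"` reproduces each of the token lists above.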
III. Testing an Analyzer

```
GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze 80"
}
```

Response:
```json
{
  "tokens" : [
    {
      "token" : "text",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "to",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "analyze",
      "start_offset" : 8,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "80",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "<NUM>",
      "position" : 3
    }
  ]
}
```
token: the term actually stored in the index
position: the position of this term in the token sequence of the original text
start_offset/end_offset: the character offsets of the token within the original string
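A small sketch makes the offset fields concrete: slicing the original string with `start_offset:end_offset` recovers the source text of each token, while `token` holds the (here lowercased) stored form. The token list below is copied from the response above.

```python
# What start_offset/end_offset mean: they index into the original string.
text = "Text to analyze 80"
tokens = [
    {"token": "text", "start_offset": 0, "end_offset": 4, "position": 0},
    {"token": "to", "start_offset": 5, "end_offset": 7, "position": 1},
    {"token": "analyze", "start_offset": 8, "end_offset": 15, "position": 2},
    {"token": "80", "start_offset": 16, "end_offset": 18, "position": 3},
]

for t in tokens:
    original = text[t["start_offset"]:t["end_offset"]]
    # The stored term is the lowercased form of the original slice.
    assert original.lower() == t["token"]
    print(t["position"], t["token"], repr(original))
```

This is why highlighting works: Elasticsearch can map a matched term back to the exact character span in the source document.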