Elasticsearch Analysis: Analyzers
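- Analysis (text analysis) is the process of converting full text into a series of terms (tokens); it is also called tokenization
- Analysis is carried out by an Analyzer
- You can use Elasticsearch's built-in analyzers or define custom ones as needed
- Terms are produced when data is written, and a query string must be analyzed with the same analyzer when it is matched against them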
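Why the same analyzer is needed on both sides can be illustrated with a toy inverted index. This is a minimal sketch in plain Python, not Elasticsearch code: `analyze`, `build_index` and `search` are hypothetical helpers, and the search requires every query term to match, which is a simplification.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: split on non-letter characters and lowercase each term."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def build_index(docs, analyzer):
    """Map each term produced by the analyzer to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyzer(text):
            index[term].add(doc_id)
    return index


def search(index, query, analyzer):
    """Return ids of documents that contain every term of the analyzed query."""
    terms = analyzer(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


docs = {1: "Quick brown foxes", 2: "Lazy dogs"}
index = build_index(docs, analyze)

# Same analyzer on both sides: "QUICK" is lowercased and matches the indexed term.
print(search(index, "QUICK", analyze))    # {1}
# Query analyzed differently (no lowercasing): "QUICK" is not a term in the index.
print(search(index, "QUICK", str.split))  # set()
```

The index only contains what the analyzer emitted at write time, so a query that is analyzed differently looks up terms that were never stored.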
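Components of an Analyzer
An analyzer is the component dedicated to tokenization. It consists of three parts:
- Character Filters: preprocess the raw text, for example stripping HTML
- Tokenizer: splits the text into tokens according to rules
- Token Filters: post-process the tokens, for example lowercasing, removing stopwords, adding synonyms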
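The three stages above can be sketched as a small pipeline. This is an illustration in plain Python under simplifying assumptions, not how Elasticsearch implements them: the function names are hypothetical, the HTML stripping is a naive regex, and the stop word set is just the three words used as examples in this article.

```python
import re

STOPWORDS = {"the", "a", "is"}  # illustrative subset, not the real default list


def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Token filter: drop tokens that are in the stop word set."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text):
    """Run the character filter, then the tokenizer, then the token filters in order."""
    text = strip_html(text)
    tokens = tokenize(text)
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>The Quick fox is <b>lazy</b></p>"))
# ['quick', 'fox', 'lazy']
```

The order matters: the stop word filter runs after lowercasing, otherwise "The" would not match the lowercase entry "the".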
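Tokenizing with an Analyzer
Built-in analyzers:
- Simple Analyzer – splits on non-letter characters (symbols are discarded), lowercases
- Stop Analyzer – lowercases and removes stopwords (the, a, is)
- Whitespace Analyzer – splits on whitespace, does not lowercase
- Keyword Analyzer – no tokenization; the input is emitted as a single token
- Pattern Analyzer – splits by regular expression, default \W+ (non-word characters)
- Language – analyzers for more than 30 common languages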
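The differences between these analyzers are easier to see side by side. The functions below are rough Python approximations of the behavior described in the list, written for illustration only; they are not the Lucene implementations, and the stop word set is a small assumed subset.

```python
import re

STOPWORDS = {"the", "a", "is", "in"}  # small illustrative subset of English stop words


def simple(text):
    """Split on anything that is not a letter, then lowercase."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def stop(text):
    """Like simple, but also drop stop words."""
    return [t for t in simple(text) if t not in STOPWORDS]


def whitespace(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def keyword(text):
    """Emit the whole input as a single token."""
    return [text]


def pattern(text, regex=r"\W+"):
    """Split on the regex (non-word characters by default), then lowercase."""
    return [t.lower() for t in re.split(regex, text) if t]


text = "2 Quick brown-foxes in the evening."
for analyzer in (simple, stop, whitespace, keyword, pattern):
    print(f"{analyzer.__name__:<10} {analyzer(text)}")
```

On this input, only `simple` and `stop` drop the digit "2", because they keep letters only, while `whitespace` leaves "brown-foxes" and "evening." untouched.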
Comparing the output of different analyzers
standard: the standard analyzer (default)
GET _analyze
{
"analyzer": "standard",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
=================== Result ===================
{
"tokens" : [
{
"token" : "2",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
},
......
{
"token" : "evening",
"start_offset" : 62,
"end_offset" : 69,
"type" : "<ALPHANUM>",
"position" : 12
}
]
}
GET _analyze
{
"analyzer": "stop",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
=================== Result ===================
{
"tokens" : [
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 1
},
......
{
"token" : "evening",
"start_offset" : 62,
"end_offset" : 69,
"type" : "word",
"position" : 11
}
]
}
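The offsets and positions in the stop analyzer result can be reproduced with a short sketch. It assumes a letter-only tokenizer with lowercasing and a stop word set reduced to the two stop words that occur in the sample sentence; `analyze_with_offsets` is a hypothetical helper, not an Elasticsearch API.

```python
import re

STOPWORDS = {"in", "the"}  # the two stop words present in the sample sentence


def analyze_with_offsets(text, stopwords=frozenset()):
    """Return token dicts shaped like the entries of an _analyze response."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z]+", text)):
        term = match.group().lower()
        if term in stopwords:
            continue  # the token is dropped, but its position is still consumed
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
tokens = analyze_with_offsets(text, STOPWORDS)
print(tokens[0])   # first token: "running", offsets 2-9, position 0
print(tokens[-1])  # last token: "evening", offsets 62-69, position 11
print(len(tokens)) # 10
```

Ten tokens remain, yet the last one sits at position 11: the removed stop words leave gaps in the position sequence, which is why "evening" is at position 11 here and at position 12 in the standard analyzer result, where the leading "2" is also kept.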
More analyzer examples
#simple
GET _analyze
{
"analyzer": "simple",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#stop
GET _analyze
{
"analyzer": "stop",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#whitespace
GET _analyze
{
"analyzer": "whitespace",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#keyword
GET _analyze
{
"analyzer": "keyword",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#pattern
GET _analyze
{
"analyzer": "pattern",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#english
GET _analyze
{
"analyzer": "english",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#icu_analyzer
POST _analyze
{
"analyzer": "icu_analyzer",
"text": "他說的確實在理"
}
#standard
POST _analyze
{
"analyzer": "standard",
"text": "他說的確實在理"
}
POST _analyze
{
"analyzer": "icu_analyzer",
"text": "這個蘋果不大好吃"
}
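Note that neither icu_analyzer nor the ik analyzer ships with Elasticsearch 7.8.0. To use icu_analyzer, run
./bin/elasticsearch-plugin install analysis-icu
and then restart Elasticsearch. The ik analyzer is a separate plugin and has to be installed on its own.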
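More Chinese tokenizers
ik
Supports custom dictionaries and hot reloading of the dictionary
gitee.com/mirrors/elasticsearch-analysis-ik?_from=gitee_search
THULAC
A Chinese word segmentation toolkit from the Natural Language Processing and Computational Social Science Lab at Tsinghua University
gitee.com/puremilk/THULAC-Python?_from=gitee_search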
Further reading
- www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html
Source: ITPUB blog, link: http://blog.itpub.net/430/viewspace-2807063/. If you repost this article, please credit the source; otherwise legal action may be pursued.