Solr Study Notes (3): Tokenization

Posted by weixin_33935777 on 2016-11-14

Analyzer

https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters

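A tokenizer splits a sentence into individual words (tokens). A filter then screens the tokenizer's output. In Chinese, for example, it can drop words such as “的” and “呀”, which contribute little to the core meaning of a sentence; the English counterparts are words like "is" and "a".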
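To make the split between the two roles concrete, here is a minimal Python sketch. It is only an illustration of the idea, not Solr code: the function names and the tiny stop-word list are assumptions made up for this example.

```python
# Simplified illustration of "tokenize, then filter"; not Solr code.
STOP_WORDS = {"is", "a"}  # assumed stop-word list for this example


def tokenize(sentence):
    """Tokenizer step: split a sentence into tokens on whitespace."""
    return sentence.split()


def remove_stop_words(tokens):
    """Filter step: drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = remove_stop_words(tokenize("Solr is a search server"))
print(tokens)  # ['Solr', 'search', 'server']
```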

An analyzer has two parts: a tokenizer and a list of filters (token filters, which are applied in the order in which they are listed). For example:

<fieldType name="text_ik_analysis" class="solr.TextField" sortMissingLast="true" omitNorms="true" autoGeneratePhraseQueries="false">
        <analyzer type="index">
                <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true"/>
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
                <filter class="solr.LengthFilterFactory" min="2" max="20" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        </analyzer>
        <analyzer type="query">
                <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true"/>
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
                <filter class="solr.LengthFilterFactory" min="2" max="20" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        </analyzer>
</fieldType>
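The way such a chain runs, one tokenizer followed by filters in their listed order, can be sketched in Python. This is a rough model under stated assumptions, not how Solr implements it: a whitespace split stands in for the IK tokenizer, the `SYNONYMS` table is a made-up substitute for synonyms.txt, and tokens are modeled as simple `(position, term)` pairs.

```python
# Rough model of a tokenizer + ordered filter chain; not Solr code.
# Hypothetical synonym table, standing in for synonyms.txt.
SYNONYMS = {"tv": ["television"], "television": ["tv"]}


def tokenize(text):
    """Whitespace stand-in for the IK tokenizer: (position, term) pairs."""
    return list(enumerate(text.split()))


def synonym_filter(tokens):
    """Emit each term, followed by its synonyms at the same position."""
    out = []
    for pos, term in tokens:
        out.append((pos, term))
        out.extend((pos, syn) for syn in SYNONYMS.get(term, []))
    return out


def length_filter(tokens, min_len=2, max_len=20):
    """Keep only terms whose length lies within [min_len, max_len]."""
    return [(p, t) for p, t in tokens if min_len <= len(t) <= max_len]


def remove_duplicates(tokens):
    """Drop a token if the same term was already seen at the same position."""
    seen = set()
    out = []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            out.append(tok)
    return out


def analyze(text):
    """Run the tokenizer, then each filter in its listed order."""
    tokens = tokenize(text)
    for token_filter in (synonym_filter, length_filter, remove_duplicates):
        tokens = token_filter(tokens)
    return tokens


print(analyze("a tv"))  # [(1, 'tv'), (1, 'television')]
```

In this sketch "a" is dropped by the length filter because it is shorter than 2 characters, while "tv" gains a synonym at the same position. A repeated word at a different position, as in "tv tv", is not treated as a duplicate.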

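Tokenizer

Common tokenizers include:

  • KeywordTokenizerFactory: treats the entire input as a single token, whatever it contains
  • LetterTokenizerFactory: splits on letters and discards non-letter characters, e.g. "I can't" ==> "I", "can", "t"
  • WhitespaceTokenizerFactory: splits on whitespace, e.g. "I do" ==> "I", "do"
  • IKTokenizerFactory: the IK tokenizer (used for Chinese text)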
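The behavior of the first three tokenizers can be approximated in a few lines of Python. These are simplified stand-ins written for this note, not the Lucene implementations; in particular, the letter tokenizer here only recognizes ASCII letters, which is an assumption made to keep the example short. The IK tokenizer is dictionary-based and is not reproduced.

```python
import re


def keyword_tokenizer(text):
    """Treat the entire input as a single token."""
    return [text]


def letter_tokenizer(text):
    """Keep runs of ASCII letters and discard everything else (simplified)."""
    return re.findall(r"[A-Za-z]+", text)


def whitespace_tokenizer(text):
    """Split on whitespace."""
    return text.split()


print(keyword_tokenizer("I can't"))   # ["I can't"]
print(letter_tokenizer("I can't"))    # ['I', 'can', 't']
print(whitespace_tokenizer("I do"))   # ['I', 'do']
```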

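Filter

Common filters:

  • LowerCaseFilterFactory: converts uppercase letters to lowercase and leaves non-letter characters unchanged
  • SynonymFilterFactory: synonym matching, driven by a synonyms file
  • LengthFilterFactory: limits token length, keeping only tokens whose length is between min and max
  • RemoveDuplicatesTokenFilterFactory: removes duplicate tokens (same text at the same position)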
