Content Search and Hot-Topic Push Based on SolrCloud

Published by TingYun APM on 2016-05-07

Original article from the TingYun Tech Blog: http://blog.tingyun.com/web/article/detail/556

What Is a Hot Topic

A hot topic has two dimensions: timeliness and audience.

It is content whose user attention climbs from low to high and then falls back. Hot topics can be general or category-specific: healthcare and elder care concern the whole population, while topics like tech or autos interest only particular groups.

Push Conditions

Search frequency reaches a certain absolute threshold.

Search frequency multiplies by a certain factor within a time window. For example, growing from 1,000 to 1,000,000 searches within a week meets the push criterion.
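
The growth condition can be sketched as a simple check over weekly counts. The thresholds below (an absolute floor of one million and a 1,000× growth factor, matching the example) are illustrative, not from a production system:

```java
public class HotTopicDetector {
    // illustrative thresholds: absolute weekly floor and week-over-week growth factor
    static final long MIN_WEEKLY_SEARCHES = 1_000_000L;
    static final double MIN_GROWTH_FACTOR = 1000.0;

    /** A keyword qualifies for push when its weekly search count crosses the
     *  absolute threshold, or multiplies enough over the previous week. */
    public static boolean shouldPush(long lastWeek, long thisWeek) {
        if (thisWeek >= MIN_WEEKLY_SEARCHES) return true;
        return lastWeek > 0 && (double) thisWeek / lastWeek >= MIN_GROWTH_FACTOR;
    }
}
```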

Problem Background

Auto-suggest is standard equipment for any search application, with two main goals:

1. Provide a better user experience by reducing input effort.

2. Keep users from entering the wrong term by steering their input toward the correct one, which also reduces the burden on synonym handling.

Requirements

  • Fast search over massive data sets

  • Auto-suggest support

  • Automatic spelling correction

  • Typing 舌尖 should auto-suggest 舌尖上的中國, 舌尖上的小吃, and so on

  • Support pinyin, abbreviations, and misspellings, e.g. shejian, sjsdzg, shenjianshang, shejiashang

  • Use query history: rank the user's most frequently searched queries first

  • Push category-specific hot topics

Solution

Indexing

Full-text retrieval in Solr involves two steps:

1. Building the index

2. Searching the index

How is the index built, and how is it then searched?

Solr's strategy is the inverted index. What is an inverted index, and how do Solr and Lucene implement it?

These three articles cover the topic thoroughly:

http://www.cnblogs.com/forfuture1978/p/3940965.html

http://www.cnblogs.com/forfuture1978/p/3944583.html

http://www.cnblogs.com/forfuture1978/p/3945755.html
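
As a quick taste of what those articles explain: an inverted index maps each term to the list of documents that contain it, so a query is answered by intersecting postings lists instead of scanning documents. The toy sketch below shows only the idea; Lucene's real postings format is far more compact:

```java
import java.util.*;

public class InvertedIndex {
    // term -> sorted set of document ids containing the term (a postings list)
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Documents matching ALL query terms: intersect their postings lists. */
    public Set<Integer> search(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```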

Converting Hanzi to Pinyin

Users may type hanzi, digits, English, pinyin, special characters, and so on. Since we need pinyin suggestions, we must convert hanzi to pinyin; in Java, the pinyin4j library handles this conversion.

Extracting Pinyin Abbreviations

Since abbreviations must also be supported, we extract the initial letters while converting hanzi to pinyin, e.g. "shejian" ---> "sj".
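
Assuming each character has already been converted to a pinyin syllable (pinyin4j produces exactly such per-character output), building the full spelling and the initial-letter abbreviation is a one-pass reduction. The helper below deliberately takes an already-converted syllable array, so it needs no third-party library:

```java
public class PinyinAbbrev {
    /** Joins per-character syllables into the full pinyin, e.g. ["she","jian"] -> "shejian". */
    public static String fullPinyin(String[] syllables) {
        StringBuilder sb = new StringBuilder();
        for (String s : syllables) sb.append(s);
        return sb.toString();
    }

    /** Takes the first letter of each syllable, e.g. ["she","jian"] -> "sj". */
    public static String abbreviation(String[] syllables) {
        StringBuilder sb = new StringBuilder();
        for (String s : syllables) {
            if (!s.isEmpty()) sb.append(s.charAt(0));
        }
        return sb.toString();
    }
}
```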

Auto-Suggest

Option 1:

Solr has a built-in smart-suggestion feature, the Suggest module. It can suggest from a plain dictionary file of suggestion terms, or build a suggestion dictionary from one of the index's fields. Usage notes: http://wiki.apache.org/solr/Suggester

Suggest has a drawback: it ranks entirely by freq, so results are ordered solely by how often a term occurs in the index, with no regard for how often users actually search for it, and we need that search frequency.

We can supply a custom SuggestWordScoreComparator and override its compare(SuggestWord first, SuggestWord second) method with our own ranking. The author weighted search frequency against freq at 7:3.
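
A sketch of that 7:3 weighting. The nested `Word` class is a stand-in for Lucene's SuggestWord (which carries `score` and `freq`), extended with a hypothetical `searchFreq` field holding the logged search frequency:

```java
import java.util.Comparator;

public class WeightedSuggestComparator implements Comparator<WeightedSuggestComparator.Word> {
    // stand-in for org.apache.lucene.search.spell.SuggestWord, plus a logged search frequency
    public static class Word {
        int freq;        // occurrences in the index
        int searchFreq;  // hypothetical: how often users searched this word
        String string;
    }

    @Override
    public int compare(Word first, Word second) {
        // 70% weight on search frequency, 30% on index frequency
        double a = 0.7 * first.searchFreq + 0.3 * first.freq;
        double b = 0.7 * second.searchFreq + 0.3 * second.freq;
        return Double.compare(b, a); // higher combined score sorts first
    }
}
```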

Option 2:

Alternatively, we can build a dedicated keyword collection and implement suggestions with Solr prefix queries. Solr's copyField lets us index several fields (hanzi, pinyin, abbre) into one, and setting a field's multiValued attribute to true handles keywords whose heteronyms yield several pinyin combinations. The configuration:

schema.xml:

<field name="keyword" type="string" indexed="true" stored="true" />  
<field name="pinyin" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="abbre" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="kwfreq" type="int" indexed="true" stored="true" />
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="suggest" type="suggest_text" indexed="true" stored="false" multiValued="true" />

<!-- multiValued marks the field as multi-valued -->

<uniqueKey>keyword</uniqueKey>
<defaultSearchField>suggest</defaultSearchField>

<copyField source="keyword" dest="suggest" />
<copyField source="pinyin" dest="suggest" />
<copyField source="abbre" dest="suggest" />

<!-- the suggest_text field type -->

<fieldType name="suggest_text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" 
                synonyms="synonyms.txt" 
                ignoreCase="true" 
                expand="true" />
        <filter class="solr.StopFilterFactory" 
                ignoreCase="true" 
                words="stopwords.txt" 
                enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
</analyzer>
<analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.StopFilterFactory" 
                ignoreCase="true" 
                words="stopwords.txt" 
                enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
</analyzer>
</fieldType>
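
Once the collection is built, the suggestion lookup itself is just a prefix query against the suggest copy field, sorted by search frequency. A sketch of the request (the collection name and parameter values here are illustrative):

```text
/solr/keyword_collection/select?q=suggest:shej*&fl=keyword,kwfreq&sort=kwfreq desc&rows=10&wt=json
```

Because the hanzi, pinyin, and abbre fields are all copied into suggest, typing 舌尖, shej, or sj all reach the same keywords.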

SpellCheckComponent: Spelling Correction

The heart of spell checking is computing string similarity.

The Jaro distance of two given strings s1 and s2 is:

dj = 0 when m = 0; otherwise dj = (1/3) × (m/|s1| + m/|s2| + (m − t)/m)

  • m is the number of matching characters;

  • t is the number of transpositions.

Two characters, one from s1 and one from s2, are considered matching only when they are identical and no more than ⌊max(|s1|, |s2|)/2⌋ − 1 positions apart. The matched characters then determine the number of transpositions t: half the count of matched characters that appear in a different order. For example, every character of MARTHA matches one of MARHTA, but among those matches, T and H must swap places to turn MARTHA into MARHTA; they are out-of-order matches, so t = 2/2 = 1.

The Jaro distance of these two strings is therefore:

dj = (1/3) × (6/6 + 6/6 + (6 − 1)/6) ≈ 0.944

Jaro-Winkler then gives a higher score to strings that already agree at the start. It defines a prefix scale: given two strings whose common prefix has length ℓ, the Jaro-Winkler distance is dw = dj + ℓ × p × (1 − dj), where:

  • dj is the Jaro distance of the two strings

  • ℓ is the length of the common prefix, capped at 4

  • p is a constant scaling factor, at most 0.25 (otherwise dw could exceed 1); Winkler's standard value is 0.1

So the Jaro-Winkler distance of MARTHA and MARHTA above is:

dw = 0.944 + 3 × 0.1 × (1 − 0.944) ≈ 0.961

The definitions above come from Wikipedia:

http://en.wikipedia.org/wiki/Jaro-Winkler_distance
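
The two formulas translate almost directly into code. This standalone sketch follows the Wikipedia definition (not Lucene's internal implementation) and reproduces the MARTHA/MARHTA numbers above:

```java
public class JaroWinkler {
    /** Jaro distance per the definition above: match within the window, then count transpositions. */
    public static double jaro(String s1, String s2) {
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 && len2 == 0) return 1.0;
        int window = Math.max(Math.max(len1, len2) / 2 - 1, 0);
        boolean[] m1 = new boolean[len1], m2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {            // greedy left-to-right matching
            int hi = Math.min(len2 - 1, i + window);
            for (int j = Math.max(0, i - window); j <= hi; j++) {
                if (!m2[j] && s1.charAt(i) == s2.charAt(j)) {
                    m1[i] = m2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        int outOfOrder = 0, k = 0;                  // matched chars appearing in different order
        for (int i = 0; i < len1; i++) {
            if (!m1[i]) continue;
            while (!m2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches, t = outOfOrder / 2.0;   // t = half the out-of-order count
        return (m / len1 + m / len2 + (m - t) / m) / 3.0;
    }

    /** Jaro-Winkler: boost by common-prefix length l (capped at 4) times the scale p. */
    public static double jaroWinkler(String s1, String s2, double p) {
        double dj = jaro(s1, s2);
        int l = 0;
        while (l < 4 && l < s1.length() && l < s2.length() && s1.charAt(l) == s2.charAt(l)) l++;
        return dj + l * p * (1 - dj);
    }
}
```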

Solr ships with a built-in correction implementation, spellchecker.

Let's walk through the SpellChecker source code.

package org.apache.lucene.search.spell;

import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.AlreadyClosedException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefIterator;
import org.apache.lucene.util.Version;

public class SpellChecker implements Closeable {
/*
* DEFAULT_ACCURACY is the default minimum score.
* SpellCheck scores the similarity between each dictionary word and the
* user's search keyword; the default threshold is 0.5, with scores ranging
* from 0 to 1, larger meaning more similar.
*/
public static final float DEFAULT_ACCURACY = 0.5F;
public static final String F_WORD = "word";
//directory holding the spelling index
Directory spellIndex;

//boost for the leading ngram
private float bStart = 2.0F;
//boost for the trailing ngram
private float bEnd = 1.0F;
//ngram model: assumes the nth item depends only on the preceding n-1 items and on nothing else, so a sequence's probability is the product of its items' probabilities.
//in practice it simply splits a string into fixed-length terms: for "abcde", 3-grams give abc bcd cde, and 4-grams give abcd bcde
//the searcher over the spell index
private IndexSearcher searcher;
private final Object searcherLock = new Object();
private final Object modifyCurrentIndexLock = new Object();
private volatile boolean closed = false;
private float accuracy = 0.5F;
private StringDistance sd;
private Comparator<SuggestWord> comparator;

public SpellChecker(Directory spellIndex, StringDistance sd) throws IOException {
this(spellIndex, sd, SuggestWordQueue.DEFAULT_COMPARATOR);
}

public SpellChecker(Directory spellIndex) throws IOException {
this(spellIndex, new LevensteinDistance());
}

public SpellChecker(Directory spellIndex, StringDistance sd, Comparator<SuggestWord> comparator)
throws IOException {
setSpellIndex(spellIndex);
setStringDistance(sd);
this.comparator = comparator;
}

public void setSpellIndex(Directory spellIndexDir) throws IOException {
synchronized (this.modifyCurrentIndexLock) {
ensureOpen();
if (!DirectoryReader.indexExists(spellIndexDir)) {
IndexWriter writer = new IndexWriter(spellIndexDir,
new IndexWriterConfig(Version.LUCENE_CURRENT, null));

writer.close();
}
swapSearcher(spellIndexDir);
}
}

public void setComparator(Comparator<SuggestWord> comparator) {
this.comparator = comparator;
}

public Comparator<SuggestWord> getComparator() {
return this.comparator;
}

public void setStringDistance(StringDistance sd) {
this.sd = sd;
}

public StringDistance getStringDistance() {
return this.sd;
}

public void setAccuracy(float acc) {
this.accuracy = acc;
}

public float getAccuracy() {
return this.accuracy;
}

public String[] suggestSimilar(String word, int numSug) throws IOException {
return suggestSimilar(word, numSug, null, null, SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX);
}

public String[] suggestSimilar(String word, int numSug, float accuracy) throws IOException {
return suggestSimilar(word, numSug, null, null, SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX, accuracy);
}

public String[] suggestSimilar(String word, int numSug, IndexReader ir, String field, SuggestMode suggestMode)
throws IOException {
return suggestSimilar(word, numSug, ir, field, suggestMode, this.accuracy);
}

/*
* The core method
*/
public String[] suggestSimilar(String word, int numSug, IndexReader ir, String field, SuggestMode suggestMode,
float accuracy) throws IOException {
IndexSearcher indexSearcher = obtainSearcher();
try {
if ((ir == null) || (field == null)) {
//SuggestMode.SUGGEST_ALWAYS: always offer suggestions
suggestMode = SuggestMode.SUGGEST_ALWAYS;
}
if (suggestMode == SuggestMode.SUGGEST_ALWAYS) {
ir = null;
field = null;
}
int lengthWord = word.length();

int freq = (ir != null) && (field != null) ? ir.docFreq(new Term(field, word)) : 0;
int goalFreq = suggestMode == SuggestMode.SUGGEST_MORE_POPULAR ? freq : 0;
// freq > 0 means the user's keyword already exists in the index; in SUGGEST_WHEN_NOT_IN_INDEX mode, suggestions are offered only when the word is absent
if ((suggestMode == SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX) && (freq > 0)) {
return new String[] { word };
}
BooleanQuery query = new BooleanQuery();
for (int ng = getMin(lengthWord); ng <= getMax(lengthWord); ng++) {
String key = "gram" + ng;

String[] grams = formGrams(word, ng);
if (grams.length != 0) {
if (this.bStart > 0.0F) {
add(query, "start" + ng, grams[0], this.bStart);
}
if (this.bEnd > 0.0F) {
add(query, "end" + ng, grams[(grams.length - 1)], this.bEnd);
}
for (int i = 0; i < grams.length; i++) {
add(query, key, grams[i]);
}
}
}
int maxHits = 10 * numSug;

ScoreDoc[] hits = indexSearcher.search(query, null, maxHits).scoreDocs;

SuggestWordQueue sugQueue = new SuggestWordQueue(numSug, this.comparator);

int stop = Math.min(hits.length, maxHits);
SuggestWord sugWord = new SuggestWord();
for (int i = 0; i < stop; i++) {
sugWord.string = indexSearcher.doc(hits[i].doc).get("word");
if (!sugWord.string.equals(word)) {
sugWord.score = this.sd.getDistance(word, sugWord.string);
//similarity between the keyword and this indexed term
if (sugWord.score >= accuracy) {
if ((ir != null) && (field != null)) {
sugWord.freq = ir.docFreq(new Term(field, sugWord.string));
//skip the candidate if it is less popular than the goal frequency,
//or if it does not appear in the index at all
if (((suggestMode == SuggestMode.SUGGEST_MORE_POPULAR) && (goalFreq > sugWord.freq))
|| (sugWord.freq < 1)) {
continue;
}
}
//the candidate qualifies: push it into the suggestion queue.
//when the queue is full, the lowest-scoring entry is evicted and
//accuracy is raised to the smallest score still in the queue
sugQueue.insertWithOverflow(sugWord);
if (sugQueue.size() == numSug) {
accuracy = ((SuggestWord) sugQueue.top()).score;
}
sugWord = new SuggestWord();
}
}
}
String[] list = new String[sugQueue.size()];
for (int i = sugQueue.size() - 1; i >= 0; i--) {
list[i] = ((SuggestWord) sugQueue.pop()).string;
}
return list;
} finally {
releaseSearcher(indexSearcher);
}
}

private static void add(BooleanQuery q, String name, String value, float boost) {
Query tq = new TermQuery(new Term(name, value));
tq.setBoost(boost);
q.add(new BooleanClause(tq, BooleanClause.Occur.SHOULD));
}

private static void add(BooleanQuery q, String name, String value) {
q.add(new BooleanClause(new TermQuery(new Term(name, value)), BooleanClause.Occur.SHOULD));
}

/*
* splits text into ngram terms of length ng
*/

private static String[] formGrams(String text, int ng) {
int len = text.length();
String[] res = new String[len - ng + 1];
for (int i = 0; i < len - ng + 1; i++) {
res[i] = text.substring(i, i + ng);
}
return res;
}

public void clearIndex() throws IOException {
synchronized (this.modifyCurrentIndexLock) {
ensureOpen();
Directory dir = this.spellIndex;
IndexWriter writer = new IndexWriter(dir,
new IndexWriterConfig(Version.LUCENE_CURRENT, null).setOpenMode(IndexWriterConfig.OpenMode.CREATE));

writer.close();
swapSearcher(dir);
}
}

public boolean exist(String word) throws IOException {
IndexSearcher indexSearcher = obtainSearcher();
try {
return indexSearcher.getIndexReader().docFreq(new Term("word", word)) > 0;
} finally {
releaseSearcher(indexSearcher);
}
}

/*
* This one is trickier to follow
*/
public final void indexDictionary(Dictionary dict, IndexWriterConfig config, boolean fullMerge)
throws IOException
{
synchronized (this.modifyCurrentIndexLock)
{
ensureOpen();
Directory dir = this.spellIndex;
IndexWriter writer = new IndexWriter(dir, config);
IndexSearcher indexSearcher = obtainSearcher();
List<TermsEnum> termsEnums = new ArrayList<>();
//reader over the current spell index
IndexReader reader = this.searcher.getIndexReader();
if (reader.maxDoc() > 0) {
//collect a TermsEnum over the "word" field from every index segment
for (AtomicReaderContext ctx : reader.leaves())
{
Terms terms = ctx.reader().terms("word");

if (terms != null) {
termsEnums.add(terms.iterator(null));
}
}
}
boolean isEmpty = termsEnums.isEmpty();
try
{
//load the dictionary entries
BytesRefIterator iter = dict.getEntryIterator();
BytesRef currentTerm;
//iterate over every word in the dictionary
terms:
while ((currentTerm = iter.next()) != null)
{
String word = currentTerm.utf8ToString();
int len = word.length();
if (len < 3) {
//too short to build useful ngrams
continue;
}
if (!isEmpty) {
//skip words that already exist in the spell index
for (TermsEnum te : termsEnums) {
if (te.seekExact(currentTerm)) {
continue terms;
}
}
}
//split the word into ngrams via createDocument and
//write the dictionary word into the spell index
Document doc = createDocument(word, getMin(len), getMax(len));
writer.addDocument(doc);
}
}
finally
{
releaseSearcher(indexSearcher);
}
if (fullMerge) {
writer.forceMerge(1);
}
writer.close();

swapSearcher(dir);
}
}

private static int getMin(int l) {
if (l > 5) {
return 3;
}
if (l == 5) {
return 2;
}
return 1;
}

private static int getMax(int l) {
if (l > 5) {
return 4;
}
if (l == 5) {
return 3;
}
return 2;
}

private static Document createDocument(String text, int ng1, int ng2) {
Document doc = new Document();

Field f = new StringField("word", text, Field.Store.YES);
doc.add(f);
addGram(text, doc, ng1, ng2);
return doc;
}

private static void addGram(String text, Document doc, int ng1, int ng2) {
int len = text.length();
for (int ng = ng1; ng <= ng2; ng++) {
String key = "gram" + ng;
String end = null;
for (int i = 0; i < len - ng + 1; i++) {
String gram = text.substring(i, i + ng);
FieldType ft = new FieldType(StringField.TYPE_NOT_STORED);
ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
Field ngramField = new Field(key, gram, ft);

doc.add(ngramField);
if (i == 0) {
Field startField = new StringField("start" + ng, gram, Field.Store.NO);
doc.add(startField);
}
end = gram;
}
if (end != null) {
Field endField = new StringField("end" + ng, end, Field.Store.NO);
doc.add(endField);
}
}
}

private IndexSearcher obtainSearcher() {
synchronized (this.searcherLock) {
ensureOpen();
this.searcher.getIndexReader().incRef();
return this.searcher;
}
}

private void releaseSearcher(IndexSearcher aSearcher) throws IOException {
aSearcher.getIndexReader().decRef();
}

private void ensureOpen() {
if (this.closed) {
throw new AlreadyClosedException("Spellchecker has been closed");
}
}

public void close() throws IOException {
synchronized (this.searcherLock) {
ensureOpen();
this.closed = true;
if (this.searcher != null) {
this.searcher.getIndexReader().close();
}
this.searcher = null;
}
}

private void swapSearcher(Directory dir) throws IOException {
IndexSearcher indexSearcher = createSearcher(dir);
synchronized (this.searcherLock) {
if (this.closed) {
indexSearcher.getIndexReader().close();
throw new AlreadyClosedException("Spellchecker has been closed");
}
if (this.searcher != null) {
this.searcher.getIndexReader().close();
}
this.searcher = indexSearcher;
this.spellIndex = dir;
}
}
IndexSearcher createSearcher(Directory dir) throws IOException {
return new IndexSearcher(DirectoryReader.open(dir));
}

boolean isClosed() {
return this.closed;
}
}

With the above in place, we have a retrieval system that meets the requirements. From its query logs we can then mine hot topics and push them to users by category, guided by their user profiles.
