1. Some problems with highlighting
Elasticsearch provides three highlighter types, and in a previous article we already took a brief look at how Elasticsearch highlighting works. Highlighting is closely tied to the type of query being used; one key aspect is the rewriting of multi-term queries such as wildcard and prefix. Because both the query itself and the highlighter rewrite the query, and the two may use different rewrite mechanisms, you can run into the following situation:
For the same query, the unified and fvh highlighters return different highlights; the fvh highlighter may even return no highlight information at all.
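For example, the following two requests differ only in the highlighter type (my_index is a hypothetical index whose text field has term vectors enabled, used purely for illustration); on a field with many distinct terms, the unified request may return highlighted fragments while the fvh request returns none:
GET my_index/_search
{
  "query": { "wildcard": { "text": { "value": "m*" } } },
  "highlight": { "fields": { "text": { "type": "unified" } } }
}

GET my_index/_search
{
  "query": { "wildcard": { "text": { "value": "m*" } } },
  "highlight": { "fields": { "text": { "type": "fvh" } } }
}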
2. Test environment
Elasticsearch 8.0
PUT highlight_test
{
"mappings": {
"properties": {
"text":{
"type": "text",
"term_vector": "with_positions_offsets"
}
}
},
"settings": {
"number_of_replicas":0,
"number_of_shards": 1
}
}
PUT highlight_test/_doc/1
{
"name":"mango",
"text":"my name is mongo, i am test hightlight in elastic search"
}
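As a quick sanity check (an extra step, not required for the rest of the analysis), the stored term vectors of the test document can be inspected. With the default standard analyzer the terms of the text field are my, name, is, mongo, i, am, test, hightlight, in, elastic and search, of which only my and mongo match the pattern m* used below:
GET highlight_test/_termvectors/1?fields=text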
3. A brief introduction to multi-term query rewriting
A multi-term query is one whose query string does not contain explicit terms; Elasticsearch has to rewrite the query in order to expand it into concrete terms. The following query types involve multi-term rewriting:
fuzzy
prefix
query_string
regexp
wildcard
All of the above queries support the rewrite parameter, and are ultimately rewritten into a bool query or a bitset.
Query rewriting mainly affects the following aspects:
which terms the rewrite extracts, and how many of them;
how relevance is computed for the extracted terms.
The rewrite supports the following options (an example of setting the parameter follows this list):
constant_score, the default: if only a few terms need to be extracted, the query is rewritten into a bool query; otherwise all matching terms are extracted and the query is rewritten into a bitset. The boost parameter is used directly as the document score; for term-level queries the default boost is 1.
constant_score_boolean: rewrites the query into a bool query and uses the boost parameter as the document score. It is subject to the indices.query.bool.max_clause_count limit, so by default at most 1024 terms are extracted.
scoring_boolean: rewrites the query into a bool query and computes a relevance score for each matching document. It is subject to the indices.query.bool.max_clause_count limit, so by default at most 1024 terms are extracted.
top_terms_blended_freqs_N: extracts the top N highest-scoring terms and rewrites the query into a bool query. This option is not subject to the indices.query.bool.max_clause_count limit. Matching documents are scored as if all matching terms had the same frequency, namely the maximum frequency among them.
top_terms_boost_N: extracts the top N highest-scoring terms and rewrites the query into a bool query. This option is not subject to the indices.query.bool.max_clause_count limit. The boost parameter is used directly as the document score.
top_terms_N: extracts the top N highest-scoring terms and rewrites the query into a bool query. This option is not subject to the indices.query.bool.max_clause_count limit. A relevance score is computed for each matching document.
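As an illustration (a sketch against the highlight_test index above; the value top_terms_4 is an arbitrary choice), the rewrite parameter is set directly inside the multi-term query:
GET highlight_test/_search
{
  "query": {
    "wildcard": {
      "text": {
        "value": "m*",
        "rewrite": "top_terms_4"
      }
    }
  }
}
With a top_terms_N rewrite only the N best terms are kept, so documents that match only the discarded expansions of m* would no longer be returned.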
4. Analysis of wildcard query rewriting
Let's use the Elasticsearch source to look at the rewrite logic for the following query:
{
"query":{
"wildcard":{
"text":{
"value":"m*"
}
}
}
}
Based on the mapped field type of the queried field, a WildcardQuery is built, together with the MultiTermQuery.RewriteMethod corresponding to the rewrite setting in the query:
//WildcardQueryBuilder.java
@Override
protected Query doToQuery(SearchExecutionContext context) throws IOException {
MappedFieldType fieldType = context.getFieldType(fieldName);
if (fieldType == null) {
throw new IllegalStateException("Rewrite first");
}
MultiTermQuery.RewriteMethod method = QueryParsers.parseRewriteMethod(rewrite, null, LoggingDeprecationHandler.INSTANCE);
return fieldType.wildcardQuery(value, method, caseInsensitive, context);
}
parseRewriteMethod looks up the MultiTermQuery.RewriteMethod matching the rewrite value configured in the query; since we did not set the rewrite parameter in the wildcard query, it simply returns null here:
//QueryParsers.java
public static MultiTermQuery.RewriteMethod parseRewriteMethod(
@Nullable String rewriteMethod,
@Nullable MultiTermQuery.RewriteMethod defaultRewriteMethod,
DeprecationHandler deprecationHandler
) {
if (rewriteMethod == null) {
return defaultRewriteMethod;
}
if (CONSTANT_SCORE.match(rewriteMethod, deprecationHandler)) {
return MultiTermQuery.CONSTANT_SCORE_REWRITE;
}
if (SCORING_BOOLEAN.match(rewriteMethod, deprecationHandler)) {
return MultiTermQuery.SCORING_BOOLEAN_REWRITE;
}
if (CONSTANT_SCORE_BOOLEAN.match(rewriteMethod, deprecationHandler)) {
return MultiTermQuery.CONSTANT_SCORE_BOOLEAN_REWRITE;
}
int firstDigit = -1;
for (int i = 0; i < rewriteMethod.length(); ++i) {
if (Character.isDigit(rewriteMethod.charAt(i))) {
firstDigit = i;
break;
}
}
if (firstDigit >= 0) {
final int size = Integer.parseInt(rewriteMethod.substring(firstDigit));
String rewriteMethodName = rewriteMethod.substring(0, firstDigit);
if (TOP_TERMS.match(rewriteMethodName, deprecationHandler)) {
return new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(size);
}
if (TOP_TERMS_BOOST.match(rewriteMethodName, deprecationHandler)) {
return new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(size);
}
if (TOP_TERMS_BLENDED_FREQS.match(rewriteMethodName, deprecationHandler)) {
return new MultiTermQuery.TopTermsBlendedFreqScoringRewrite(size);
}
}
throw new IllegalArgumentException("Failed to parse rewrite_method [" + rewriteMethod + "]");
}
WildcardQuery extends MultiTermQuery, whose rewrite method simply delegates to the configured RewriteMethod; since we did not set the rewrite parameter in the wildcard query, the default CONSTANT_SCORE_REWRITE is used:
//MultiTermQuery.java
protected RewriteMethod rewriteMethod = CONSTANT_SCORE_REWRITE;
@Override
public final Query rewrite(IndexReader reader) throws IOException {
return rewriteMethod.rewrite(reader, this);
}
As we can see, CONSTANT_SCORE_REWRITE is an anonymous class whose rewrite method returns a MultiTermQueryConstantScoreWrapper instance:
//MultiTermQuery.java
public static final RewriteMethod CONSTANT_SCORE_REWRITE =
new RewriteMethod() {
@Override
public Query rewrite(IndexReader reader, MultiTermQuery query) {
return new MultiTermQueryConstantScoreWrapper<>(query);
}
};
In the following method, the terms of the queried field are obtained first;
then query.getTermsEnum returns an enumeration of all terms that match the query;
finally, the return value of collectTerms decides whether a bool query or a bit set is built:
//MultiTermQueryConstantScoreWrapper.java
private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
final Terms terms = context.reader().terms(query.field);
if (terms == null) {
// field does not exist
return new WeightOrDocIdSet((DocIdSet) null);
}
final TermsEnum termsEnum = query.getTermsEnum(terms);
assert termsEnum != null;
PostingsEnum docs = null;
final List<TermAndState> collectedTerms = new ArrayList<>();
if (collectTerms(context, termsEnum, collectedTerms)) {
// build a boolean query
BooleanQuery.Builder bq = new BooleanQuery.Builder();
for (TermAndState t : collectedTerms) {
final TermStates termStates = new TermStates(searcher.getTopReaderContext());
termStates.register(t.state, context.ord, t.docFreq, t.totalTermFreq);
bq.add(new TermQuery(new Term(query.field, t.term), termStates), Occur.SHOULD);
}
Query q = new ConstantScoreQuery(bq.build());
final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
return new WeightOrDocIdSet(weight);
}
// Too many terms: go back to the terms we already collected and start building the bit set
DocIdSetBuilder builder = new DocIdSetBuilder(context.reader().maxDoc(), terms);
if (collectedTerms.isEmpty() == false) {
TermsEnum termsEnum2 = terms.iterator();
for (TermAndState t : collectedTerms) {
termsEnum2.seekExact(t.term, t.state);
docs = termsEnum2.postings(docs, PostingsEnum.NONE);
builder.add(docs);
}
}
// Then keep filling the bit set with remaining terms
do {
docs = termsEnum.postings(docs, PostingsEnum.NONE);
builder.add(docs);
} while (termsEnum.next() != null);
return new WeightOrDocIdSet(builder.build());
}
By default, collectTerms collects at most the first 16 matching terms; if the term enumeration is not exhausted within that threshold it returns false, and the caller falls back to building a bit set:
//MultiTermQueryConstantScoreWrapper.java
private static final int BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD = 16;
private boolean collectTerms(
LeafReaderContext context, TermsEnum termsEnum, List<TermAndState> terms)
throws IOException {
final int threshold =
Math.min(BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD, IndexSearcher.getMaxClauseCount());
for (int i = 0; i < threshold; ++i) {
final BytesRef term = termsEnum.next();
if (term == null) {
return true;
}
TermState state = termsEnum.termState();
terms.add(
new TermAndState(
BytesRef.deepCopyOf(term),
state,
termsEnum.docFreq(),
termsEnum.totalTermFreq()));
}
return termsEnum.next() == null;
}
From the above analysis we can see that, by default, a wildcard query takes into account every term of the field that matches the query.
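Concretely, for our test document only the terms my and mongo of the text field match m* (assuming the default standard analyzer), which is far below the 16-term threshold, so collectTerms returns true and the rewrite produces roughly the equivalent of the following constant-score bool query:
GET highlight_test/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            { "term": { "text": "my" } },
            { "term": { "text": "mongo" } }
          ]
        }
      }
    }
  }
}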
5. Wildcard query rewriting in the fvh highlighter
For multi-term queries, extracting the query terms is an essential step of the highlighting logic.
Let's use the following highlight request to analyze how the query is rewritten while the highlighter extracts the query terms:
{
"query":{
"wildcard":{
"text":{
"value":"m*"
}
}
},
"highlight":{
"fields":{
"text":{
"type":"fvh"
}
}
}
}
By default only fields that actually match the query are highlighted (require_field_match), in which case a CustomFieldQuery is built:
//FastVectorHighlighter.java
if (field.fieldOptions().requireFieldMatch()) {
/*
* we use top level reader to rewrite the query against all readers,
* with use caching it across hits (and across readers...)
*/
entry.fieldMatchFieldQuery = new CustomFieldQuery(
fieldContext.query,
hitContext.topLevelReader(),
true,
field.fieldOptions().requireFieldMatch()
);
}
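As a hedged aside (not part of the original walkthrough), require_field_match can be disabled per request, in which case the other branch of this check is taken and the highlighter builds a field query that does not restrict term matching to the queried field:
GET highlight_test/_search
{
  "query": { "wildcard": { "text": { "value": "m*" } } },
  "highlight": {
    "require_field_match": false,
    "fields": { "text": { "type": "fvh" } }
  }
}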
Calling flatten yields the rewritten flatQueries, in which each extracted term has been rewritten as a BoostQuery:
//FieldQuery.java
public FieldQuery(Query query, IndexReader reader, boolean phraseHighlight, boolean fieldMatch)
throws IOException {
this.fieldMatch = fieldMatch;
Set<Query> flatQueries = new LinkedHashSet<>();
flatten(query, reader, flatQueries, 1f);
saveTerms(flatQueries, reader);
Collection<Query> expandQueries = expand(flatQueries);
for (Query flatQuery : expandQueries) {
QueryPhraseMap rootMap = getRootMap(flatQuery);
rootMap.add(flatQuery, reader);
float boost = 1f;
while (flatQuery instanceof BoostQuery) {
BoostQuery bq = (BoostQuery) flatQuery;
flatQuery = bq.getQuery();
boost *= bq.getBoost();
}
if (!phraseHighlight && flatQuery instanceof PhraseQuery) {
PhraseQuery pq = (PhraseQuery) flatQuery;
if (pq.getTerms().length > 1) {
for (Term term : pq.getTerms()) rootMap.addTerm(term, boost);
}
}
}
}
Since WildcardQuery is a subclass of MultiTermQuery, the flatten method rewrites it directly with MultiTermQuery.TopTermsScoringBooleanQueryRewrite, whose top N here is MAX_MTQ_TERMS = 1024:
//FieldQuery.java
private static final int MAX_MTQ_TERMS = 1024;
protected void flatten(
Query sourceQuery, IndexReader reader, Collection<Query> flatQueries, float boost)
throws IOException {
..................................
..................................
else if (reader != null) {
Query query = sourceQuery;
Query rewritten;
if (sourceQuery instanceof MultiTermQuery) {
rewritten =
new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(MAX_MTQ_TERMS)
.rewrite(reader, (MultiTermQuery) query);
} else {
rewritten = query.rewrite(reader);
}
if (rewritten != query) {
// only rewrite once and then flatten again - the rewritten query could have a speacial
// treatment
// if this method is overwritten in a subclass.
flatten(rewritten, reader, flatQueries, boost);
}
// if the query is already rewritten we discard it
}
// else discard queries
}
Here the number of terms to extract is first computed from the configured size and getMaxSize() (IndexSearcher.getMaxClauseCount(), 1024 by default), which in our case ends up being 1024.
The implementation of the anonymous TermCollector subclass passed to collectTerms is omitted here; it is the part that actually limits the number of extracted terms:
//TopTermsRewrite.java
@Override
public final Query rewrite(final IndexReader reader, final MultiTermQuery query)
throws IOException {
final int maxSize = Math.min(size, getMaxSize());
final PriorityQueue<ScoreTerm> stQueue = new PriorityQueue<>();
collectTerms(
reader,
query,
new TermCollector() {
................
});
.............
return build(b);
}
This first obtains the full term dictionary of the queried field, then enumerates the terms that match the query, and finally lets the supplied collector collect them:
//TermCollectingRewrite.java
final void collectTerms(IndexReader reader, MultiTermQuery query, TermCollector collector)
throws IOException {
IndexReaderContext topReaderContext = reader.getContext();
for (LeafReaderContext context : topReaderContext.leaves()) {
final Terms terms = context.reader().terms(query.field);
if (terms == null) {
// field does not exist
continue;
}
final TermsEnum termsEnum = getTermsEnum(query, terms, collector.attributes);
assert termsEnum != null;
if (termsEnum == TermsEnum.EMPTY) continue;
collector.setReaderContext(topReaderContext, context);
collector.setNextEnum(termsEnum);
BytesRef bytes;
while ((bytes = termsEnum.next()) != null) {
if (!collector.collect(bytes))
return; // interrupt whole term collection, so also don't iterate other subReaders
}
}
}
Here the collector makes sure that the number of collected matching terms never exceeds maxSize:
//TopTermsRewrite.java
@Override
public boolean collect(BytesRef bytes) throws IOException {
final float boost = boostAtt.getBoost();
// make sure within a single seg we always collect
// terms in order
assert compareToLastTerm(bytes);
// System.out.println("TTR.collect term=" + bytes.utf8ToString() + " boost=" + boost + "
// ord=" + readerContext.ord);
// ignore uncompetitive hits
if (stQueue.size() == maxSize) {
final ScoreTerm t = stQueue.peek();
if (boost < t.boost) return true;
if (boost == t.boost && bytes.compareTo(t.bytes.get()) > 0) return true;
}
ScoreTerm t = visitedTerms.get(bytes);
final TermState state = termsEnum.termState();
assert state != null;
if (t != null) {
// if the term is already in the PQ, only update docFreq of term in PQ
assert t.boost == boost : "boost should be equal in all segment TermsEnums";
t.termState.register(
state, readerContext.ord, termsEnum.docFreq(), termsEnum.totalTermFreq());
} else {
// add new entry in PQ, we must clone the term, else it may get overwritten!
st.bytes.copyBytes(bytes);
st.boost = boost;
visitedTerms.put(st.bytes.get(), st);
assert st.termState.docFreq() == 0;
st.termState.register(
state, readerContext.ord, termsEnum.docFreq(), termsEnum.totalTermFreq());
stQueue.offer(st);
// possibly drop entries from queue
if (stQueue.size() > maxSize) {
st = stQueue.poll();
visitedTerms.remove(st.bytes.get());
st.termState.clear(); // reset the termstate!
} else {
st = new ScoreTerm(new TermStates(topReaderContext));
}
assert stQueue.size() <= maxSize : "the PQ size must be limited to maxSize";
// set maxBoostAtt with values to help FuzzyTermsEnum to optimize
if (stQueue.size() == maxSize) {
t = stQueue.peek();
maxBoostAtt.setMaxNonCompetitiveBoost(t.boost);
maxBoostAtt.setCompetitiveTerm(t.bytes.get());
}
}
return true;
}
From the above analysis we can see that for multi-term queries the fvh highlighter always rewrites with MultiTermQuery.TopTermsScoringBooleanQueryRewrite and extracts at most 1024 query terms.
6. Root-cause analysis of the highlighting problem caused by rewriting
From the analysis of query-time and highlight-time rewriting above, we know that by default:
the query phase takes all terms that match the query into account, and this behavior can be customized via the rewrite parameter;
the highlight phase only extracts the top 1024 of the matching terms, and this behavior is not controlled by the rewrite parameter.
If the queried field is a large text field with a huge number of distinct terms, the terms that occur in a given matching document may not be among those 1024; the document then matches the query but comes back without any highlight information.
7. Solutions
- Make the query terms more specific so that fewer terms match, for example by providing more characters in the pattern (see the sketches after this list);
- Replace the multi-term query with a different query type.
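A minimal sketch of both options against the test index (the longer pattern mon* and the plain match query on mongo are illustrative choices, not the only possible replacements):
# Option 1: a more specific pattern expands to fewer terms
GET highlight_test/_search
{
  "query": { "wildcard": { "text": { "value": "mon*" } } },
  "highlight": { "fields": { "text": { "type": "fvh" } } }
}

# Option 2: an explicit term avoids multi-term rewriting entirely
GET highlight_test/_search
{
  "query": { "match": { "text": "mongo" } },
  "highlight": { "fields": { "text": { "type": "fvh" } } }
}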