Sphinx Ranking Modes (Translation)

Posted by Yi_Zhi_Yu on 2015-03-04

My English is poor, so please bear with me.

Search results ranking

Ranking overview

Ranking (aka weighting) of the search results can be defined as a
process of computing a so-called relevance (aka weight) for every
given matched document with regard to a given query that matched it.
So relevance is in the end just a number attached to every document
that estimates how relevant the document is to the query. Search
results can then be sorted based on this number and/or some additional
parameters, so that the most sought after results would come up higher
on the results page.

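Ranking (also called weighting) is the process of computing a so-called relevance (also called weight) for each result that a query matched. Relevance is simply a number attached to each matched document that estimates how closely the document relates to the query's keywords. Search results can then be sorted by this number and by other additional parameters, so that the most relevant results come first.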

There is no single standard one-size-fits-all way to rank any document
in any scenario. Moreover, there cannot ever be such a way, because
relevance is subjective. As in, what seems relevant to you might not
seem relevant to me. Hence, in the general case it's not just hard to
compute, it's theoretically impossible.

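No ranking standard fits every situation in every scenario, and one could even say that no such standard can exist, because relevance is highly subjective: something strongly relevant to you may not be relevant to me at all. So in general it is not merely hard to compute; in theory it is impossible.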

So ranking in Sphinx is configurable. It has a notion of a so-called
ranker. A ranker can formally be defined as a function that takes
document and query as its input and produces a relevance value as
output. In layman's terms, a ranker controls exactly how (using which
specific algorithm) Sphinx will assign weights to the documents.

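In Sphinx, ranking is configurable. It has the concept of a ranker: by definition, a function that takes the matched document and the query as input and outputs a relevance value. In short, a ranker determines exactly how the relevance value of each document is computed.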
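To make the "ranker as a function" idea concrete, here is a minimal Python sketch. It is not Sphinx code: the `Ranker` type, the toy `keyword_hits` ranker, and the `rank` helper are all hypothetical names invented for illustration. It only shows a function from (document, query) to a number, and results sorted by that number.

```python
from typing import Callable, List, Tuple

# A ranker maps (document, query) to a numeric relevance value.
# This type alias is a hypothetical illustration, not a Sphinx API.
Ranker = Callable[[str, str], float]


def keyword_hits(document: str, query: str) -> float:
    """Toy ranker: count how many document words are query keywords."""
    keywords = set(query.lower().split())
    return float(sum(1 for word in document.lower().split() if word in keywords))


def rank(documents: List[str], query: str, ranker: Ranker) -> List[Tuple[float, str]]:
    """Attach a weight to every document, then sort by weight, highest first."""
    weighted = [(ranker(doc, query), doc) for doc in documents]
    return sorted(weighted, key=lambda pair: pair[0], reverse=True)


docs = ["central park", "hyde park cafe"]
results = rank(docs, "hyde park", keyword_hits)
```

Swapping in a different function for `keyword_hits` changes the ordering without touching the matching or sorting code, which is the point of making the ranker configurable.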

Previously, this ranking function was rigidly bound to the matching
mode. So in the legacy matching modes (that is, SPH_MATCH_ALL,
SPH_MATCH_ANY, SPH_MATCH_PHRASE, and SPH_MATCH_BOOLEAN) you cannot
choose the ranker. You can only do that in the SPH_MATCH_EXTENDED
mode. (Which is the only mode in SphinxQL and the suggested mode in
SphinxAPI anyway.) To choose a non-default ranker you can either use
SetRankingMode() with SphinxAPI, or the OPTION ranker clause in the
SELECT statement when using SphinxQL.

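Previously, the ranking function was rigidly bound to the matching mode, so in the legacy matching modes (SPH_MATCH_ALL, SPH_MATCH_ANY, SPH_MATCH_PHRASE, and SPH_MATCH_BOOLEAN) you cannot choose the ranker. You can choose one only in SPH_MATCH_EXTENDED mode, which is the only mode in SphinxQL and the recommended mode in SphinxAPI. To select a non-default ranker, call SetRankingMode() in SphinxAPI or set the ranker option in SphinxQL.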

As a sidenote, legacy matching modes are internally implemented via
the unified syntax anyway. When you use one of those modes, Sphinx
just internally adjusts the query and sets the associated ranker, then
executes the query using the very same unified code path.

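Note that the legacy matching modes are implemented internally through the unified syntax. When you use one of them, Sphinx just adjusts the query internally, sets the corresponding ranker, and then runs the query through the same code path.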

Available built-in rankers

Sphinx ships with a number of built-in rankers suited for different
purposes. A number of them use two factors, phrase proximity (aka
LCS) and BM25. Phrase proximity works on the keyword positions, while
BM25 works on the keyword frequencies. Basically, the better the
degree of the phrase match between the document body and the query,
the higher is the phrase proximity (it maxes out when the document
contains the entire query as a verbatim quote). And BM25 is higher
when the document contains more rare words. We'll save the detailed
discussion for later.

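Sphinx ships with a set of built-in rankers for different purposes. Several of them rely on two factors: phrase proximity (also called LCS) and BM25. Phrase proximity depends on keyword positions, and BM25 depends on keyword frequencies. Basically, the closer the document comes to the query as a phrase, the higher the phrase proximity (it peaks when the document contains the whole query verbatim). BM25 is higher when the document contains more rare words. The details are discussed later.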
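The two factors can be sketched in a few lines of Python. This is a simplified illustration, not the formulas Sphinx uses internally: `phrase_proximity` is a hypothetical helper that measures the longest run of query words appearing contiguously and in order in a field, and `idf` is the textbook inverse document frequency that BM25-style scoring builds on.

```python
import math


def phrase_proximity(query: str, field_text: str) -> int:
    """Length, in words, of the longest run of query words that appears
    contiguously and in the same order in the field (a simplified LCS)."""
    q = query.lower().split()
    d = field_text.lower().split()
    best = 0
    # prev[j] = length of the common run ending at q[i-2] and d[j-1]
    prev = [0] * (len(d) + 1)
    for i in range(1, len(q) + 1):
        curr = [0] * (len(d) + 1)
        for j in range(1, len(d) + 1):
            if q[i - 1] == d[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def idf(total_docs: int, docs_with_term: int) -> float:
    """Textbook inverse document frequency: rarer terms score higher."""
    return math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0)
```

With these definitions, `phrase_proximity("hyde park", "the hyde park cafe")` reaches the query length because the whole query appears verbatim, while `idf(1000, 5)` is larger than `idf(1000, 500)` because a term found in 5 documents is rarer than one found in 500.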

Currently implemented rankers are:

The rankers currently built in are:

1.SPH_RANK_PROXIMITY_BM25, the default ranking mode that uses and
combines both phrase proximity and BM25 ranking.

1.SPH_RANK_PROXIMITY_BM25, the default ranker, which combines the two factors phrase proximity and BM25.

2.SPH_RANK_BM25, statistical ranking mode which uses BM25 ranking only
(similar to most other full-text engines). This mode is faster but may
result in worse quality on queries which contain more than 1 keyword.

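2.SPH_RANK_BM25, the mode that uses only the BM25 factor (similar to most other full-text engines). It is faster, but result quality may be worse for queries that contain more than one keyword (the more keyword occurrences, the higher the score, whereas often what we want most is simply an exact match).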

3.SPH_RANK_NONE, no ranking mode. This mode is obviously the fastest. A
weight of 1 is assigned to all matches. This is sometimes called
boolean searching that just matches the documents but does not rank
them.

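3.SPH_RANK_NONE, the mode with no ranking at all. It is obviously the fastest. Every matched document gets a weight of 1. This is sometimes called boolean searching: it only matches documents and does not rank them.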

4.SPH_RANK_WORDCOUNT, ranking by the keyword occurrences count. This
ranker computes the per-field keyword occurrence counts, then
multiplies them by field weights, and sums the resulting values.

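4.SPH_RANK_WORDCOUNT, ranking by the number of keyword occurrences. The ranker counts the keyword occurrences in each field, multiplies each count by that field's weight, and adds up the results.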
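The word-count computation described above is simple enough to write out. The Python sketch below is an illustration only: the function name, the dictionary inputs, and the choice to treat a missing field weight as 1 are assumptions of this example, not part of Sphinx.

```python
from typing import Dict


def wordcount_weight(field_hits: Dict[str, int], field_weights: Dict[str, int]) -> int:
    """Sum of (keyword occurrences in a field) * (that field's weight)."""
    # Fields without an explicit weight count with weight 1 (an assumption of this sketch).
    return sum(hits * field_weights.get(field, 1) for field, hits in field_hits.items())


hits = {"title": 2, "body": 5}      # keyword occurrences per field
weights = {"title": 10, "body": 1}  # user-assigned field weights
weight = wordcount_weight(hits, weights)  # 2*10 + 5*1 = 25
```

Raising the weight of `title` makes title occurrences dominate the total, which is how field weights steer this ranker.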

5.SPH_RANK_PROXIMITY, added in version 0.9.9-rc1, returns raw phrase
proximity value as a result. This mode is internally used to emulate
SPH_MATCH_ALL queries.

5.SPH_RANK_PROXIMITY, which returns the raw phrase proximity between each document and the query. It is used internally to rank SPH_MATCH_ALL queries.

6.SPH_RANK_MATCHANY, added in version 0.9.9-rc1, returns rank as it was
computed in SPH_MATCH_ANY mode earlier, and is internally used to
emulate SPH_MATCH_ANY queries.

6.SPH_RANK_MATCHANY, which returns the rank as it was previously computed in SPH_MATCH_ANY mode. It is used internally for SPH_MATCH_ANY queries.

7.SPH_RANK_FIELDMASK, added in version 0.9.9-rc2, returns a 32-bit mask
with N-th bit corresponding to N-th fulltext field, numbering from 0.
The bit will only be set when the respective field has any keyword
occurrences satisfying the query.

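7.SPH_RANK_FIELDMASK, which returns a 32-bit mask in which each bit corresponds to one full-text field, numbered from 0 (unused bits are presumably left at zero). A bit is set to 1 only when the corresponding field contains keyword occurrences that satisfy the query.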
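A field mask like the one described above can be built and decoded with plain bit operations. The following Python sketch is illustrative; `field_mask` and `matched_fields` are hypothetical helper names, and the input is assumed to be a per-field list of booleans.

```python
from typing import List


def field_mask(fields_with_hits: List[bool]) -> int:
    """Build a 32-bit mask: bit N is set when field N has a matching keyword."""
    if len(fields_with_hits) > 32:
        raise ValueError("a 32-bit mask can describe at most 32 fields")
    mask = 0
    for n, has_hit in enumerate(fields_with_hits):
        if has_hit:
            mask |= 1 << n
    return mask


def matched_fields(mask: int) -> List[int]:
    """Decode a mask back into the list of matching field numbers."""
    return [n for n in range(32) if mask & (1 << n)]
```

For example, matches in fields 0 and 2 of a three-field index give the mask `0b101`, and decoding that mask returns the field numbers 0 and 2.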

8.SPH_RANK_SPH04, added in version 1.10-beta, is generally based on the
default SPH_RANK_PROXIMITY_BM25 ranker, but additionally boosts the
matches when they occur in the very beginning or the very end of a
text field. Thus, if a field equals the exact query, SPH04 should rank
it higher than a field that contains the exact query but is not equal
to it. (For instance, when the query is “Hyde Park”, a document
entitled “Hyde Park” should be ranked higher than one entitled “Hyde
Park, London” or “The Hyde Park Cafe”.)

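8.SPH_RANK_SPH04, based on the default SPH_RANK_PROXIMITY_BM25 mode, except that a match at the very beginning or the very end of a text field raises the document's relevance. So a field that exactly equals the query should rank above a field that merely contains it. (For example, for the query “Hyde Park”, a document titled “Hyde Park” ranks higher than “Hyde Park, London” or “The Hyde Park Cafe”.)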

9.SPH_RANK_EXPR, added in version 2.0.2-beta, lets you specify the
ranking formula at run time. It exposes a number of internal text
factors and lets you define how the final weight should be computed
from those factors.

9.SPH_RANK_EXPR, which lets you specify the ranking formula at run time. It exposes a number of internal text factors and lets you compute the final weight from them.
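As an illustration of what such a formula computes, the Python sketch below evaluates `sum(lcs*user_weight)*1000+bm25` for one document. To the best of my knowledge the Sphinx manual gives this expression as the equivalent of the default SPH_RANK_PROXIMITY_BM25 ranker, but treat that as an assumption and check the manual for your version. The function name and the dictionary inputs are hypothetical.

```python
from typing import Dict


def proximity_bm25_style_weight(field_lcs: Dict[str, int],
                                user_weights: Dict[str, int],
                                bm25: int) -> int:
    """Evaluate sum(lcs*user_weight)*1000+bm25 for one document."""
    # Fields without a user weight count with weight 1 (an assumption of this sketch).
    proximity = sum(lcs * user_weights.get(field, 1) for field, lcs in field_lcs.items())
    return proximity * 1000 + bm25


# SphinxQL clause that would request the same formula; the SELECT it is
# appended to is up to you and is not shown here.
RANKER_OPTION = "OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25')"
```

The factor of 1000 means the phrase proximity term dominates, with BM25 acting mostly as a tie-breaker between documents whose weighted proximity is equal.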

You should specify the SPH_RANK_ prefix and use capital letters only
when using the SetRankingMode() call from the SphinxAPI. The API ports
expose these as global constants. Using SphinxQL syntax, the prefix
should be omitted and the ranker name is case insensitive.

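Use the SPH_RANK_ prefix, and capital letters only, when calling SetRankingMode() from SphinxAPI; the API ports define these modes as global constants. In SphinxQL, omit the prefix; the ranker name is case insensitive (it is the value of the ranker option).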

Example:

// SphinxAPI
$client->SetRankingMode ( SPH_RANK_SPH04 );

// SphinxQL
mysql_query ( "SELECT ... OPTION ranker=sph04" );
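The naming rule (prefix plus capitals in SphinxAPI, no prefix and any case in SphinxQL) can be captured in a small helper. This Python sketch is a convenience illustration only; `api_to_sphinxql` and `ranker_option` are hypothetical names and are not part of any Sphinx client.

```python
API_PREFIX = "SPH_RANK_"


def api_to_sphinxql(constant_name: str) -> str:
    """Turn an API constant name such as SPH_RANK_SPH04 into the SphinxQL ranker name."""
    if not constant_name.startswith(API_PREFIX):
        raise ValueError(f"expected a name starting with {API_PREFIX}: {constant_name!r}")
    # SphinxQL ranker names are case insensitive; lowercase is used here by choice.
    return constant_name[len(API_PREFIX):].lower()


def ranker_option(constant_name: str) -> str:
    """Build the OPTION clause to append to a SphinxQL SELECT statement."""
    return f"OPTION ranker={api_to_sphinxql(constant_name)}"
```

For instance, `ranker_option("SPH_RANK_SPH04")` yields the same clause as the SphinxQL example above, `OPTION ranker=sph04`.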
