A String-Matching Algorithm Problem That Started with "Bananas"

Posted by whysear on 2014-09-10

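Earlier this week I noticed that some of the methods on string slices in Rust had no examples in the documentation, so I decided to try writing a few. The first method I picked was contains, which tests whether one string is a substring of another. After a little thought I settled on an example like this: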

"bananas".contains("nana")

Since I had never used this method before, I wanted to confirm that it behaved the way I expected, so I ran the example on http://play.rust-lang.org.

The result was false.

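After rereading the documentation and trying other examples (all of which worked fine), I suspected that something odd was going on with this method on play.rust-lang.org, so I decided to test it locally. Same result. I downloaded the latest Rust nightly and ran the example again. Once more, Rust informed me that "bananas" does not contain "nana".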

I decided to run contains against every substring of "bananas", to prove that I had not suddenly forgotten how letters work in real life:

fn main() {
    let b = "bananas";
    for i in range(0, b.len()) {
        for j in range(i, b.len() + 1) {
            let curr = b.slice(i, j);
            println!("{} - {}", b.contains(curr), curr);
        }
    }
}

The output was:

true - 
true - b
true - ba
true - ban
true - bana
true - banan
true - banana
true - bananas
true - 
true - a
true - an
true - ana
true - anan
true - anana
true - ananas
true - 
true - n
true - na
true - nan
false - nana
true - nanas
true - 
true - a
true - an
true - ana
true - anas
true - 
true - n
true - na
true - nas
true - 
true - a
true - as
true - 
true - s

"nana" is the only substring of "bananas" for which contains returns false.
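The loop above relies on pre-1.0 APIs (range and slice) that no longer exist in Rust. As a rough sketch, the same check in current Rust could look like the following. The name failing_substrings is a hypothetical helper, not something from the original post. On a standard library where the bug is fixed, it should find no failing substrings.

```rust
/// Returns every substring of `s` for which `s.contains(substring)` is false.
/// With a correct `contains`, this list should be empty.
/// (`failing_substrings` is a hypothetical helper name used for illustration.)
fn failing_substrings(s: &str) -> Vec<&str> {
    let mut failures = Vec::new();
    for i in 0..s.len() {
        for j in i..=s.len() {
            // Skip byte offsets that fall inside a multi-byte character.
            if !s.is_char_boundary(i) || !s.is_char_boundary(j) {
                continue;
            }
            let curr = &s[i..j];
            if !s.contains(curr) {
                failures.push(curr);
            }
        }
    }
    failures
}

fn main() {
    let failures = failing_substrings("bananas");
    println!("substrings not found: {:?}", failures);
}
```

Collecting the failures rather than printing every line keeps the interesting case ("nana" in the 2014 run) easy to spot.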

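I was delighted: I had found a bug in Rust's string-matching implementation. I was at Hacker School at the time, and I had nothing better to do than spend a whole day hunting a hidden bug in the standard library of a pre-release programming language. So I decided to fix it.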

This particular problem turned out to be the result of two separate bugs. Here is the code containing the first one:

impl Searcher {
    fn new(haystack: &[u8], needle: &[u8]) -> Searcher {
        // FIXME: Tune this.
        if needle.len() > haystack.len() - 20 {
            Naive(NaiveSearcher::new())
        } else {
            let searcher = TwoWaySearcher::new(needle);
            if searcher.memory == uint::MAX { // If the period is long
                TwoWayLong(searcher)
            } else {
                TwoWay(searcher)
            }
        }
    }
}

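This is the constructor for Searcher, which is used to perform string matching. The intent of this code is to use NaiveSearcher when the difference between the length of the haystack (the string we are searching in) and the length of the needle (the string we are searching for) is less than 20, and to use TwoWaySearcher otherwise. (NaiveSearcher is an implementation of a simple string-matching algorithm.)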
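For readers unfamiliar with the idea, here is a minimal sketch of what a naive search does: try every starting offset in the haystack and compare. This is an illustration only, not the standard library's NaiveSearcher, and naive_find is a hypothetical name.

```rust
/// Naive substring search over bytes: try every starting offset in turn.
/// Returns the index of the first match, if any.
/// (`naive_find` is a hypothetical name, not part of the Rust standard library.)
fn naive_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // The empty needle matches at the very start.
        return Some(0);
    }
    // `windows` yields nothing when the needle is longer than the haystack.
    haystack
        .windows(needle.len())
        .position(|window| window == needle)
}

fn main() {
    println!("{:?}", naive_find(b"bananas", b"nana")); // Some(2)
    println!("{:?}", naive_find(b"bananas", b"nanna")); // None
}
```

A search like this is slow in the worst case but hard to get wrong, which makes it a handy reference when checking a faster algorithm.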

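When haystack.len() is less than 20, however, haystack.len() - 20 becomes a very large number: we have hit an underflow error. This bug caused the code to incorrectly use TwoWaySearcher whenever the haystack was shorter than 20, and in particular in the case of "bananas".contains("nana"). The fix is to add 20 to the needle length instead of subtracting 20 from the haystack length: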

if needle.len() + 20 > haystack.len() {

My pull request for the first bug is here.
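The underflow can be reproduced in isolation. In current Rust, a plain haystack.len() - 20 on a short haystack panics in debug builds, so the sketch below uses wrapping_sub to make the wraparound explicit, assuming (as the post describes) that the 2014 code wrapped silently. The function names are hypothetical and the sketch works on lengths only.

```rust
/// The original condition, with the unsigned wraparound made explicit.
/// (Hypothetical helper; models `needle.len() > haystack.len() - 20`.)
fn picks_naive_original(needle_len: usize, haystack_len: usize) -> bool {
    needle_len > haystack_len.wrapping_sub(20)
}

/// The fixed condition: add 20 on the needle side instead of subtracting.
/// (Hypothetical helper; models `needle.len() + 20 > haystack.len()`.)
fn picks_naive_fixed(needle_len: usize, haystack_len: usize) -> bool {
    needle_len + 20 > haystack_len
}

fn main() {
    let needle_len = "nana".len(); // 4
    let haystack_len = "bananas".len(); // 7

    // 7 - 20 does not fit in an unsigned integer, so it wraps to a huge value.
    println!("wrapped: {}", haystack_len.wrapping_sub(20));
    println!("checked: {:?}", haystack_len.checked_sub(20)); // None

    println!("original picks naive: {}", picks_naive_original(needle_len, haystack_len));
    println!("fixed picks naive:    {}", picks_naive_fixed(needle_len, haystack_len));
}
```

With the original condition the short haystack is sent to TwoWaySearcher; with the fixed one it goes to the naive searcher, as intended.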

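This change fixed the problem I had found with the "bananas" example, but only by making the method call a different, simpler string-matching algorithm that did not have the problem. There was still a bug in the other string-matching algorithm, TwoWaySearcher. For example, in "012345678901234567890anana".contains("nana") the haystack is large enough for TwoWaySearcher to be used, yet the call still returns false.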

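This problem proved harder to pin down precisely, and I am not entirely sure that the change I submitted is correct, so I will only explain it briefly. TwoWaySearcher is an implementation of the "Two-way algorithm", which was first introduced in this paper. While reading the paper and the well-commented glibc implementation of the Two-way algorithm, I noticed that one part of the Rust code did not exactly match what the paper describes. You can find this function on p. 670 of the paper: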

Pseudocode for the Small-Period function

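The Two-way algorithm takes two strings as input, a "haystack" and a "needle", and tries to find the starting position of the first occurrence of the needle in the haystack. The first step of this particular algorithm is to split the needle into two parts in a way that satisfies certain properties (that is, to find a pair of strings (u, v) such that needle = u + v).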

Now compare the pseudocode above with the code of the TwoWaySearcher constructor in the Rust implementation:

fn new(needle: &[u8]) -> TwoWaySearcher {
    let (critPos1, period1) = TwoWaySearcher::maximal_suffix(needle, false);
    let (critPos2, period2) = TwoWaySearcher::maximal_suffix(needle, true);

    let critPos;
    let period;
    if critPos1 > critPos2 {
        critPos = critPos1;
        period = period1;
    } else {
        critPos = critPos2;
        period = period2;
    }

    let byteset = needle.iter()
                        .fold(0, |a, &b| (1 << ((b & 0x3f) as uint)) | a);

    if needle.slice_to(critPos) == needle.slice_from(needle.len() - critPos) {
        TwoWaySearcher {
            critPos: critPos,
            period: period,
            byteset: byteset,

            position: 0,
            memory: 0
        }
    } else {
        TwoWaySearcher {
            critPos: critPos,
            period: cmp::max(critPos, needle.len() - critPos) + 1,
            byteset: byteset,

            position: 0,
            memory: uint::MAX // Dummy value to signify that the period is long
        }
    }
}

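Pay particular attention to the if statement at the end of TwoWaySearcher::new(). The first branch uses period, and the second uses max(critPos, needle.len() - critPos) + 1. These correspond to lines 5 and 6 of the pseudocode above. Everything else matches quite well, except for the condition checked by the if statement: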

needle.slice_to(critPos) == needle.slice_from(needle.len() - critPos)

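In Python-like pseudocode, this line checks if needle[:l] == needle[(n - l):], where I use l in place of critPos and n for the length of needle. The paper's code, by comparison, requires us to check whether needle[:l] is a suffix of needle[l:(p + l)], where p is the period of the needle's suffix. These two checks are not the same, and I strongly suspected that this was where the problem lay.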

An equivalent way to check whether needle[:l] is a suffix of needle[l:(p + l)] is to test if needle[:l] == needle[p:(p + l)]. So my guess was that the if condition in Rust needed to change to this:

needle.slice_to(critPos) == needle.slice(period, period + critPos)
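To see how the two conditions can disagree, here is a small sketch in current Rust syntax that evaluates both on the needle "nana". The helper names are hypothetical, and the values crit_pos = 1 and period = 2 are an assumption worked out by hand from the constructor shown earlier, not output from the Rust library.

```rust
/// The check as originally written: compare the first `crit_pos` bytes
/// with the last `crit_pos` bytes of the needle. (Hypothetical helper.)
fn original_check(needle: &[u8], crit_pos: usize) -> bool {
    needle.len() >= crit_pos
        && needle[..crit_pos] == needle[needle.len() - crit_pos..]
}

/// The proposed check: compare the first `crit_pos` bytes with the
/// `crit_pos` bytes that start at offset `period`. (Hypothetical helper.)
fn proposed_check(needle: &[u8], crit_pos: usize, period: usize) -> bool {
    match needle.get(period..period + crit_pos) {
        Some(shifted) => &needle[..crit_pos] == shifted,
        None => false, // the shifted window would run past the end
    }
}

fn main() {
    let needle = b"nana";
    // Assumed values for "nana", derived by hand (see the note above).
    let (crit_pos, period) = (1, 2);

    // Compares "n" with the final "a".
    println!("original: {}", original_check(needle, crit_pos));
    // Compares "n" with the "n" at offset 2.
    println!("proposed: {}", proposed_check(needle, crit_pos, period));
}
```

Under these assumed values the original condition is false, so the constructor would take the long-period branch for a needle that is in fact periodic, while the proposed condition is true.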

This new logic also seems to match the glibc implementation, because the two_way_long_needle function has an if statement like the following:

/* Perform the search. Each iteration compares the right half
     first. */
  if (CMP_FUNC (needle, needle + period, suffix) == 0)
    {
      /* Entire needle is periodic; a mismatch can only advance by the
     period, so use memory to avoid rescanning known occurrences
     of the period.  */

      ...

    }
  else
    {
      /* The two halves of needle are distinct; no extra memory is
     required, and any mismatch results in a maximal shift.  */

      ...

    }

(You need to look at the CMP_FUNC macro to see why this is the same.) Also, after making this change, all of the failing examples above now run correctly.

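I have an open PR with these changes, though since I have not taken the time to fully understand the algorithm, I cannot be completely sure it is correct.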

The funny thing is that if I had not noticed the first bug, I would never have found the second one.
