An Analysis of the split() Function (Part 1)

Posted by 居合子 on 2019-03-04

Original URL: https://flycode.co/archives/285036

Origin

There is a reason I suddenly started digging into split(): last night a very capable senior labmate posted this question in our lab's group chat:

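Suppose there is an array array = {"AB", "12"} and a string string = "abcAB0123", and there is a function f(String s)
such that {"abc", "AB", "0", "12", "3"} == f(string). In other words, string is split at the elements of array, and those elements are kept in the result. Find the function f.
PS: string.split("AB|12") returns {"abc", "0", "3"}, which does not satisfy the condition.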

Since string.split() cannot do this, let's look at how split() is implemented in the source and adapt it.

Tracing the source

First comes String.split(String regex), which is implemented like this:

public String[] split(String regex) {
    return split(regex, 0);
}
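The 0 passed as limit has a visible effect: with limit 0 the number of pieces is unlimited and trailing empty strings are dropped, a negative limit keeps them, and a positive limit caps the number of pieces. The sketch below illustrates this; the class name SplitLimitDemo and the helper show are made up for the example.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Hypothetical helper: splits s and renders the result as text.
    static String show(String s, String regex, int limit) {
        return Arrays.toString(s.split(regex, limit));
    }

    public static void main(String[] args) {
        String s = "a,b,,";
        System.out.println(Arrays.toString(s.split(","))); // [a, b]   (same as limit 0)
        System.out.println(show(s, ",", 0));               // [a, b]   trailing empties dropped
        System.out.println(show(s, ",", -1));              // [a, b, , ] trailing empties kept
        System.out.println(show(s, ",", 2));               // [a, b,,] at most two pieces
    }
}
```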

Following the call, let's see how String.split(String regex, int limit) is implemented internally:

/**
* @param  regex
*         the delimiting regular expression
*         (regex may be a regular expression)
* @param  limit
*         the result threshold, as described above
*         (limits the number of pieces the string is split into)
* @return  the array of strings computed by splitting this string
*          around matches of the given regular expression
*
* @throws  PatternSyntaxException
*          if the regular expression's syntax is invalid
*/
public String[] split(String regex, int limit) {
    /* fastpath if the regex is a
     (1)one-char String and this character is not one of the
        RegEx's meta characters ".$|()[{^?*+\\", or
     (2)two-char String and the first char is the backslash and
        the second is not the ascii digit or ascii letter.
     */
    char ch = 0;
    if (((regex.value.length == 1 &&
         ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 &&
          regex.charAt(0) == '\\' &&
          (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
          ((ch-'a')|('z'-ch)) < 0 &&
          ((ch-'A')|('Z'-ch)) < 0)) &&
        (ch < Character.MIN_HIGH_SURROGATE ||
         ch > Character.MAX_LOW_SURROGATE))
    {
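        int off = 0;  // offset: start of the current segment
        int next = 0; // position of the next cut
        boolean limited = limit > 0;  // whether a limit applies; limit <= 0 means unlimited
        ArrayList<String> list = new ArrayList<>(); // holds the segments cut so far
        while ((next = indexOf(ch, off)) != -1) { // the character still occurs at or after off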
            if (!limited || list.size() < limit - 1) {
                list.add(substring(off, next));
                off = next + 1;
            } else {    // last one
                //assert (list.size() == limit - 1);
                list.add(substring(off, value.length));
                off = value.length;
                break;
            }
        }
        // If no match was found, return this
        if (off == 0)
            return new String[]{this};

        // Add remaining segment
        if (!limited || list.size() < limit)
            list.add(substring(off, value.length));

        // Construct result
        int resultSize = list.size();
        if (limit == 0) {
            while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                resultSize--;
            }
        }
        String[] result = new String[resultSize];
        return list.subList(0, resultSize).toArray(result);
    }
    return Pattern.compile(regex).split(this, limit);
}

First, look at the comment at the top of the method body:

fastpath if the regex is a
(1)one-char String and this character is not one of the
RegEx's meta characters ".$|()[{^?*+\\", or
(2)two-char String and the first char is the backslash and
the second is not the ascii digit or ascii letter.

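In other words:

When regex is a one-character string and that character is not a regex metacharacter, or
regex is a two-character string whose first character is a backslash (\) and whose second character is not an ASCII digit or letter (this still stands for a single character: an escaped literal such as \. or \|),

the method takes the fast path: it never compiles a Pattern and simply scans for that one character with indexOf.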

The corresponding code is:

if (((regex.value.length == 1 &&
     ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
     (regex.length() == 2 &&
      regex.charAt(0) == '\\' &&
      (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
      ((ch-'a')|('z'-ch)) < 0 &&
      ((ch-'A')|('Z'-ch)) < 0)) &&
    (ch < Character.MIN_HIGH_SURROGATE ||
     ch > Character.MAX_LOW_SURROGATE))

The if statement above decides whether the fast path applies:

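(regex.value.length == 1 && ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1): the length is 1 and the character is not one of the regex metacharacters . $ | ( ) [ { ^ ? * + \
(regex.length() == 2 && regex.charAt(0) == '\\' && (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 && ((ch-'a')|('z'-ch)) < 0 && ((ch-'A')|('Z'-ch)) < 0): the length is 2, the first character is a backslash, and the second character is neither a letter nor a digit. The test ((ch-'A')|('Z'-ch)) < 0 is a neat, unusual way to check that ch lies outside a range: if ch is below 'A' or above 'Z', one of the two differences is negative, so the bitwise OR has its sign bit set.
(ch < Character.MIN_HIGH_SURROGATE || ch > Character.MAX_LOW_SURROGATE): this condition is required in both cases. ch must not be a UTF-16 surrogate (\uD800 to \uDFFF), because a surrogate is only half of a supplementary character.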
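To make the three conditions easier to experiment with, here is a minimal sketch that restates them as a standalone check. FastPathCheck, outside and isFastPath are hypothetical names invented for this example; they are not part of the JDK, and the code only mirrors the condition quoted above.

```java
class FastPathCheck {
    // True when ch lies outside [lo, hi]. If ch < lo then (ch - lo) is negative;
    // if ch > hi then (hi - ch) is negative. The OR of two ints is negative
    // exactly when at least one of them has its sign bit set.
    static boolean outside(char ch, char lo, char hi) {
        return ((ch - lo) | (hi - ch)) < 0;
    }

    // Hypothetical helper: would split(regex, limit) take the fast path?
    static boolean isFastPath(String regex) {
        char ch;
        if (regex.length() == 1) {
            ch = regex.charAt(0);
            if (".$|()[{^?*+\\".indexOf(ch) != -1) {
                return false; // a bare metacharacter needs the regex engine
            }
        } else if (regex.length() == 2 && regex.charAt(0) == '\\') {
            ch = regex.charAt(1);
            boolean letterOrDigit = !outside(ch, '0', '9')
                    || !outside(ch, 'a', 'z')
                    || !outside(ch, 'A', 'Z');
            if (letterOrDigit) {
                return false; // e.g. \d or \w is a character class, not a literal
            }
        } else {
            return false;
        }
        // A surrogate is only half of a supplementary character.
        return ch < Character.MIN_HIGH_SURROGATE || ch > Character.MAX_LOW_SURROGATE;
    }

    public static void main(String[] args) {
        System.out.println(isFastPath(","));     // true
        System.out.println(isFastPath("|"));     // false
        System.out.println(isFastPath("\\|"));   // true
        System.out.println(isFastPath("\\d"));   // false
        System.out.println(isFastPath("AB|12")); // false
    }
}
```

The last call shows why the regex from the opening question never reaches the fast path: it is longer than two characters.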

Reading on, a few variables are defined:

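int off = 0;  // offset: start of the current segment
int next = 0; // position of the next occurrence of the target character
boolean limited = limit > 0;  // whether a limit applies
ArrayList<String> list = new ArrayList<>(); // holds the segments cut so far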

Then comes a while loop: while ((next = indexOf(ch, off)) != -1)

It uses the method indexOf(int ch, int fromIndex), which the documentation describes like this:

Returns the index within this string of the first occurrence of the specified character, starting the search at the specified index.

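That is: it returns the index of the first occurrence of the target character at or after fromIndex, or -1 if there is none.

So the while loop keeps running as long as the rest of the string still contains the target character.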
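The same scanning pattern can be tried in isolation. The sketch below collects every position of a character using the identical loop shape; IndexOfDemo and positions are names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfDemo {
    // Hypothetical helper: every index at which ch occurs in s, found by
    // calling indexOf(ch, fromIndex) repeatedly, as the split() loop does.
    static List<Integer> positions(String s, char ch) {
        List<Integer> found = new ArrayList<>();
        int off = 0;
        int next;
        while ((next = s.indexOf(ch, off)) != -1) {
            found.add(next);
            off = next + 1; // resume just past the match
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println("a,b,c".indexOf(',', 1)); // 1: fromIndex itself is included
        System.out.println("a,b,c".indexOf(',', 4)); // -1: no comma at or after index 4
        System.out.println(positions("a,b,c", ',')); // [1, 3]
    }
}
```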

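The rest is much simpler. Using off and the position of the next occurrence, substring cuts out a segment, which is added to list. If list already holds limit - 1 segments, the whole remainder of the string is added as the final segment and the loop ends. Note that the offset is adjusted on every iteration: off = next + 1 unless this is the final iteration, in which case off = value.length.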

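After the loop, if off is still 0, nothing matched, so the method returns an array containing only the original string. Otherwise it appends the remaining segment, drops trailing empty strings when limit == 0, and returns list converted to a String array.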
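Putting the walkthrough together, the fast path can be rewritten as a small standalone method. This is a sketch for study purposes, not the JDK code itself: FastPathSplit and splitOnChar are hypothetical names, and the method takes the string and delimiter as parameters instead of reading the String's internal fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FastPathSplit {
    // Hypothetical standalone version of the fast path for one literal char.
    static String[] splitOnChar(String s, char ch, int limit) {
        int off = 0;                 // start of the current segment
        int next;                    // index of the next delimiter
        boolean limited = limit > 0;
        List<String> list = new ArrayList<>();
        while ((next = s.indexOf(ch, off)) != -1) {
            if (!limited || list.size() < limit - 1) {
                list.add(s.substring(off, next));
                off = next + 1;
            } else {
                // Only one slot left: the remainder becomes the last piece.
                list.add(s.substring(off));
                off = s.length();
                break;
            }
        }
        if (off == 0) {
            return new String[] { s }; // no delimiter found at all
        }
        if (!limited || list.size() < limit) {
            list.add(s.substring(off)); // remaining segment
        }
        int resultSize = list.size();
        if (limit == 0) {
            // limit 0 discards trailing empty strings
            while (resultSize > 0 && list.get(resultSize - 1).isEmpty()) {
                resultSize--;
            }
        }
        return list.subList(0, resultSize).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnChar("a,b,,", ',', 0)));  // [a, b]
        System.out.println(Arrays.toString(splitOnChar("a,b,c", ',', 2)));  // [a, b,c]
        System.out.println(Arrays.toString(splitOnChar("abc", ',', 0)));    // [abc]
    }
}
```

For a delimiter that qualifies for the fast path, the output should agree with s.split(...) called with the same limit, which makes the sketch easy to check against the real method.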

Going deeper

Did you think that was the end of it?

Don't forget that everything above covers only the fast path, that is, splitting a string on a single character.

Pattern.compile(regex).split(this, limit) is the other case; I'll cover it in the next installment.

PS: the senior has already solved the problem, using a tokenizer (構詞器).
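For readers who want something to try right away, here is one possible tokenizer-style sketch for the opening problem. It is not the senior's actual code, which the post does not show; KeepDelimiterSplit and splitKeeping are hypothetical names, and the sketch assumes at least one delimiter and that no delimiter is empty. It walks the matches with Matcher.find() and emits both the text between matches and the matches themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeepDelimiterSplit {
    // Hypothetical helper: splits input at the given delimiters and keeps
    // the delimiters as tokens. Assumes non-empty delimiters.
    static List<String> splitKeeping(String input, String... delimiters) {
        // Quote each delimiter so it is matched literally, then join with |.
        StringBuilder alternation = new StringBuilder();
        for (String d : delimiters) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(d));
        }
        Matcher m = Pattern.compile(alternation.toString()).matcher(input);
        List<String> tokens = new ArrayList<>();
        int off = 0; // end of the previous match
        while (m.find()) {
            if (m.start() > off) {
                tokens.add(input.substring(off, m.start())); // text before the match
            }
            tokens.add(m.group()); // the delimiter itself
            off = m.end();
        }
        if (off < input.length()) {
            tokens.add(input.substring(off)); // trailing text
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitKeeping("abcAB0123", "AB", "12")); // [abc, AB, 0, 12, 3]
    }
}
```

Unlike split(), this version skips empty segments between adjacent delimiters; whether that is desirable depends on how f is meant to behave in that case, which the original question does not specify.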

