5. Python 3 Source Code: The String (str) Object

Posted by whj0709 on 2018-06-06

Original URL: https://flycode.co/archives/111974

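5.1. The string object

A string object is a "variable-length object": the memory it occupies depends on the length of the text it holds.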

5.1.1. Creation from Python

The most important function for creating string (str) objects is PyUnicode_DecodeUTF8Stateful. Python statements such as the following eventually call into it:

a = 'hello'
b = str('world')

5.1.2. C call stack of PyUnicode_DecodeUTF8Stateful

When the parser turns a string literal into an AST node, it ends up in PyUnicode_DecodeUTF8Stateful. The call order is:

// ast.c
ast_for_expr
=>ast_for_power
=>ast_for_atom_expr
=>ast_for_atom (case STRING)
=>parsestrplus
=>parsestr

// unicodeobject.c
=> PyUnicode_DecodeUTF8Stateful

5.1.3. PyUnicode_DecodeUTF8Stateful原始碼

// unicodeobject.c
PyObject *
PyUnicode_DecodeUTF8Stateful(const char *s,
                             Py_ssize_t size,
                             const char *errors,
                             Py_ssize_t *consumed)
{
    _PyUnicodeWriter writer;
    const char *starts = s;
    const char *end = s + size;

    Py_ssize_t startinpos;
    Py_ssize_t endinpos;
    const char *errmsg = "";
    PyObject *error_handler_obj = NULL;
    PyObject *exc = NULL;
    _Py_error_handler error_handler = _Py_ERROR_UNKNOWN;

    if (size == 0) {
        if (consumed)
            *consumed = 0;
        _Py_RETURN_UNICODE_EMPTY();
    }

    /* ASCII is equivalent to the first 128 ordinals in Unicode. */
    if (size == 1 && (unsigned char)s[0] < 128) {
        if (consumed)
            *consumed = 1;
        return get_latin1_char((unsigned char)s[0]);
    }

    _PyUnicodeWriter_Init(&writer);
    writer.min_length = size;
    if (_PyUnicodeWriter_Prepare(&writer, writer.min_length, 127) == -1)
        goto onError;

    writer.pos = ascii_decode(s, end, writer.data);
    s += writer.pos;
    while (s < end) {
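        // Reached only when ascii_decode() consumed fewer bytes than size,
        // i.e. non-ASCII input remains; the multi-byte decoding body is omitted here.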
    }

End:
    if (consumed)
        *consumed = s - starts;

    Py_XDECREF(error_handler_obj);
    Py_XDECREF(exc);
    return _PyUnicodeWriter_Finish(&writer);

onError:
    Py_XDECREF(error_handler_obj);
    Py_XDECREF(exc);
    _PyUnicodeWriter_Dealloc(&writer);
    return NULL;
}
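
The consumed out-parameter can be observed from Python. The sketch below assumes CPython, where codecs.utf_8_decode wraps this C function and returns the decoded text together with the number of bytes consumed. The sample data is made up for illustration.

```python
import codecs

data = "hi啊".encode("utf-8")      # 2 ASCII bytes + one 3-byte sequence
truncated = data[:-1]              # drop the last byte of the 3-byte sequence

# final=False is the stateful mode: an incomplete trailing sequence
# is left unconsumed instead of raising an error.
text, consumed = codecs.utf_8_decode(truncated, "strict", False)
print(repr(text), consumed)

# final=True treats the same truncated input as an error.
try:
    codecs.utf_8_decode(truncated, "strict", True)
except UnicodeDecodeError as exc:
    print("error:", exc.reason)

# With the complete input every byte is consumed.
full_text, full_consumed = codecs.utf_8_decode(data, "strict", True)
print(repr(full_text), full_consumed)
```

A streaming decoder can use the consumed count to keep the unread tail and prepend it to the next chunk.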

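Two things stand out:

Empty-string cache: the empty string (unicode_empty) always lives at a single address. The second time an empty string is needed, only its reference count is incremented. On the normal path the cache is applied in _PyUnicodeWriter_Finish (through unicode_result_ready); the size == 0 shortcut returns it directly via _Py_RETURN_UNICODE_EMPTY.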

// unicodeobject.c
static PyObject *unicode_empty = NULL;

#define _Py_INCREF_UNICODE_EMPTY()                      \
    do {                                                \
        if (unicode_empty != NULL)                      \
            Py_INCREF(unicode_empty);                   \
        else {                                          \
            unicode_empty = PyUnicode_New(0, 0);        \
            if (unicode_empty != NULL) {                \
                Py_INCREF(unicode_empty);               \
                assert(_PyUnicode_CheckConsistency(unicode_empty, 1)); \
            }                                           \
        }                                               \
    } while (0)

#define _Py_RETURN_UNICODE_EMPTY()                      \
    do {                                                \
        _Py_INCREF_UNICODE_EMPTY();                     \
        return unicode_empty;                           \
    } while (0)

// PyUnicode_DecodeUTF8Stateful->
// _PyUnicodeWriter_Finish->
// unicode_result_ready
static PyObject*
unicode_result_ready(PyObject *unicode)
{
    Py_ssize_t length;

    length = PyUnicode_GET_LENGTH(unicode);
    if (length == 0) {
        if (unicode != unicode_empty) {
            Py_DECREF(unicode);
            _Py_RETURN_UNICODE_EMPTY();
        }
        return unicode_empty;
    }

    if (length == 1) {
        void *data = PyUnicode_DATA(unicode);
        int kind = PyUnicode_KIND(unicode);
        Py_UCS4 ch = PyUnicode_READ(kind, data, 0);
        if (ch < 256) {
            PyObject *latin1_char = unicode_latin1[ch];
            if (latin1_char != NULL) {
                if (unicode != latin1_char) {
                    Py_INCREF(latin1_char);
                    Py_DECREF(unicode);
                }
                return latin1_char;
            }
            else {
                assert(_PyUnicode_CheckConsistency(unicode, 1));
                Py_INCREF(unicode);
                unicode_latin1[ch] = unicode;
                return unicode;
            }
        }
    }

    assert(_PyUnicode_CheckConsistency(unicode, 1));
    return unicode;
}
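
The empty-string cache can be checked from Python. This is a minimal sketch that relies on a CPython implementation detail; the language itself does not promise that empty strings share one object. The variable names are illustrative only.

```python
# Assumption: CPython. Object identity of empty strings is an
# implementation detail, not a language guarantee.
source = "abc"
empties = [
    str(),                  # no-argument constructor
    "".join([]),            # joining nothing
    b"".decode("utf-8"),    # decoding zero bytes (the size == 0 shortcut)
    source[:0],             # empty slice
    source * 0,             # zero repetitions
]

first = empties[0]
all_same = all(e is first for e in empties)
print(all_same)
```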

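Single-character cache: each one-character string in the unicode_latin1 table lives at a single address. The second time that character is needed, only its reference count is incremented. The cache is implemented in get_latin1_char.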

// unicodeobject.c
static PyObject *unicode_latin1[256] = {NULL};

PyObject *
PyUnicode_DecodeUTF8Stateful(const char *s,
                             Py_ssize_t size,
                             const char *errors,
                             Py_ssize_t *consumed)
{
      // do sth.

    /* ASCII is equivalent to the first 128 ordinals in Unicode. */
    if (size == 1 && (unsigned char)s[0] < 128) {
        if (consumed)
            *consumed = 1;
        return get_latin1_char((unsigned char)s[0]);
    }

      // do sth.
}

static PyObject*
get_latin1_char(unsigned char ch)
{
    PyObject *unicode = unicode_latin1[ch];
    if (!unicode) {
        unicode = PyUnicode_New(1, ch);
        if (!unicode)
            return NULL;
        PyUnicode_1BYTE_DATA(unicode)[0] = ch;
        assert(_PyUnicode_CheckConsistency(unicode, 1));
        unicode_latin1[ch] = unicode;
    }
    Py_INCREF(unicode);
    return unicode;
}
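
The single-character cache can be observed in the same way. The sketch below assumes CPython. It compares one-character strings produced by different run-time paths, then repeats the experiment with a code point outside Latin-1.

```python
# Assumption: CPython. Identity of one-character strings is an
# implementation detail.
word = "banana"
a = chr(97)                   # built from a code point at run time
b = b"a".decode("utf-8")      # one ASCII byte: the size == 1 fast path
c = word[1]                   # indexing yields a one-character string
print(a is b, b is c)

# U+554A lies outside the 256-entry table, so nothing is shared here.
x = chr(0x554A)
y = chr(0x554A)
print(x == y, x is y)
```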

5.2. The constant string pool

a = 'hello'
b = 'hello'
a is b  # True

The example above shows that Python caches constant strings. The key part of this cache is implemented in PyUnicode_InternInPlace.
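
To rule out sharing within a single compilation unit, the sketch below compiles the same literal twice with eval, so each constant belongs to a separate code object. It assumes CPython; the result for the second pair is deliberately not stated because it varies between interpreter versions and builds.

```python
# Each eval() call compiles its source text into a separate code object.
a = eval("'hello'")
b = eval("'hello'")
print(a is b)       # identifier-like constant, interned at compile time

# A constant with characters outside [0-9A-Za-z_]: the two objects are
# equal, but whether they are identical depends on the interpreter.
c = eval("'hello world!'")
d = eval("'hello world!'")
print(c == d, c is d)
```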

5.2.1. C call stack of PyUnicode_InternInPlace

// compile.c
assemble
=>makecode
// codeobject.c
=>PyCode_New
=>intern_string_constants
// unicodeobject.c
=>PyUnicode_InternInPlace

5.2.2. PyUnicode_InternInPlace source

// unicodeobject.c
static PyObject *interned = NULL;

void
PyUnicode_InternInPlace(PyObject **p)
{
    PyObject *s = *p;
    PyObject *t;
#ifdef Py_DEBUG
    assert(s != NULL);
    assert(_PyUnicode_CHECK(s));
#else
    if (s == NULL || !PyUnicode_Check(s))
        return;
#endif
    /* If it's a subclass, we don't really know what putting
       it in the interned dict might do. */
    if (!PyUnicode_CheckExact(s))
        return;
    if (PyUnicode_CHECK_INTERNED(s))
        return;
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }
    Py_ALLOW_RECURSION
    t = PyDict_SetDefault(interned, s, s);
    Py_END_ALLOW_RECURSION
    if (t == NULL) {
        PyErr_Clear();
        return;
    }
    if (t != s) {
        Py_INCREF(t);
        Py_SETREF(*p, t);
        return;
    }
    /* The two references in interned are not counted by refcnt.
       The deallocator will take care of this */
    Py_REFCNT(s) -= 2;
    _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
}

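The key call is PyDict_SetDefault, which is defined in dictobject.c. If the dict has no equal key (here s), it stores the default object (here also s) and returns it; if an equal key already exists, it returns the value stored for that key. So when t differs from s, the dict already holds an equal string: the reference count of t is incremented and the pointer to the constant string (*p) is redirected to t.

As a result, equal constant strings end up at the same object address. The reference to s is dropped at that point, and s is freed if its count reaches 0. Note that memory is still allocated once for every constant string; if an equal string was interned earlier, the newly allocated object is simply released once its reference count drops to 0.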
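
The setdefault step can be modelled in a few lines of Python. This is only a sketch of the lookup logic: the names _interned and intern_in_place are hypothetical, and the reference-count adjustments and the interned state flag of the C code are not modelled.

```python
_interned = {}   # hypothetical stand-in for the C-level `interned` dict


def intern_in_place(s):
    """Return the canonical object for s, mimicking PyDict_SetDefault(interned, s, s)."""
    if type(s) is not str:          # mirrors the PyUnicode_CheckExact guard
        return s
    # First call with a given value stores and returns s itself;
    # later calls with an equal string return the stored object.
    return _interned.setdefault(s, s)


# Two equal strings built at run time, hence two distinct objects.
a = "".join(["he", "llo"])
b = "".join(["hel", "lo"])
print(a == b, a is b)

canon_a = intern_in_place(a)
canon_b = intern_in_place(b)
print(canon_a is a, canon_b is a)
```

The second call allocates nothing new: b stays a separate object that is released as soon as nothing refers to it, which matches the behaviour described above.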

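In some cases (the string contains characters other than ASCII letters, digits and the underscore) the constant is not put into the dictionary. When such statements are compiled separately, for example typed line by line in the interactive interpreter, the objects differ, and sys.intern can be used as a performance optimization:

import sys
a = '啊'
b = '啊'
a is b    # False

a = sys.intern('啊')
b = sys.intern('啊')
a is b    # True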

For details, see: memory - What does python sys.intern do, and when should it be used? - Stack Overflow

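5.3. Properties of the string object

The str type fills in three groups of operations: tp_as_number, tp_as_sequence and tp_as_mapping. For str, the number group only provides % formatting through nb_remainder.

5.3.1. Numeric operations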

// unicodeobject.c
&unicode_as_number,                         /* tp_as_number */

5.3.2. Sequence operations

// unicodeobject.c
&unicode_as_sequence,                     /* tp_as_sequence */

// unicodeobject.c
static PySequenceMethods unicode_as_sequence = {
    (lenfunc) unicode_length,       /* sq_length */
    PyUnicode_Concat,           /* sq_concat */
    (ssizeargfunc) unicode_repeat,  /* sq_repeat */
    (ssizeargfunc) unicode_getitem,     /* sq_item */
    0,                  /* sq_slice */
    0,                  /* sq_ass_item */
    0,                  /* sq_ass_slice */
    PyUnicode_Contains,         /* sq_contains */
};

Because the assignment slots of PySequenceMethods (sq_ass_item and sq_ass_slice) are not implemented, strings are immutable.
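
From Python, the missing assignment slots show up as a TypeError. A minimal sketch, with variable names chosen for illustration:

```python
s = "hello"
errors = []

try:
    s[0] = "H"            # item assignment: no slot to handle it
except TypeError as exc:
    errors.append(str(exc))

try:
    del s[0]              # item deletion uses the same missing slot
except TypeError as exc:
    errors.append(str(exc))

# The usual workaround is to build a new string.
t = "H" + s[1:]
print(len(errors), s, t)
```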

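In detail:

unicode_length

len('hello')

PyUnicode_Concat

'hello' + 'world'

Concatenating many strings with + is less efficient than join, because join allocates memory only once;

unicode_repeat

'hello'*10

This is more efficient than adding the same string to itself repeatedly;

unicode_getitem: no corresponding Python statement found so far. Ordinary indexing such as 'hello'[1] is dispatched to mp_subscript (section 5.3.3) first, so sq_item is reached mainly through C API calls such as PySequence_GetItem;

PyUnicode_Contains

'h' in 'hello'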
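
The sequence slots above can be exercised together. The sketch below assumes CPython with the ctypes module available; it uses ctypes.pythonapi to call PySequence_GetItem directly, which is one way to reach the sq_item slot from a Python session. The slot named in each comment is the one the operation is expected to use.

```python
import ctypes
import operator

s = "hello"
print(len(s))                            # sq_length
print(operator.concat(s, " world"))      # sq_concat
print(s * 3 == s + s + s)                # sq_repeat
print("h" in s)                          # sq_contains

# Repeated + and a single join produce the same text; join computes the
# total size first and fills one buffer.
parts = ["a", "b", "c"] * 1000
acc = ""
for p in parts:
    acc += p
joined = "".join(parts)
print(acc == joined)

# Assumption: CPython. Call the C API function directly via ctypes.
get_item = ctypes.pythonapi.PySequence_GetItem
get_item.argtypes = [ctypes.py_object, ctypes.c_ssize_t]
get_item.restype = ctypes.py_object
second = get_item(s, 1)
print(second)
```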

5.3.3. Mapping operations

// unicodeobject.c
&unicode_as_mapping,                        /* tp_as_mapping */

// unicodeobject.c
static PyMappingMethods unicode_as_mapping = {
    (lenfunc)unicode_length,        /* mp_length */
    (binaryfunc)unicode_subscript,  /* mp_subscript */
    (objobjargproc)0,           /* mp_ass_subscript */
};

In detail:

unicode_subscript

test = 'hello world'
test[1]
test[0:5]

test[1] takes the index branch of unicode_subscript, and test[0:5] takes the slice branch;
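
A few more subscripts show how the two branches behave. This is a small sketch; the branch named in each comment follows the description above.

```python
test = "hello world"

by_index = test[1]            # index branch
from_end = test[-1]           # index branch, negative index counts from the end
by_slice = test[0:5]          # slice branch
stepped = test[::2]           # slice branch with a step
explicit = test[slice(0, 5)]  # the same slice, passed as an object
print(by_index, from_end, by_slice, stepped, explicit)

# Anything that is neither an integer nor a slice is rejected.
try:
    test["1"]
except TypeError as exc:
    print("TypeError:", exc)
```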

5.3.4. to string

// unicodeobject.c
unicode_repr,                                   /* tp_repr */
(reprfunc) unicode_str,                         /* tp_str */

5.3.5. hash

// unicodeobject.c
(hashfunc) unicode_hash,                        /* tp_hash*/

5.3.6. Comparison

// unicodeobject.c
PyUnicode_RichCompare,                      /* tp_richcompare */
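
The tp_repr, tp_str, tp_hash and tp_richcompare slots from sections 5.3.4 to 5.3.6 correspond to repr(), str(), hash() and the comparison operators. A brief sketch using two equal strings that are built at run time so they are separate objects:

```python
a = "".join(["he", "llo"])
b = "".join(["hel", "lo"])

print(repr(a))                # tp_repr: adds quotes and escapes
print(str(a))                 # tp_str: the text itself
print(hash(a) == hash(b))     # tp_hash: equal strings hash equally
print(a == b, a is b)         # tp_richcompare compares contents, `is` compares identity
print(a < "help")             # ordering is by code point
```

Equal hashes for equal strings are what allow the interned dict in section 5.2 to find an existing entry.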

5.3.7. Built-in methods

// unicodeobject.c
unicode_methods,                              /* tp_methods */

5.4. References

Python原始碼剖析 (Python Source Code Analysis)
