String Objects in Python (Notes on 《Python原始碼剖析》, Part 3)

Posted by 鬆直 on 2017-07-03
This is the third of my notes on the book 《Python原始碼剖析》. Learn Python by Analyzing Python Source Code · GitBook

String Objects in Python

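In Python 3, the str type is what Python 2 called unicode, and the old str type became the new bytes type. We could analyze the implementation of bytes, which is what 《Python原始碼剖析》 covers, but since we use str so often and understand it comparatively poorly, let's dissect this considerably more complex type instead.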

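As analyzed earlier, integer objects in Python 2 are fixed-length objects, whereas string objects are variable-length objects. A string object is also immutable: once created, its value can no longer be changed.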
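The immutability described above is easy to observe from Python itself. The following is a minimal sketch; the helper name try_to_mutate is made up for illustration. It shows that item assignment on a str raises TypeError, while "changing" a string really means building a new object.

```python
def try_to_mutate(text):
    """Attempt in-place item assignment and report what happened."""
    try:
        text[0] = "J"  # str does not support item assignment
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "mutated"


greeting = "hello"
outcome = try_to_mutate(greeting)

# Slicing and concatenation build a brand-new string; greeting is untouched.
replaced = "J" + greeting[1:]

print(outcome)
print(greeting, replaced)
```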

The Four Forms of Unicode Strings

In Python 3, a unicode string takes one of four forms:

  1. compact ascii

  2. compact

  3. legacy string, not ready

  4. legacy string, ready

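Compact means that the string object uses a single memory block to hold its contents; that is, the characters sit in memory immediately after the structure. A non-compact object, that is, a PyUnicodeObject, uses one memory block for the PyUnicodeObject structure and a separate block for the characters.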

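ASCII-only strings are created with PyUnicode_New and stored in a PyASCIIObject structure. Because ASCII text is byte-for-byte identical under UTF-8, the UTF-8 form of such a string is the data itself, so no separate UTF-8 copy is needed.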

Legacy strings are stored in a PyUnicodeObject.
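The difference between the compact ASCII form and the general compact form can be glimpsed from Python. This is a rough sketch that assumes CPython 3.3 or later; sys.getsizeof reports implementation-specific numbers, so only the comparison matters, not the absolute values. Both strings hold one character stored in one byte, yet the non-ASCII one should be larger because its header is the bigger PyCompactUnicodeObject.

```python
import sys

ascii_char = chr(0x61)   # 'a': eligible for the compact ASCII form
latin1_char = chr(0xE9)  # U+00E9: still one byte per character, but not ASCII

ascii_size = sys.getsizeof(ascii_char)
latin1_size = sys.getsizeof(latin1_char)

print("ASCII one-character string:  ", ascii_size, "bytes")
print("Latin-1 one-character string:", latin1_size, "bytes")
print("extra header bytes:          ", latin1_size - ascii_size)
```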

Let's look at the source code first and then go through the rest.

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;       
        unsigned int :24;
    } state;
    wchar_t *wstr;              /* wchar_t representation (null-terminated) */
} PyASCIIObject;
​
typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the
                                 * terminating \0. */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
    Py_ssize_t wstr_length;     /* Number of code points in wstr, possible
                                 * surrogates count as two code points. */
} PyCompactUnicodeObject;
​
typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;                     /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;

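Clearly, the whole string object machinery is built on PyASCIIObject, so let's look at it first. length stores the number of code points in the string. hash stores the string's hash value: since a string object is immutable, its hash never changes, so Python caches it in the hash field to avoid the cost of recomputing it (-1 means it has not been computed yet). The state structure holds information about the object that relates to the four forms described above. wstr points to a wchar_t representation of the string: for a legacy string that is not yet ready, this is where the value actually lives, whereas for a compact string the characters follow the structure directly and wstr is at most an optional wchar_t copy or alias of them.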
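The hash caching idea can be mimicked in pure Python. The class below is a toy model, not CPython's implementation; CachedHashText and its hash_computations counter are hypothetical names introduced only for this sketch. It uses -1 as the "not set" marker just as the C structure does, which works because Python hash values are never -1.

```python
class CachedHashText:
    """Toy immutable text wrapper that computes its hash at most once."""

    def __init__(self, value):
        self._value = value
        self._hash = -1              # -1 means "not computed yet"
        self.hash_computations = 0   # bookkeeping for the demonstration only

    def __hash__(self):
        if self._hash == -1:
            self.hash_computations += 1
            self._hash = hash(self._value)
        return self._hash

    def __eq__(self, other):
        return isinstance(other, CachedHashText) and self._value == other._value


text = CachedHashText("spam")
table = {text: 1}   # first use computes and stores the hash
table[text] += 1    # later lookups reuse the cached value
print(text.hash_computations)
```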

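What do the fields of the state structure mean? I removed the comments to save space, so let's go through them one by one. interned relates to the string intern mechanism and takes one of three values: SSTATE_NOT_INTERNED (0), SSTATE_INTERNED_MORTAL (1), and SSTATE_INTERNED_IMMORTAL (2), meaning not interned, interned but deletable, and permanently interned. The mechanism itself is covered later. kind indicates how many bytes are used to store each character (1, 2, or 4). compact has already been explained, and ascii is self-explanatory. ready says whether the object's layout has been initialized: if it is 1, either the object is compact or its data pointer has been set.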
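The effect of kind can be estimated from Python by checking how much a string grows per added character. This is a sketch that assumes CPython 3.3 or later; bytes_per_char is a helper invented for this example. Under that assumption it should report 1, 1, 2, and 4 bytes for the four sample characters.

```python
import sys


def bytes_per_char(ch, n=10):
    """Estimate per-character storage by comparing strings of n and 2n copies of ch."""
    shorter = ch * n
    longer = ch * (2 * n)
    return (sys.getsizeof(longer) - sys.getsizeof(shorter)) // n


samples = {
    "ASCII": "a",
    "Latin-1": "\u00e9",
    "BMP": "\u4e2d",
    "non-BMP": "\U0001f600",
}

widths = {name: bytes_per_char(ch) for name, ch in samples.items()}

for name, width in widths.items():
    print(f"{name:8} {width} byte(s) per character")
```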

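As mentioned, an ASCII string is created with PyUnicode_New and stored in a PyASCIIObject structure. A non-ASCII string created with PyUnicode_New is stored in a PyCompactUnicodeObject structure instead. A PyUnicodeObject is created with PyUnicode_FromUnicode(NULL, len); the actual string data initially lives in the wstr block and is then copied into the data block by _PyUnicode_Ready.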

Now let's look at PyUnicode_Type:

PyTypeObject PyUnicode_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "str",              /* tp_name */
    sizeof(PyUnicodeObject),        /* tp_size */
    ……
    unicode_repr,           /* tp_repr */
    &unicode_as_number,         /* tp_as_number */
    &unicode_as_sequence,       /* tp_as_sequence */
    &unicode_as_mapping,        /* tp_as_mapping */
    (hashfunc) unicode_hash,        /* tp_hash*/
    ……
};

As we can see, str in Python 3 really is the former unicode type.

Creating String Objects

PyObject *PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
{
    PyObject *unicode;
    Py_UCS4 maxchar = 0;
    Py_ssize_t num_surrogates;
​
    if (u == NULL)
        return (PyObject*)_PyUnicode_New(size);
​
    /* If the Unicode data is known at construction time, we can apply
       some optimizations which share commonly used objects. */

    /* Optimization for empty strings */
    if (size == 0)
        _Py_RETURN_UNICODE_EMPTY();
​
    /* Single character Unicode objects in the Latin-1 range are
       shared when using this constructor */
    if (size == 1 && (Py_UCS4)*u < 256)
        return get_latin1_char((unsigned char)*u);
​
    /* If not empty and not single character, copy the Unicode data
       into the new object */
    if (find_maxchar_surrogates(u, u + size,
                                &maxchar, &num_surrogates) == -1)
        return NULL;
​
    unicode = PyUnicode_New(size - num_surrogates, maxchar);
    if (!unicode)
        return NULL;
​
    switch (PyUnicode_KIND(unicode)) {
    case PyUnicode_1BYTE_KIND:
        _PyUnicode_CONVERT_BYTES(Py_UNICODE, unsigned char,
                                u, u + size, PyUnicode_1BYTE_DATA(unicode));
        break;
    case PyUnicode_2BYTE_KIND:
#if Py_UNICODE_SIZE == 2
        memcpy(PyUnicode_2BYTE_DATA(unicode), u, size * 2);
#else
        _PyUnicode_CONVERT_BYTES(Py_UNICODE, Py_UCS2,
                                u, u + size, PyUnicode_2BYTE_DATA(unicode));
#endif
        break;
    case PyUnicode_4BYTE_KIND:
#if SIZEOF_WCHAR_T == 2
        /* This is the only case which has to process surrogates, thus
           a simple copy loop is not enough and we need a function. */
        unicode_convert_wchar_to_ucs4(u, u + size, unicode);
#else
        assert(num_surrogates == 0);
        memcpy(PyUnicode_4BYTE_DATA(unicode), u, size * 4);
#endif
        break;
    default:
        assert(0 && "Impossible state");
    }
​
    return unicode_result(unicode);
}

PyObject *PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)
{
    PyObject *obj;
    PyCompactUnicodeObject *unicode;
    void *data;
    enum PyUnicode_Kind kind;
    int is_sharing, is_ascii;
    Py_ssize_t char_size;
    Py_ssize_t struct_size;
​
    /* Optimization for empty strings */
    if (size == 0 && unicode_empty != NULL) {
        Py_INCREF(unicode_empty);
        return unicode_empty;
    }
​
    is_ascii = 0;
    is_sharing = 0;
    struct_size = sizeof(PyCompactUnicodeObject);
    if (maxchar < 128) {
        kind = PyUnicode_1BYTE_KIND;
        char_size = 1;
        is_ascii = 1;
        struct_size = sizeof(PyASCIIObject);
    }
    else if (maxchar < 256) {
        kind = PyUnicode_1BYTE_KIND;
        char_size = 1;
    }
    else if (maxchar < 65536) {
        kind = PyUnicode_2BYTE_KIND;
        char_size = 2;
        if (sizeof(wchar_t) == 2)
            is_sharing = 1;
    }
    else {
        if (maxchar > MAX_UNICODE) {
            PyErr_SetString(PyExc_SystemError,
                            "invalid maximum character passed to PyUnicode_New");
            return NULL;
        }
        kind = PyUnicode_4BYTE_KIND;
        char_size = 4;
        if (sizeof(wchar_t) == 4)
            is_sharing = 1;
    }
​
    /* Ensure we won't overflow the size. */
    if (size < 0) {
        PyErr_SetString(PyExc_SystemError,
                        "Negative size passed to PyUnicode_New");
        return NULL;
    }
    if (size > ((PY_SSIZE_T_MAX - struct_size) / char_size - 1))
        return PyErr_NoMemory();
​
    /* Duplicated allocation code from _PyObject_New() instead of a call to
     * PyObject_New() so we are able to allocate space for the object and
     * it's data buffer.
     */
    obj = (PyObject *) PyObject_MALLOC(struct_size + (size + 1) * char_size);
    if (obj == NULL)
        return PyErr_NoMemory();
    obj = PyObject_INIT(obj, &PyUnicode_Type);
    if (obj == NULL)
        return NULL;
​
    unicode = (PyCompactUnicodeObject *)obj;
    if (is_ascii)
        data = ((PyASCIIObject*)obj) + 1;
    else
        data = unicode + 1;
    _PyUnicode_LENGTH(unicode) = size;
    _PyUnicode_HASH(unicode) = -1;
    _PyUnicode_STATE(unicode).interned = 0;
    _PyUnicode_STATE(unicode).kind = kind;
    _PyUnicode_STATE(unicode).compact = 1;
    _PyUnicode_STATE(unicode).ready = 1;
    _PyUnicode_STATE(unicode).ascii = is_ascii;
    if (is_ascii) {
        ((char*)data)[size] = 0;
        _PyUnicode_WSTR(unicode) = NULL;
    }
    else if (kind == PyUnicode_1BYTE_KIND) {
        ((char*)data)[size] = 0;
        _PyUnicode_WSTR(unicode) = NULL;
        _PyUnicode_WSTR_LENGTH(unicode) = 0;
        unicode->utf8 = NULL;
        unicode->utf8_length = 0;
    }
    else {
        unicode->utf8 = NULL;
        unicode->utf8_length = 0;
        if (kind == PyUnicode_2BYTE_KIND)
            ((Py_UCS2*)data)[size] = 0;
        else /* kind == PyUnicode_4BYTE_KIND */
            ((Py_UCS4*)data)[size] = 0;
        if (is_sharing) {
            _PyUnicode_WSTR_LENGTH(unicode) = size;
            _PyUnicode_WSTR(unicode) = (wchar_t *)data;
        }
        else {
            _PyUnicode_WSTR_LENGTH(unicode) = 0;
            _PyUnicode_WSTR(unicode) = NULL;
        }
    }
#ifdef Py_DEBUG
    unicode_fill_invalid((PyObject*)unicode, 0);
#endif
    assert(_PyUnicode_CheckConsistency((PyObject*)unicode, 0));
    return obj;
}

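Let's start with the flow of PyUnicode_FromUnicode. If u is a null pointer, it calls _PyUnicode_New(size) and directly returns a PyUnicodeObject of the given size that has no value yet. If size == 0, it returns the shared empty string via _Py_RETURN_UNICODE_EMPTY(). If the input is a single character in the Latin-1 range, it returns the cached object for that character; this is similar to the small-integer pool discussed in the previous chapter, since there is a pool of single-character strings here too. Otherwise, it creates a new object and copies the data into it.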
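The shared empty string and the Latin-1 single-character pool can be observed through object identity. This sketch relies on CPython implementation details and is not guaranteed by the language; chr() and str() are used here simply as convenient ways to reach the same caches from Python code.

```python
# Two independently requested single characters in the Latin-1 range
first = chr(0xE9)
second = chr(0xE9)
same_latin1 = first is second

# Two independently produced empty strings
empty_a = "".join([])
empty_b = str()
same_empty = empty_a is empty_b

# A single character outside Latin-1 is not served from that pool,
# so CPython would normally build a fresh object each time.
wide_a = chr(0x4E2D)
wide_b = chr(0x4E2D)
same_wide = wide_a is wide_b

print("Latin-1 char shared:", same_latin1)
print("empty string shared:", same_empty)
print("wide char shared:   ", same_wide)
```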

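The flow of PyUnicode_New is easy to follow: given the size and maxchar, it decides whether to lay the object out as a PyASCIIObject (maxchar < 128) or as a PyCompactUnicodeObject, and which kind (1, 2, or 4 bytes per character) to use. It always creates a compact object; the legacy PyUnicodeObject layout comes from _PyUnicode_New instead.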
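The branch on maxchar in PyUnicode_New can be restated as a small Python model. This is only an illustrative sketch: choose_layout is a hypothetical helper, it derives maxchar from the text itself, and it ignores the empty-string shortcut and the overflow checks of the C function.

```python
def choose_layout(text):
    """Return (structure name, bytes per character) following PyUnicode_New's thresholds."""
    maxchar = max(map(ord, text), default=0)
    if maxchar < 128:
        return ("PyASCIIObject", 1)
    elif maxchar < 256:
        return ("PyCompactUnicodeObject", 1)
    elif maxchar < 65536:
        return ("PyCompactUnicodeObject", 2)
    else:
        return ("PyCompactUnicodeObject", 4)


for sample in ("abc", "caf\u00e9", "\u4e2d\u6587", "\U0001f600"):
    structure, char_size = choose_layout(sample)
    print(ascii(sample), structure, char_size)
```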

The Intern Mechanism

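We mentioned the intern mechanism earlier. It means that when a new string object is created and a string object with the same value already exists, a reference to the existing object is returned rather than the newly created one. Where does Python look? It maintains a dict named interned whose keys are the string values. The mechanism does not apply to every string object, though. Roughly speaking, Python interns strings that follow its identifier naming rules, that is, strings consisting only of letters, digits, and underscores. The standard library provides a function that forces interning of a string, sys.intern(); its documentation reads: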

Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.

Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.
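A short sketch of sys.intern() in use: two equal strings built at runtime start out as separate objects, and interning both yields one shared object. The strings are assembled with join so that the compiler cannot merge them as constants; the variable names are illustrative only.

```python
import sys

# Built at runtime, so these are two distinct objects with equal values.
a = " ".join(["hello", "world"])
b = " ".join(["hello", "world"])
distinct_before = a is not b

# Keep the return values: they are the canonical interned object.
a = sys.intern(a)
b = sys.intern(b)
same_after = a is b

print("distinct before interning:", distinct_before)
print("identical after interning:", same_after)
```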

The mechanism itself is in the following code:

PyObject *PyUnicode_InternFromString(const char *cp)
{
    PyObject *s = PyUnicode_FromString(cp);
    if (s == NULL)
        return NULL;
    PyUnicode_InternInPlace(&s);
    return s;
}

void PyUnicode_InternInPlace(PyObject **p)
{
    PyObject *s = *p;
    PyObject *t;
#ifdef Py_DEBUG
    assert(s != NULL);
    assert(_PyUnicode_CHECK(s));
#else
    if (s == NULL || !PyUnicode_Check(s))
        return;
#endif
    /* If it's a subclass, we don't really know what putting
       it in the interned dict might do. */
    if (!PyUnicode_CheckExact(s))
        return;
    if (PyUnicode_CHECK_INTERNED(s))
        return;
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }
    Py_ALLOW_RECURSION
    t = PyDict_SetDefault(interned, s, s);
    Py_END_ALLOW_RECURSION
    if (t == NULL) {
        PyErr_Clear();
        return;
    }
    if (t != s) {
        Py_INCREF(t);
        Py_SETREF(*p, t);
        return;
    }
    /* The two references in interned are not counted by refcnt.
       The deallocator will take care of this */
    Py_REFCNT(s) -= 2;
    _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
}

When Python calls PyUnicode_InternFromString, it gets back an interned object; the actual work is done by PyUnicode_InternInPlace.

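In fact, even when Python is going to intern a string, it first creates a PyUnicodeObject and only then checks whether an object with the same value already exists. If one does, the object stored in interned is returned, and the newly created one is reclaimed because its reference count drops to zero.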
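The "create first, then look up" behaviour can be modelled with an ordinary dict, mirroring the PyDict_SetDefault(interned, s, s) call. This is a toy sketch: intern_in_place and the module-level interned dict here are stand-ins written for illustration, and the reference-count adjustment and the mortal/immortal state are not modelled.

```python
interned = {}  # toy stand-in for CPython's internal interned dict


def intern_in_place(s):
    """Return the canonical object for s, registering s if it is the first of its value."""
    return interned.setdefault(s, s)


# Both strings are fully created before any lookup happens.
first = "".join(["sp", "am"])
second = "".join(["sp", "am"])

canonical_first = intern_in_place(first)
canonical_second = intern_in_place(second)

print(canonical_first is first)    # first becomes the canonical object
print(canonical_second is first)   # second is replaced by the existing object
print(len(interned))
```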

Interned objects fall into two classes, mortal and immortal. The former can be reclaimed; the latter never are and live as long as the Python virtual machine.

Efficiency Issues Related to PyUnicodeObject

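The original book, 《Python原始碼剖析》, says that concatenating strings with + is extremely inefficient because every concatenation creates a new string object, and it recommends the string join method instead. In my tests under Python 3.6, concatenating with + took about as long as join. That is only a test in my particular environment, of course; I do not yet know the real answer.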
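A comparison of this kind can be reproduced with timeit. The script below is one possible setup, not the measurement used in this article; the part count and repetition count are arbitrary assumptions, and results will vary with the interpreter version and machine, so no expected timings are given.

```python
import timeit

PARTS = [str(i) for i in range(10_000)]


def concat_with_plus(parts):
    """Build the result by repeated += on a single string."""
    result = ""
    for part in parts:
        result += part
    return result


def concat_with_join(parts):
    """Build the result in one call to str.join."""
    return "".join(parts)


if __name__ == "__main__":
    for func in (concat_with_plus, concat_with_join):
        seconds = timeit.timeit(lambda f=func: f(PARTS), number=100)
        print(f"{func.__name__}: {seconds:.4f} s for 100 runs")
```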

Summary

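In Python 3, str is implemented as unicode underneath, which neatly resolves the many headaches Python 2 had with non-ASCII strings. At the low level, Python treats ASCII and non-ASCII strings differently, and, with UTF-8 being compatible with ASCII, this balances performance against simplicity. Immutable objects in Python often have something like the intern mechanism, which spares Python unnecessary memory use, but the real implementation still strikes a balance: interning everything indiscriminately could add extra computation and lookups, which would defeat the purpose of optimizing performance.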

