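Python Source Code Analysis: PyDictObject

Published 2016-09-27

CPython is currently the most widely used Python implementation. This post walks through how it implements the dictionary in its C source.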

Data structures

1. PyDictObject

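PyDictObject is the C object behind a Python dictionary. In essence it bundles the basic parts of a hash table, of which there are three:

A table (which can be thought of as an array)
A hash function
The individual items in the table: entries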

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;  /* # Active + # Dummy */
    Py_ssize_t ma_used;  /* # Active */

    /* The table contains ma_mask + 1 slots, and that's a power of 2.
     * We store the mask instead of the size because the mask is more
     * frequently needed.
     */
    Py_ssize_t ma_mask;

    /* ma_table points to ma_smalltable for small tables, else to
     * additional malloc'ed memory.  ma_table is never NULL!  This rule
     * saves repeated runtime null-tests in the workhorse getitem and
     * setitem calls.
     */
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

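PyDictObject begins with PyObject_HEAD, the macro that every Python object starts with. It expands to a reference count (ob_refcnt) and a pointer to the object's type (ob_type); builds compiled with Py_TRACE_REFS additionally get the two pointers of a doubly linked list of all live objects. Its main purpose is memory management: the reference count decides when the object can be reclaimed.
ma_table and ma_smalltable both correspond to the table of the hash table, so why are there two? CPython itself uses PyDictObject heavily, yet most dicts hold only a few items. For convenience, every PyDictObject therefore embeds room for PyDict_MINSIZE entries (ma_smalltable). ma_table initially points to ma_smalltable; once the number of entries passes a threshold, the table is resized and ma_table points to separately malloc'ed memory instead.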

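Py_ssize_t ma_mask is used to turn a hash value into a table index; its value is the table length minus one. Understanding this field is essential, because it determines how Python's hashing scheme maps keys to slots. The mapping itself is very simple: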

ma_mask = len(table) - 1   # the table length must be a power of 2, so ma_mask is always odd
index = key & ma_mask      # same as index = key % len(table); index is the slot in the table. Where key comes from is the crucial part, covered later
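
To make the mask trick concrete, here is a minimal sketch in Python. The helper name slot_index is made up for illustration and is not part of CPython. It checks that masking and taking the remainder agree whenever the table size is a power of two. Note that Python's % always returns a non-negative result, whereas C has no such guarantee for negative operands, so the equivalence for negative hashes holds here only because of Python's semantics.

```python
def slot_index(hash_value, table_size):
    """Map a hash value to a slot by masking (hypothetical helper)."""
    # A power of two has exactly one bit set, so size & (size - 1) == 0.
    if table_size <= 0 or table_size & (table_size - 1):
        raise ValueError("table_size must be a power of two")
    mask = table_size - 1          # plays the role of ma_mask
    return hash_value & mask

# Masking and modulo agree for every value when the size is a power of two.
for h in (0, 7, 8, 12345, -3):
    assert slot_index(h, 8) == h % 8

print(slot_index(12345, 8))        # 1
```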

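ma_lookup is the function used to find the entry for a key. If the hash function is this simple, why is a dedicated lookup function needed? Because an entry in the table is not a bare number or string but a PyDictEntry with its own life cycle, so a lookup is a little more involved: it has to deal with collisions, compare keys and skip over deleted entries. Keeping it as a function pointer also lets CPython use a faster routine (lookdict_string) for dicts whose keys are all strings.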

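ma_fill and ma_used: as mentioned above, a PyDictEntry has its own life cycle with three states: unused, active and dummy. ma_fill is the number of slots that are occupied or were occupied before (active + dummy). ma_used is the number of active slots, that is, slots currently holding an item. A dummy slot is one whose item was inserted and later deleted.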

#code: python
d = {'name': 'wxg', 'age': 23, 'sex': 'male'}   # unused=5 (PyDict_MINSIZE defaults to 8), active=3, dummy=0
del d['sex']   # unused=5, active=2, dummy=1
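
The counters cannot be inspected from Python, so the following is only a toy model of the three states. The class TinyTable and the sentinels UNUSED and DUMMY are invented for this sketch, and it uses plain linear probing rather than CPython's real probe sequence. It reproduces the numbers in the example above.

```python
UNUSED = object()   # slot has never held a key (hypothetical sentinel)
DUMMY = object()    # slot held a key that was deleted (hypothetical sentinel)

class TinyTable:
    """Toy open-addressing table tracking fill and used like ma_fill / ma_used."""

    def __init__(self, size=8):
        self.slots = [UNUSED] * size
        self.mask = size - 1
        self.fill = 0   # active + dummy
        self.used = 0   # active only

    def _find(self, key):
        """Return the index of key, else the first reusable slot (None if full)."""
        i = hash(key) & self.mask
        free = None
        for _ in range(len(self.slots)):
            slot = self.slots[i]
            if slot is UNUSED:
                return i if free is None else free
            if slot is DUMMY:
                if free is None:
                    free = i          # remember it, but keep looking for key
            elif slot[0] == key:
                return i
            i = (i + 1) & self.mask   # linear probing, for simplicity only
        return free

    def insert(self, key, value):
        i = self._find(key)
        if i is None:
            raise RuntimeError("table full; this toy never resizes")
        if self.slots[i] is UNUSED:
            self.fill += 1
            self.used += 1
        elif self.slots[i] is DUMMY:
            self.used += 1            # reusing a dummy slot leaves fill unchanged
        self.slots[i] = (key, value)

    def delete(self, key):
        i = self._find(key)
        if i is None or self.slots[i] is UNUSED or self.slots[i] is DUMMY:
            raise KeyError(key)
        self.slots[i] = DUMMY         # the slot stays counted in fill
        self.used -= 1

    def counts(self):
        """Return (unused, active, dummy)."""
        return (len(self.slots) - self.fill, self.used, self.fill - self.used)

t = TinyTable()
for k, v in (('name', 'wxg'), ('age', 23), ('sex', 'male')):
    t.insert(k, v)
print(t.counts())   # (5, 3, 0)
t.delete('sex')
print(t.counts())   # (5, 2, 1)
```

A deleted slot cannot simply go back to unused: a later lookup for a key that collided with it would stop too early at the empty slot. That is why the dummy state exists.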

2. PyDictEntry

PyDictEntry is the concrete item stored in the table.

typedef struct {
    /* Cached hash code of me_key.  Note that hash codes are C longs.
     * We have to use Py_ssize_t instead because dict_popitem() abuses
     * me_hash to hold a search finger.
     */
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

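me_hash is the cached hash of the key, me_key is the stored key (it can be of any hashable type, since everything in Python is an object and every object is a PyObject), and me_value is the stored value.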

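Analysis of the hash function

To understand a hash table implementation, the most important things are how its hash function works and how collisions are resolved.

1. How the hash function is implemented

The hash function was introduced above: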

# suppose d = {'name': 'wxg'}
key = get_key('name')      # how get_key works is explained below; this is the crucial part
ma_mask = len(table) - 1   # the table length must be a power of 2, so ma_mask is always odd
index = key & ma_mask

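PyDictObject's own hash function is this simple because key is already a hash value: get_key just obtains the hash of an object (a string, an integer or something more complex). In the CPython source this is PyObject_Hash, defined as follows: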

long
PyObject_Hash(PyObject *v)
{
 PyTypeObject *tp = v->ob_type;
 /* 1. Get the object's type, then call that type's tp_hash to compute the hash */
 if (tp->tp_hash != NULL)
     return (*tp->tp_hash)(v);
 /* To keep to the general practice that inheriting
  * solely from object in C code should work without
  * an explicit call to PyType_Ready, we implicitly call
  * PyType_Ready here and then check the tp_hash slot again
  */
 if (tp->tp_dict == NULL) {
     if (PyType_Ready(tp) < 0)
         return -1;
     if (tp->tp_hash != NULL)
         return (*tp->tp_hash)(v);
 }
 /* 2. With no tp_hash and no comparison function either, use the object's address as the hash */
 if (tp->tp_compare == NULL && RICHCOMPARE(tp) == NULL) {
     return _Py_HashPointer(v); /* Use address as hash value */
 }
 /* If there's a cmp but no hash defined, the object can't be hashed */
 return PyObject_HashNotImplemented(v);
}
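
The same dispatch can be observed from Python code. The sketch below is only an illustration: Point, Plain and is_hashable are made-up names. A type that defines its own __hash__ has it called through the built-in hash(), a type that defines neither a hash nor a comparison falls back to the default identity-based hash, and a type such as list that is marked unhashable raises TypeError.

```python
class Point:
    """Hypothetical type that supplies its own hash (fills the tp_hash slot)."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Equal points must hash equally, so derive the hash from the fields.
        return hash((self.x, self.y))


class Plain:
    """Defines neither a hash nor a comparison: the default hash is used."""


def is_hashable(obj):
    """Return True if hash(obj) succeeds (hypothetical helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(hash(Point(1, 2)) == hash((1, 2)))   # True: our __hash__ was called
print(is_hashable(Plain()))                # True: default hash
print(is_hashable([]))                     # False: list is unhashable
```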

Example: how a string object gets its hash

To see how a string object's hash is obtained, first look at the definition of the string object:

typedef struct {
 PyObject_VAR_HEAD
 long ob_shash;
 int ob_sstate;
 char ob_sval[1];

 /* Invariants:
  *     ob_sval contains space for 'ob_size+1' elements.
  *     ob_sval[ob_size] == 0.
  *     ob_shash is the hash of the string or -1 if not computed yet.
  *     ob_sstate != 0 iff the string object is in stringobject.c's
  *       'interned' dictionary; in this case the two references
  *       from 'interned' to this object are *not counted* in ob_refcnt.
  */
} PyStringObject;

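Every string object has an ob_shash field, which caches the string's hash (-1 means it has not been computed yet). The value is computed through the type's tp_hash slot; for details see the static long string_hash() function in Objects/stringobject.c.

In summary: before a key is placed in the table, its hash is obtained first. If the object's type implements tp_hash, that function is called; if it does not (and the type defines no comparison either), the object's memory address is used as the hash. That value is then ANDed with ma_mask, which is the same as taking it modulo the table length, to get the object's position in the table.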
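
The caching idea behind ob_shash can be sketched in a few lines of Python. CachedHashString is an invented class, and it delegates the actual hashing to the built-in hash() instead of reimplementing string_hash(); only the "compute once, then reuse" pattern with -1 as the sentinel is the point here.

```python
class CachedHashString:
    """Toy stand-in for PyStringObject that caches its hash like ob_shash."""

    NOT_COMPUTED = -1   # same sentinel meaning as in the struct comment

    def __init__(self, value):
        self.value = value
        self.cached_hash = self.NOT_COMPUTED
        self.computations = 0   # counts how often the hash was really computed

    def __hash__(self):
        if self.cached_hash == self.NOT_COMPUTED:
            self.computations += 1
            # Python's hash() never returns -1, so -1 is safe as a sentinel.
            self.cached_hash = hash(self.value)
        return self.cached_hash


s = CachedHashString('name')
hash(s)
hash(s)
print(s.computations)   # 1: the second call reused the cached value
```

Because strings are immutable, the cached value can never go stale, which is what makes this optimization safe for dictionary keys.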

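2. Resolving collisions

The main ways to resolve hash collisions are:

Open addressing,
Rehashing,
Separate chaining, and so on.

Python's dictionary uses open addressing, and the next slot to probe is found by re-hashing the current index with the following recurrence: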

 j = (5*j) + 1 + perturb;
 perturb >>= PERTURB_SHIFT;   /* PERTURB_SHIFT defaults to 5 */
 use j % 2**i as the next table index;

Here perturb starts out as the object's hash value, and 2**i is the table size.
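
The recurrence is easier to follow when run. Below is a minimal sketch in Python; the function name probe_sequence is made up, and it only handles non-negative hashes, whereas the C code works on an unsigned value. It lists the slots that would be probed in order.

```python
PERTURB_SHIFT = 5

def probe_sequence(hash_value, table_size, count):
    """Return the first `count` slot indices probed for a non-negative hash."""
    if hash_value < 0:
        raise ValueError("this sketch only handles non-negative hashes")
    mask = table_size - 1          # table_size must be a power of two
    j = hash_value & mask          # first probe: the plain masked hash
    perturb = hash_value
    indices = [j]
    while len(indices) < count:
        j = (5 * j) + 1 + perturb  # mix the upper hash bits into the index
        perturb >>= PERTURB_SHIFT  # perturb eventually reaches 0
        indices.append(j & mask)   # j % 2**i
    return indices

print(probe_sequence(0, 8, 8))     # [0, 1, 6, 7, 4, 5, 2, 3]
```

Once perturb has been shifted down to 0, the recurrence is just j = 5*j + 1 modulo the table size, which visits every slot exactly once per cycle. So a probe is guaranteed to reach a free slot as long as one exists.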

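3. Resizing the table

When does the table need to be resized? The performance of a hash table depends mainly on its load factor.

The load factor of a hash table is defined as: α = number of entries stored in the table / length of the table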

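In Python's dictionary implementation the table is resized once the load factor reaches 2/3. Resizing allocates a new table of the newly computed size and then re-inserts the entries of the old table into it, recomputing each position.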

    /* If fill >= 2/3 size, adjust size.  Normally, this doubles or
     * quadruples the size, but it's also possible for the dict to shrink
     * (if ma_fill is much larger than ma_used, meaning a lot of dict
     * keys have been deleted).
     *
     * Quadrupling the size improves average dictionary sparseness
     * (reducing collisions) at the cost of some memory and iteration
     * speed (which loops over every possible entry).  It also halves
     * the number of expensive resize operations in a growing dictionary.
     *
     * Very large dictionaries (over 50K items) use doubling instead.
     * This may help applications with severe memory constraints.
     */

Note that the fill mentioned above is ma_fill (ma_fill = active + dummy). In other words, the load factor also counts deleted entries: a slot whose key was deleted still counts as filled.
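
The policy described in the comment can be sketched as two small functions. The names needs_resize and new_table_size are invented. The 2/3 threshold, the factor of 4 or 2 and the 50K cutoff come from the comment quoted above, while the rule "smallest power of two strictly greater than factor * used" is an assumption based on how the Python 2 source rounds the requested size.

```python
PYDICT_MINSIZE = 8

def needs_resize(fill, size):
    """True once fill / size >= 2/3, using integers to avoid rounding."""
    return fill * 3 >= size * 2

def new_table_size(used):
    """Pick the new size from the number of ACTIVE entries (assumed rounding)."""
    factor = 2 if used > 50000 else 4   # huge dicts only double
    min_used = factor * used
    size = PYDICT_MINSIZE
    while size <= min_used:             # smallest power of two > min_used
        size <<= 1
    return size

print(needs_resize(5, 8))     # False: 5/8 is below 2/3
print(needs_resize(6, 8))     # True:  6/8 has reached 2/3
print(new_table_size(6))      # 32
print(new_table_size(1))      # 8: few live keys, so the table can shrink
```

The trigger looks at fill, but the new size is computed from used. That difference is what allows a table that is mostly dummy entries to shrink instead of grow.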

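Creating PyDictObject objects, inserting and deleting

This part is fairly straightforward and can be read directly from the source; it will be analyzed in a later post.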
