8. Python 3 Source Code: Code Objects and pyc Files

Posted by whj0709 on 2018-06-06

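8.1. How a Python Program Is Executed

Whenever the Python interpreter runs a Python program file, its first step is to compile the Python source code in that file. The main product of compilation is a sequence of Python byte code, which is handed to Python's virtual machine. The virtual machine then executes the byte code instructions one at a time, in order, and that is how the program runs.

From the compiler's point of view, the PyCodeObject is the real result of compilation; a pyc file is merely the on-disk representation of that object. The two are different forms of the same thing: the result of compiling a source file.

While a program is running, the compilation result lives in memory as a PyCodeObject. For an imported module, the result is also saved to a pyc file. The next time the same module is loaded, Python rebuilds the in-memory PyCodeObject directly from the pyc file instead of compiling the source again.

With the overall flow understood, it is entirely feasible to write a tool that parses pyc files generated by Python 3.7. The figure below shows the content of a pyc file organized as JSON:

The tool exists only to deepen understanding of the flow. In practice, Python's dis module produces more detailed and readable output, as the figure below shows: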
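
As a small illustration of the dis module mentioned above, the sketch below compiles a made-up two-line snippet and lists its instructions. The sample source and the file name "<sample>" are assumptions for illustration only, and the exact opcodes depend on the Python version.

```python
import dis

# Hypothetical sample source; any small snippet would do.
source = "x = 1\nprint(x + 1)\n"

# compile() returns the code object that the virtual machine executes.
code = compile(source, "<sample>", "exec")

# Collect the opcode names in execution order.
opnames = [ins.opname for ins in dis.get_instructions(code)]
print(opnames)

# dis.dis prints the human-readable listing.
dis.dis(code)
```

The listing differs between Python versions, so treat the output as something to explore rather than a fixed reference.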

8.2. PyCodeObject Source Code

// code.h
typedef struct {
    PyObject_HEAD
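    int co_argcount;            /* number of arguments, excluding *args */
    int co_kwonlyargcount;      /* number of keyword-only arguments */
    int co_nlocals;             /* number of local variables */
    int co_stacksize;           /* number of evaluation stack entries needed */
    int co_flags;               /* CO_* flags */
    int co_firstlineno;         /* first line number in the source */
    PyObject *co_code;          /* instruction opcodes (bytes) */
    PyObject *co_consts;        /* constants used */
    PyObject *co_names;         /* names used */
    PyObject *co_varnames;      /* local variable names */
    PyObject *co_freevars;      /* free variable names */
    PyObject *co_cellvars;      /* cell variable names */
    Py_ssize_t *co_cell2arg;    /* maps cell variables that are arguments */
    PyObject *co_filename;      /* file the code was loaded from */
    PyObject *co_name;          /* name of the code block, e.g. function name */
    PyObject *co_lnotab;        /* mapping from byte code offsets to line numbers */
    void *co_zombieframe;       /* frame kept for reuse */
    PyObject *co_weakreflist;   /* weak reference support */
    void *co_extra;             /* extra data attached to the code object */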
} PyCodeObject;
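  • Code Block:
    When the Python compiler compiles source code, it creates one PyCodeObject for each Code Block in the code. Entering a new namespace, or in other words a new scope, means entering a new Code Block. For example, the code below contains three code blocks: one for the whole test.py file, one for class A, and one for def Fun.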
# test.py
class A:
    pass

def Fun():
    pass

a = A()
Fun()
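  • Namespace:
    A namespace is the context in which a symbol lives, and the meaning of a symbol depends on its namespace. More concretely, the value bound to a variable name is not fixed in Python; it is determined through the namespace. Each Code Block corresponds to one namespace and to one PyCodeObject.
  • The code object in Python:
    At the Python level there is an object that corresponds to the C-level PyCodeObject: the code object. It is a thin wrapper around the C-level PyCodeObject, and through it we can access the individual fields of the PyCodeObject.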
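
The relationship between code blocks and code objects can be explored from Python itself. The sketch below inlines the test.py source from this section (so that it is self-contained) and looks for nested code objects among the constants of the module-level code object. Using compile() directly, rather than importing the file, is a choice made here for illustration.

```python
import types

# The test.py source from this section, inlined so the sketch is self-contained.
source = '''
class A:
    pass

def Fun():
    pass

a = A()
Fun()
'''

module_code = compile(source, "test.py", "exec")

# Nested code blocks are stored as constants of the enclosing code object.
nested = [c for c in module_code.co_consts if isinstance(c, types.CodeType)]

print(module_code.co_name)            # name of the module-level block
print([c.co_name for c in nested])    # names of the nested blocks
print(module_code.co_names)           # names referenced at module level
```

Each attribute read here (co_name, co_consts, co_names) maps onto the field of the same name in the C struct.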

8.3. Generating pyc Files

# pyc_generator.py
import imp  # deprecated since Python 3.4 and removed in 3.12; fine for the Python 3.7 used here
import sys

def generate_pyc(name):
    fp, pathname, description = imp.find_module(name)
    try:
        imp.load_module(name, fp, pathname, description)
    finally:
        if fp:
            fp.close()

if __name__ == '__main__':
    generate_pyc(sys.argv[1])

Entering the following command on the command line generates the pyc file:

$ ./python3.7 pyc_generator.py test
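
The imp module used above no longer exists in recent Python versions. As an alternative sketch (not the approach this article follows), the standard py_compile module can produce a pyc file directly. The temporary directory, the file names, and the inlined source are assumptions for illustration.

```python
import os
import py_compile
import tempfile

# Write a throwaway module to a temporary directory (hypothetical file name).
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "test.py")
with open(source_path, "w") as f:
    f.write("class A:\n    pass\n\ndef Fun():\n    pass\n")

# Choose the output path explicitly instead of relying on __pycache__.
pyc_path = os.path.join(workdir, "test.pyc")
py_compile.compile(source_path, cfile=pyc_path, doraise=True)

print(pyc_path, os.path.getsize(pyc_path))
```

Passing doraise=True makes compilation errors surface as exceptions instead of being printed and ignored.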

8.3.1. Call Flow for Generating the PyCodeObject and the pyc File

Starting from imp.load_module in the pyc_generator file above, the function call sequence is as follows:

// imp.py
load_module
=>load_source

// _bootstrap.py[1]
=>_load
=>_load_unlocked

// _bootstrap_external.py
=> exec_module
=> get_code

The get_code method calls source_to_code to create the PyCodeObject, calls _code_to_timestamp_pyc to turn the PyCodeObject into binary data, and calls _cache_bytecode to write that binary data to a file.
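
The get_code step can be exercised without going through a full import. The sketch below assumes a throwaway module named demo_mod (a hypothetical name) and calls get_code on the public importlib.machinery.SourceFileLoader class. Whether a pyc file is actually written depends on the environment (for example sys.dont_write_bytecode), so the sketch only reports it.

```python
import importlib.util
import os
import sys
import tempfile
import types
from importlib.machinery import SourceFileLoader

# Throwaway module in a temporary directory (hypothetical name "demo_mod").
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "demo_mod.py")
with open(source_path, "w") as f:
    f.write("VALUE = 6 * 7\n")

# get_code compiles the source and, when bytecode writing is enabled,
# also caches the result as a pyc file.
loader = SourceFileLoader("demo_mod", source_path)
code = loader.get_code("demo_mod")
print(isinstance(code, types.CodeType))

# Where the cached pyc would live for this source file.
pyc_path = importlib.util.cache_from_source(source_path)
print("pyc written:", os.path.exists(pyc_path),
      "dont_write_bytecode:", sys.dont_write_bytecode)

# Execute the code object in a fresh namespace.
namespace = {}
exec(code, namespace)
print(namespace["VALUE"])
```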

Note that a real Python build does not call the _load method in _bootstrap.py (marked [1] in the call sequence above). In Lib/importlib/__init__.py:

# __init__.py
try:
    import _frozen_importlib as _bootstrap
except ImportError:
    from . import _bootstrap
    _bootstrap._setup(sys, _imp)
else:
    # do sth

try:
    import _frozen_importlib_external as _bootstrap_external
except ImportError:
    from . import _bootstrap_external
    _bootstrap_external._setup(_bootstrap)
    _bootstrap._bootstrap_external = _bootstrap_external
else:
   # do sth

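As this shows, the _load method that actually runs is the one in _frozen_importlib, not the one in _bootstrap. The content of that library is defined in Python/importlib.h.

I am not sure why it is done this way. A likely reason is that the import machinery must already be available before anything can be imported from the file system, so a frozen copy of _bootstrap is compiled into the interpreter. For the analysis of the overall flow, _bootstrap is used in its place, because that makes the source easier to read.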
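
The substitution can be observed at runtime. The sketch below assumes CPython, where importlib/__init__.py binds the frozen module to the name _bootstrap; other implementations may behave differently.

```python
import importlib
import sys

# On CPython, importlib/__init__.py binds the frozen module to the name _bootstrap.
frozen = sys.modules.get("_frozen_importlib")
same_object = importlib._bootstrap is frozen

print(same_object)
print(importlib._bootstrap.__name__)
```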

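The following sections analyze in detail how the PyCodeObject is generated, how it is converted to binary data, and how that data is written to a file.

8.3.2. Source Code for Generating the PyCodeObject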

// _bootstrap_external.py
source_to_code

// _bootstrap.py
=>_call_with_frames_removed

// bltinmodule.c
=> builtin_compile_impl

The C source of builtin_compile_impl is as follows:

// bltinmodule.c
static PyObject *
builtin_compile_impl(PyObject *module, PyObject *source, PyObject *filename, const char *mode, int flags, int dont_inherit, int optimize)
{
    PyObject *source_copy;
    const char *str;
    int compile_mode = -1;
    int is_ast;
    PyCompilerFlags cf;
    int start[] = {Py_file_input, Py_eval_input, Py_single_input};
    PyObject *result;

    cf.cf_flags = flags | PyCF_SOURCE_IS_UTF8;

    if (flags &
        ~(PyCF_MASK | PyCF_MASK_OBSOLETE | PyCF_DONT_IMPLY_DEDENT | PyCF_ONLY_AST))
    {
        PyErr_SetString(PyExc_ValueError,
                        "compile(): unrecognised flags");
        goto error;
    }
    /* XXX Warn if (supplied_flags & PyCF_MASK_OBSOLETE) != 0? */

    if (optimize < -1 || optimize > 2) {
        PyErr_SetString(PyExc_ValueError,
                        "compile(): invalid optimize value");
        goto error;
    }

    if (!dont_inherit) {
        PyEval_MergeCompilerFlags(&cf);
    }

    if (strcmp(mode, "exec") == 0)
        compile_mode = 0;
    else if (strcmp(mode, "eval") == 0)
        compile_mode = 1;
    else if (strcmp(mode, "single") == 0)
        compile_mode = 2;
    else {
        PyErr_SetString(PyExc_ValueError,
                        "compile() mode must be 'exec', 'eval' or 'single'");
        goto error;
    }

    is_ast = PyAST_Check(source);
    if (is_ast == -1)
        goto error;
    if (is_ast) {
        // do sth.
    }

    str = source_as_string(source, "compile", "string, bytes or AST", &cf, &source_copy);
    if (str == NULL)
        goto error;

    result = Py_CompileStringObject(str, filename, start[compile_mode], &cf, optimize);
    Py_XDECREF(source_copy);
    goto finally;

error:
    result = NULL;
finally:
    Py_DECREF(filename);
    return result;
}

Here:

  • source_as_string loads the test.py source code shown above into memory;
  • Py_CompileStringObject generates the PyCodeObject:
// pythonrun.c
PyObject *
Py_CompileStringObject(const char *str, PyObject *filename, int start,
                       PyCompilerFlags *flags, int optimize)
{
    PyCodeObject *co;
    mod_ty mod;
    PyArena *arena = PyArena_New();
    if (arena == NULL)
        return NULL;

    mod = PyParser_ASTFromStringObject(str, filename, start, flags, arena);
    if (mod == NULL) {
        PyArena_Free(arena);
        return NULL;
    }
    if (flags && (flags->cf_flags & PyCF_ONLY_AST)) {
        PyObject *result = PyAST_mod2obj(mod);
        PyArena_Free(arena);
        return result;
    }
    co = PyAST_CompileObject(mod, filename, flags, optimize, arena);
    PyArena_Free(arena);
    return (PyObject *)co;
}

PyParser_ASTFromStringObject builds the syntax tree, and PyAST_CompileObject turns it into a PyCodeObject. Parsing and compilation are not analyzed in depth here.
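
The same two stages are visible from Python. The sketch below uses the standard ast module as a Python-level counterpart of the C functions above: ast.parse produces the tree, and compile() accepts that tree and returns a code object. The sample source and the file name "<sample>" are assumptions.

```python
import ast

source = "result = 1 + 2\n"

# Stage 1: source text to AST.
tree = ast.parse(source, filename="<sample>", mode="exec")

# Stage 2: AST to code object.
code = compile(tree, "<sample>", "exec")

namespace = {}
exec(code, namespace)
print(type(tree).__name__, namespace["result"])
```

Passing an AST to compile() takes the is_ast branch of builtin_compile_impl, which the excerpt above leaves out.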

8.3.3. Converting the PyCodeObject to Binary Data

_code_to_timestamp_pyc is responsible for converting the PyCodeObject to binary data. Its source is as follows:

# _bootstrap_external.py
def _code_to_timestamp_pyc(code, mtime=0, source_size=0):
    "Produce the data for a timestamp-based pyc."
    data = bytearray(MAGIC_NUMBER)
    data.extend(_w_long(0))
    data.extend(_w_long(mtime))
    data.extend(_w_long(source_size))
    data.extend(marshal.dumps(code))
    return data

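As this shows, a pyc file consists of several parts:

  • MAGIC_NUMBER: each Python version defines its own MAGIC_NUMBER, for example 3392 for Python 3.7a0 and 3360 for Python 3.6a0, which prevents loading an incompatible pyc file;
  • 0: the flags field introduced in Python 3.7 (PEP 552); a value of 0 marks a timestamp-based pyc, as opposed to a hash-based one;
  • mtime: the time the py file was created or last modified; if the modification time is unchanged, the source does not need to be recompiled and the pyc file does not need to be rewritten;
  • source_size: the size of the source code;
  • marshal.dumps(code): the binary stream of the PyCodeObject.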
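
The layout listed above can be reproduced and read back with the standard library. The sketch below is a minimal illustration that assumes Python 3.7 or later (16-byte header): it assembles the same fields in memory and then unpacks them again. The sample source is made up, and importlib.util.MAGIC_NUMBER is used in place of the private MAGIC_NUMBER constant.

```python
import importlib.util
import marshal
import struct
import time

source = "answer = 42\n"
code = compile(source, "<sample>", "exec")
mtime = int(time.time())
source_size = len(source.encode("utf-8"))

# Same layout as _code_to_timestamp_pyc: magic, flags, mtime, size, payload.
data = bytearray(importlib.util.MAGIC_NUMBER)
data += struct.pack("<I", 0)            # flags: 0 means timestamp-based
data += struct.pack("<I", mtime)
data += struct.pack("<I", source_size)
data += marshal.dumps(code)

# Read the 16-byte header back, then unmarshal the payload.
magic = bytes(data[:4])
flags, read_mtime, read_size = struct.unpack("<III", data[4:16])
loaded = marshal.loads(bytes(data[16:]))

namespace = {}
exec(loaded, namespace)
print(int.from_bytes(magic[:2], "little"), flags, read_size, namespace["answer"])
```

The first two bytes of the magic number hold the version-specific value; the last two are always a carriage return and a line feed.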

marshal.dumps calls marshal_dumps_impl:

// marshal.c
static PyObject *
marshal_dumps_impl(PyObject *module, PyObject *value, int version)
/*[clinic end generated code: output=9c200f98d7256cad input=a2139ea8608e9b27]*/
{
    return PyMarshal_WriteObjectToString(value, version);
}

The source of PyMarshal_WriteObjectToString is:

// marshal.c
PyObject *
PyMarshal_WriteObjectToString(PyObject *x, int version)
{
    WFILE wf;

    memset(&wf, 0, sizeof(wf));
    wf.str = PyBytes_FromStringAndSize((char *)NULL, 50);
    if (wf.str == NULL)
        return NULL;
    wf.ptr = wf.buf = PyBytes_AS_STRING((PyBytesObject *)wf.str);
    wf.end = wf.ptr + PyBytes_Size(wf.str);
    wf.error = WFERR_OK;
    wf.version = version;
    if (w_init_refs(&wf, version)) {
        Py_DECREF(wf.str);
        return NULL;
    }
    w_object(x, &wf);
    w_clear_refs(&wf);
    if (wf.str != NULL) {
        char *base = PyBytes_AS_STRING((PyBytesObject *)wf.str);
        if (wf.ptr - base > PY_SSIZE_T_MAX) {
            Py_DECREF(wf.str);
            PyErr_SetString(PyExc_OverflowError,
                            "too much marshal data for a bytes object");
            return NULL;
        }
        if (_PyBytes_Resize(&wf.str, (Py_ssize_t)(wf.ptr - base)) < 0)
            return NULL;
    }
    if (wf.error != WFERR_OK) {
        Py_XDECREF(wf.str);
        if (wf.error == WFERR_NOMEMORY)
            PyErr_NoMemory();
        else
            PyErr_SetString(PyExc_ValueError,
              (wf.error==WFERR_UNMARSHALLABLE)?"unmarshallable object"
               :"object too deeply nested to marshal");
        return NULL;
    }
    return wf.str;
}

The key function here is w_object, which calls w_complex_object. The actual conversion of the PyCodeObject to binary data happens in w_complex_object:

// marshal.c
static void
w_complex_object(PyObject *v, char flag, WFILE *p)
{
    // do sth.
    else if (PyCode_Check(v)) {
        PyCodeObject *co = (PyCodeObject *)v;
        W_TYPE(TYPE_CODE, p);
        w_long(co->co_argcount, p);
        w_long(co->co_kwonlyargcount, p);
        w_long(co->co_nlocals, p);
        w_long(co->co_stacksize, p);
        w_long(co->co_flags, p);
        w_object(co->co_code, p);
        w_object(co->co_consts, p);
        w_object(co->co_names, p);
        w_object(co->co_varnames, p);
        w_object(co->co_freevars, p);
        w_object(co->co_cellvars, p);
        w_object(co->co_filename, p);
        w_object(co->co_name, p);
        w_long(co->co_firstlineno, p);
        w_object(co->co_lnotab, p);
    }
    // do sth.
}

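As this shows:

  • The type of a PyCodeObject is TYPE_CODE. The test.py file from section 8.2 produces three PyCodeObject objects, related by nesting: one PyCodeObject contains the other two;
  • Fields such as co_argcount and co_kwonlyargcount are written with w_long (which calls w_byte to write four bytes), while fields such as co_code and co_consts are written with w_object (which in turn calls w_long, w_string, and so on), and that is how they end up as binary data. The meaning of each field will be analyzed in depth later;
  • Note the special type TYPE_REF, which saves storage space. Take co_filename, whose value is the full path of the py file. The values of the co_filename field in the pyc file generated for test.py are: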
// class A
"co_filename": {
    "type": "unicode",
    "size": 49,
    "value": "/Users/l.wang/Documents/pythonindepth/bin/test.py"
}

// def Fun
"co_filename": {
    "type": "ref",
    "ref": 6
}

// test.py
"co_filename": {
    "type": "ref",
    "ref": 6
}

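This is implemented by w_ref, whose source follows. It uses a hash table whose key is the address of an object and whose value is an index. If the table already contains an object with the same address, TYPE_REF and the index are written instead of the object, which saves space.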

// marshal.c
static int
w_ref(PyObject *v, char *flag, WFILE *p)
{
    _Py_hashtable_entry_t *entry;
    int w;

    if (p->version < 3 || p->hashtable == NULL) {
        return 0; /* not writing object references */
    }

    /* if it has only one reference, it definitely isn't shared */
    if (Py_REFCNT(v) == 1) {
        return 0;
    }

    entry = _Py_HASHTABLE_GET_ENTRY(p->hashtable, v);
    if (entry != NULL) {
        /* write the reference index to the stream */
        _Py_HASHTABLE_ENTRY_READ_DATA(p->hashtable, entry, w);
        /* we don't store "long" indices in the dict */
        assert(0 <= w && w <= 0x7fffffff);
        w_byte(TYPE_REF, p);
        w_long(w, p);
        return 1;
    } else {
        size_t s = p->hashtable->entries;
        /* we don't support long indices */
        if (s >= 0x7fffffff) {
            PyErr_SetString(PyExc_ValueError, "too many objects");
            goto err;
        }
        w = (int)s;
        Py_INCREF(v);
        if (_Py_HASHTABLE_SET(p->hashtable, v, w) < 0) {
            Py_DECREF(v);
            goto err;
        }
        *flag |= FLAG_REF;
        return 0;
    }
err:
    p->error = WFERR_UNMARSHALLABLE;
    return 1;
}

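The reverse process is implemented as follows. If flag is nonzero, the actual value is appended to a list. If the type is TYPE_REF, the real value is fetched from that list using the index that was read.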

static PyObject *
r_object(RFILE *p)
{
    PyObject *v, *v2;
    Py_ssize_t idx = 0;
    long i, n;
    int type, code = r_byte(p);
    int flag, is_interned = 0;
    PyObject *retval = NULL;

    if (code == EOF) {
        PyErr_SetString(PyExc_EOFError,
                        "EOF read where object expected");
        return NULL;
    }

    p->depth++;

    if (p->depth > MAX_MARSHAL_STACK_DEPTH) {
        p->depth--;
        PyErr_SetString(PyExc_ValueError, "recursion limit exceeded");
        return NULL;
    }

    flag = code & FLAG_REF;
    type = code & ~FLAG_REF;

#define R_REF(O) do{\
    if (flag) \
        O = r_ref(O, flag, p);\
} while (0)

    switch (type) {
      // do sth.
      case TYPE_REF:
        n = r_long(p);
        if (n < 0 || n >= PyList_GET_SIZE(p->refs)) {
            if (n == -1 && PyErr_Occurred())
                break;
            PyErr_SetString(PyExc_ValueError, "bad marshal data (invalid reference)");
            break;
        }
        v = PyList_GET_ITEM(p->refs, n);
        if (v == Py_None) {
            PyErr_SetString(PyExc_ValueError, "bad marshal data (invalid reference)");
            break;
        }
        Py_INCREF(v);
        retval = v;
        break;
      // do sth.
      }
}

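One open question remains: why does w_ref not use a flag value, the way r_object does, to decide which objects go into the hash table? I have not figured this out yet. One plausible reading is that the decision is made on the writing side: the first time w_ref sees an object whose reference count is greater than 1, it registers the object and sets FLAG_REF in that object's type byte, and r_object simply follows that flag when rebuilding its list.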
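
The space saving from TYPE_REF can be observed through the public marshal module. The sketch below is an illustration under the assumption that marshal format version 2 writes no references while version 4 does: it serializes a tuple that holds the same string object twice and compares the sizes.

```python
import marshal

# One string object referenced twice from the same tuple.
shared = "a" * 1000
pair = (shared, shared)

# Version 2 predates object references; versions 3 and later can emit TYPE_REF.
without_refs = marshal.dumps(pair, 2)
with_refs = marshal.dumps(pair, 4)
print(len(without_refs), len(with_refs))

# After loading, the reference resolves to the very same object.
loaded = marshal.loads(with_refs)
print(loaded[0] is loaded[1])
```

With references enabled, the second occurrence of the string costs only a type byte and a four-byte index instead of a full copy.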

8.3.4. Writing the Binary Data to a File

_cache_bytecode is responsible for writing the binary data to a file. Its source is as follows:

# _bootstrap_external.py    
def _cache_bytecode(self, source_path, bytecode_path, data):
    # Adapt between the two APIs
    mode = _calc_mode(source_path)
    return self.set_data(bytecode_path, data, _mode=mode)

The source of set_data is as follows:

    def set_data(self, path, data, *, _mode=0o666):
        """Write bytes data to a file."""
        parent, filename = _path_split(path)
        path_parts = []
        # Figure out what directories are missing.
        while parent and not _path_isdir(parent):
            parent, part = _path_split(parent)
            path_parts.append(part)
        # Create needed directories.
        for part in reversed(path_parts):
            parent = _path_join(parent, part)
            try:
                _os.mkdir(parent)
            except FileExistsError:
                # Probably another Python process already created the dir.
                continue
            except OSError as exc:
                # Could be a permission error, read-only filesystem: just forget
                # about writing the data.
                _bootstrap._verbose_message('could not create {!r}: {!r}',
                                            parent, exc)
                return
        try:
            _write_atomic(path, data, _mode)
            _bootstrap._verbose_message('created {!r}', path)
        except OSError as exc:
            # Same as above: just don't write the bytecode.
            _bootstrap._verbose_message('could not create {!r}: {!r}', path,
                                        exc)

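The key function for the write is _write_atomic, whose source follows. It writes to a temporary file and then renames it, which guarantees one of two outcomes: either an exception occurs and no file is produced, or no exception occurs and a file with the requested name is produced.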

def _write_atomic(path, data, mode=0o666):
    """Best-effort function to write data to a path atomically.
    Be prepared to handle a FileExistsError if concurrent writing of the
    temporary file is attempted."""
    # id() is used to generate a pseudo-random filename.
    path_tmp = '{}.{}'.format(path, id(path))
    fd = _os.open(path_tmp,
                  _os.O_EXCL | _os.O_CREAT | _os.O_WRONLY, mode & 0o666)
    try:
        # We first write data to a temporary file, and then use os.replace() to
        # perform an atomic rename.
        with _io.FileIO(fd, 'wb') as file:
            file.write(data)
        _os.replace(path_tmp, path)
    except OSError:
        try:
            _os.unlink(path_tmp)
        except OSError:
            pass
        raise
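
The same write-then-rename pattern can be sketched with public modules only. The version below is not the importlib implementation: it swaps the id()-based temporary name for tempfile.mkstemp, and the function name write_atomic and the file name out.bin are hypothetical.

```python
import os
import tempfile

def write_atomic(path, data):
    """Write data to path via a temporary file followed by os.replace."""
    directory = os.path.dirname(path) or "."
    # Create the temporary file in the target directory so the rename
    # stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)
    except OSError:
        # Remove the partial temporary file, then re-raise.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise

target = os.path.join(tempfile.mkdtemp(), "out.bin")
write_atomic(target, b"first")
write_atomic(target, b"second")
with open(target, "rb") as f:
    content = f.read()
print(content)
```

A reader of the target path sees either the old content or the new content, never a partially written file, because os.replace swaps the name in one step.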

8.4. References

  • Python原始碼剖析 (Python Source Code Analysis)

8.5. Appendix

With the pyc generation flow understood, the tool mentioned in section 8.1 can be implemented. Its source is as follows:

# -*- coding:utf-8 -*-
import json
import datetime
import sys

FLAG_REF = 0x80
TYPE_CODE = ord('c')
TYPE_STRING = ord('s')
TYPE_SMALL_TUPLE = ord(')')
TYPE_INT = ord('i')
TYPE_SHORT_ASCII = ord('z')
TYPE_SHORT_ASCII_INTERNED = ord('Z')
TYPE_REF = ord('r')
TYPE_NONE = ord('N')

# Maps reference index -> parsed value, mirroring the refs list in marshal.c.
REFS_HASH = {}

def parse_code(fp):
    code = int.from_bytes(fp.read(1), 'little')
    code_type = code & ~FLAG_REF
    code_flag = code & FLAG_REF

    # Reserve the slot first so that nested objects get later indices.
    idx = len(REFS_HASH)
    if code_flag:
        REFS_HASH[idx] = None

    code_dict = {}
    if code_type == TYPE_CODE:
        code_dict['type'] = 'code'
        code_dict['co_argcount'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_kwonlyargcount'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_nlocals'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_stacksize'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_flags'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_code'] = parse_code(fp)
        code_dict['co_consts'] = parse_code(fp)
        code_dict['co_names'] = parse_code(fp)
        code_dict['co_varnames'] = parse_code(fp)
        code_dict['co_freevars'] = parse_code(fp)
        code_dict['co_cellvars'] = parse_code(fp)
        code_dict['co_filename'] = parse_code(fp)
        code_dict['co_name'] = parse_code(fp)
        code_dict['co_firstlineno'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_lnotab'] = parse_code(fp)
    elif code_type == TYPE_STRING:
        code_dict['type'] = 'string'

        length = int.from_bytes(fp.read(4), 'little')
        code_dict['length'] = length

        # todo: the raw bytes are only stringified, not decoded
        value = fp.read(length)
        code_dict['value'] = str(value)

        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type == TYPE_SMALL_TUPLE:
        code_dict['type'] = 'tuple'

        size = int.from_bytes(fp.read(1), 'little')
        code_dict['size'] = size

        items = []
        for _ in range(size):
            items.append(parse_code(fp))
        code_dict['items'] = items

        if code_flag:
            REFS_HASH[idx] = code_dict['items']
    elif code_type == TYPE_INT:
        code_dict['type'] = 'long'

        # w_long writes a signed 32-bit value.
        value = int.from_bytes(fp.read(4), 'little', signed=True)
        code_dict['value'] = value

        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type in (TYPE_SHORT_ASCII, TYPE_SHORT_ASCII_INTERNED):
        code_dict['type'] = 'unicode'

        size = int.from_bytes(fp.read(1), 'little')
        code_dict['size'] = size

        code_dict['value'] = fp.read(size).decode()

        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type == TYPE_REF:
        code_dict['type'] = 'ref'
        code_dict['ref'] = int.from_bytes(fp.read(4), 'little')
        code_dict['value'] = REFS_HASH[code_dict['ref']]
    elif code_type == TYPE_NONE:
        code_dict['type'] = 'none'
    else:
        # Other marshal types are not handled by this tool.
        print('unsupported type code:', code_type)

    return code_dict

def parse_pyc(file_name):
    pyc_dict = {}

    with open(file_name, 'rb') as fp:
        # Python 3.7 magic numbers run from 3390 (alpha) to 3394 (final).
        magic_number = int.from_bytes(fp.read(2), 'little')
        if 3390 <= magic_number <= 3394:
            pyc_dict['version'] = 'Python 3.7'
        else:
            print('only support Python 3.7')
            sys.exit(1)

        _ = fp.read(2)  # the "\r\n" that ends the magic number
        _ = fp.read(4)  # flags field (0 for a timestamp-based pyc)

        timestamp = int.from_bytes(fp.read(4), 'little')
        pyc_dict['modified'] = str(datetime.datetime.fromtimestamp(timestamp))

        source_size = int.from_bytes(fp.read(4), 'little')
        pyc_dict['size'] = source_size
        pyc_dict['code'] = parse_code(fp)

    return pyc_dict

if __name__ == '__main__':
    file_name = sys.argv[1]
    print(json.dumps(parse_pyc(file_name), indent=2))

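The result of analyzing test.py is:

The tool resolves TYPE_REF entries. The value field below is filled in by the tool and is not stored in the actual binary data: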

"co_filename": {
    "type": "ref",
    "ref": 6,
    "value": "/Users/l.wang/Documents/pythonindepth/bin/test.py"
}

The tool does not yet decode the byte code instructions themselves.

