Python記憶體資料庫/引擎

Prodesire發表於2016-09-07

1 初探

在平時的開發工作中，我們可能會有這樣的需求：我們希望有一個記憶體資料庫或者資料引擎，用比較Pythonic的方式進行資料庫的操作（比如說插入和查詢）。

舉個具體的例子，分別向資料庫db中插入兩條資料，”a=1, b=1″ 和 “a=1, b=2”, 然後想查詢a=1的資料可能會使用這樣的語句db.query(a=1)，結果就是返回前面插入的兩條資料；如果想查詢a=1, b=2的資料，就使用這樣的語句db.query(a=1, b=2)，結果就返回前面的第二條資料。

那麼是否擁有實現上述需求的現成的第三方庫呢？幾經查詢，發現PyDbLite能夠滿足這樣的需求。其實，PyDbLite和Python自帶的SQLite均支援記憶體資料庫模式，只是前者是Pythonic的用法，而後者則是典型的SQL用法。
他們具體的用法是這樣的：

PyDbLite

import pydblite
# 使用記憶體資料庫
pydb = pydblite.Base(':memory:')
# 建立a,b,c三個欄位
pydb.create('a', 'b', 'c')
# 為欄位a,b建立索引
pydb.create_index('a', 'b')
# 插入一條資料
pydb.insert(a=-1, b=0, c=1)
# 查詢符合特定要求的資料
results = pydb(a=-1, b=0)

import pydblite

# 使用記憶體資料庫

pydb = pydblite.Base(':memory:')

# 建立a,b,c三個欄位

pydb.create('a', 'b', 'c')

# 為欄位a,b建立索引

pydb.create_index('a', 'b')

# 插入一條資料

pydb.insert(a=-1, b=0, c=1)

# 查詢符合特定要求的資料

results = pydb(a=-1, b=0)

SQLite

import sqlite3
# 使用記憶體資料庫
con = sqlite3.connect(':memory:')
# 建立a,b,c三個欄位
cur = con.cursor()
cur.execute('create table test (a char(256), b char(256), c char(256));')
# 為欄位a,b建立索引
cur.execute('create index a_index on test(a)')
cur.execute('create index b_index on test(b)')
# 插入一條資料
cur.execute('insert into test values(?, ?, ?)', (-1,0,1))
# 查詢符合特定要求的資料
cur.execute('select * from test where a=? and b=?',(-1, 0))

import sqlite3

# 使用記憶體資料庫

con = sqlite3.connect(':memory:')

# 建立a,b,c三個欄位

cur = con.cursor()

cur.execute('create table test (a char(256), b char(256), c char(256));')

# 為欄位a,b建立索引

cur.execute('create index a_index on test(a)')

cur.execute('create index b_index on test(b)')

# 插入一條資料

cur.execute('insert into test values(?, ?, ?)', (-1,0,1))

# 查詢符合特定要求的資料

cur.execute('select * from test where a=? and b=?',(-1, 0))

2 pydblite和sqlite的效能

毫無疑問，pydblite的使用方式非常地Pythonic，但是它的效率如何呢？由於我們主要關心的是資料插入和查詢速度，所以不妨僅對這兩項做一個對比。寫一個簡單的測試指令碼：

import time
count = 100000

def timeit(func):
    def wrapper(*args, **kws):
        t = time.time()
        func(*args)
        print time.time() - t, kws['des']
    return wrapper

@timeit
def test_insert(mdb, des=''):
    for i in xrange(count):
        mdb.insert(a=i-1, b=i, c=i+1)

@timeit
def test_query_object(mdb, des=''):
    for i in xrange(count):
        c = mdb(a=i-1, b=i)

@timeit
def test_sqlite_insert(cur, des=''):
    for i in xrange(count):
        cur.execute('insert into test values(?, ?, ?)', (i-1, i, i+1))

@timeit
def test_sqlite_query(cur, des=''):
    for i in xrange(count):
        cur.execute('select * from test where a=? and b=?', (i-1, i))

print '-------pydblite--------'
import pydblite
pydb = pydblite.Base(':memory:')
pydb.create('a', 'b', 'c')
pydb.create_index('a', 'b')
test_insert(pydb, des='insert')
test_query_object(pydb, des='query, object call')


print '-------sqlite3--------'
import sqlite3
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('create table test (a char(256), b char(256), c char(256));')
cur.execute('create index a_index on test(a)')
cur.execute('create index b_index on test(b)')
test_sqlite_insert(cur, des='insert')
test_sqlite_query(cur, des='query')

import time

count = 100000

def timeit(func):

def wrapper(*args, **kws):

t = time.time()

func(*args)

print time.time() - t, kws['des']

return wrapper

@timeit

def test_insert(mdb, des=''):

for i in xrange(count):

mdb.insert(a=i-1, b=i, c=i+1)

@timeit

def test_query_object(mdb, des=''):

for i in xrange(count):

c = mdb(a=i-1, b=i)

@timeit

def test_sqlite_insert(cur, des=''):

for i in xrange(count):

cur.execute('insert into test values(?, ?, ?)', (i-1, i, i+1))

@timeit

def test_sqlite_query(cur, des=''):

for i in xrange(count):

cur.execute('select * from test where a=? and b=?', (i-1, i))

print '-------pydblite--------'

import pydblite

pydb = pydblite.Base(':memory:')

pydb.create('a', 'b', 'c')

pydb.create_index('a', 'b')

test_insert(pydb, des='insert')

test_query_object(pydb, des='query, object call')

print '-------sqlite3--------'

import sqlite3

con = sqlite3.connect(':memory:')

cur = con.cursor()

cur.execute('create table test (a char(256), b char(256), c char(256));')

cur.execute('create index a_index on test(a)')

cur.execute('create index b_index on test(b)')

test_sqlite_insert(cur, des='insert')

test_sqlite_query(cur, des='query')

在建立索引的情況下，10w次的插入和查詢的時間如下：

-------pydblite--------
1.14199995995 insert
0.308000087738 query, object call
-------sqlite3--------
0.411999940872 insert
0.30999994278 query

-------pydblite--------

1.14199995995 insert

0.308000087738 query, object call

-------sqlite3--------

0.411999940872 insert

0.30999994278 query

在未建立索引的情況（把建立索引的測試語句註釋掉）下，1w次的插入和查詢時間如下：

-------pydblite--------
0.0989999771118 insert
5.15300011635 query, object call
-------sqlite3--------
0.0169999599457 insert
7.43400001526 query

-------pydblite--------

0.0989999771118 insert

5.15300011635 query, object call

-------sqlite3--------

0.0169999599457 insert

7.43400001526 query

我們不難得出如下結論：

sqlite的插入速度是pydblite的3-5倍；而在建立索引的情況下，sqlite的查詢速度和pydblite相當；在未建立索引的情況下，sqlite的查詢速度比pydblite慢1.5倍左右。

3 優化

我們的目標非常明確，使用Pythonic的記憶體資料庫，提高插入和查詢效率，而不考慮持久化。那麼能否既擁有pydblite的pythonic的使用方式，又同時具備pydblite和sqlite中插入和查詢速度快的那一方的速度？針對我們的目標，看看能否對pydblite做一些優化。

閱讀pydblite的原始碼，首先映入眼簾的是對python2和3做了一個簡單的區分。給外部呼叫的Base基於_BasePy2或者_BasePy3，它們僅僅是在__iter__上有細微差異，最終呼叫的是_Base這個類。

class _BasePy2(_Base):

    def __iter__(self):
        """Iteration on the records"""
        return iter(self.records.itervalues())


class _BasePy3(_Base):

    def __iter__(self):
        """Iteration on the records"""
        return iter(self.records.values())

if sys.version_info[0] == 2:
    Base = _BasePy2
else:
    Base = _BasePy3

class _BasePy2(_Base):

def __iter__(self):

"""Iteration on the records"""

return iter(self.records.itervalues())

class _BasePy3(_Base):

def __iter__(self):

"""Iteration on the records"""

return iter(self.records.values())

if sys.version_info[0] == 2:

Base = _BasePy2

else:

Base = _BasePy3

然後看下_Base的建構函式，做了簡單的初始化檔案的操作，由於我們就是使用記憶體資料庫，所以檔案相關的內容完全可以拋棄。

class _Base(object):

    def __init__(self, path, protocol=pickle.HIGHEST_PROTOCOL, save_to_file=True,
                 sqlite_compat=False):
        """protocol as defined in pickle / pickle.
        Defaults to the highest protocol available.
        For maximum compatibility use protocol = 0

        """
        self.path = path
        """The path of the database in the file system"""
        self.name = os.path.splitext(os.path.basename(path))[0]
        """The basename of the path, stripped of its extension"""
        self.protocol = protocol
        self.mode = None
        if path == ":memory:":
            save_to_file = False
        self.save_to_file = save_to_file
        self.sqlite_compat = sqlite_compat
        self.fields = []
        """The list of the fields (does not include the internal
        fields __id__ and __version__)"""
        # if base exists, get field names
        if save_to_file and self.exists():
            if protocol == 0:
                _in = open(self.path)  # don't specify binary mode !
            else:
                _in = open(self.path, 'rb')
            self.fields = pickle.load(_in)

class _Base(object):

def __init__(self, path, protocol=pickle.HIGHEST_PROTOCOL, save_to_file=True,

sqlite_compat=False):

"""protocol as defined in pickle / pickle.

Defaults to the highest protocol available.

For maximum compatibility use protocol = 0

"""

self.path = path

"""The path of the database in the file system"""

self.name = os.path.splitext(os.path.basename(path))[0]

"""The basename of the path, stripped of its extension"""

self.protocol = protocol

self.mode = None

if path == ":memory:":

save_to_file = False

self.save_to_file = save_to_file

self.sqlite_compat = sqlite_compat

self.fields = []

"""The list of the fields (does not include the internal

fields __id__ and __version__)"""

# if base exists, get field names

if save_to_file and self.exists():

if protocol == 0:

_in = open(self.path) # don't specify binary mode !

else:

_in = open(self.path, 'rb')

self.fields = pickle.load(_in)

緊接著比較重要的是create（建立欄位）、create_index（建立索引）兩個函式：

    def create(self, *fields, **kw):
        """
        Create a new base with specified field names.

        Args:
            - *fields (str): The field names to create.
            - mode (str): the mode used when creating the database.

        - if mode = 'create' : create a new base (the default value)
        - if mode = 'open' : open the existing base, ignore the fields
        - if mode = 'override' : erase the existing base and create a
          new one with the specified fields

        Returns:
            - the database (self).
        """
        self.mode = kw.get("mode", 'create')
        if self.save_to_file and os.path.exists(self.path):
            if not os.path.isfile(self.path):
                raise IOError("%s exists and is not a file" % self.path)
            elif self.mode is 'create':
                raise IOError("Base %s already exists" % self.path)
            elif self.mode == "open":
                return self.open()
            elif self.mode == "override":
                os.remove(self.path)
            else:
                raise ValueError("Invalid value given for 'open': '%s'" % open)

        self.fields = []
        self.default_values = {}
        for field in fields:
            if type(field) is dict:
                self.fields.append(field["name"])
                self.default_values[field["name"]] = field.get("default", None)
            elif type(field) is tuple:
                self.fields.append(field[0])
                self.default_values[field[0]] = field[1]
            else:
                self.fields.append(field)
                self.default_values[field] = None

        self.records = {}
        self.next_id = 0
        self.indices = {}
        self.commit()
        return self

    def create_index(self, *fields):
        """
        Create an index on the specified field names

        An index on a field is a mapping between the values taken by the field
        and the sorted list of the ids of the records whose field is equal to
        this value

        For each indexed field, an attribute of self is created, an instance
        of the class Index (see above). Its name it the field name, with the
        prefix _ to avoid name conflicts

        Args:
            - fields (list): the fields to index
        """
        reset = False
        for f in fields:
            if f not in self.fields:
                raise NameError("%s is not a field name %s" % (f, self.fields))
            # initialize the indices
            if self.mode == "open" and f in self.indices:
                continue
            reset = True
            self.indices[f] = {}
            for _id, record in self.records.items():
                # use bisect to quickly insert the id in the list
                bisect.insort(self.indices[f].setdefault(record[f], []), _id)
            # create a new attribute of self, used to find the records
            # by this index
            setattr(self, '_' + f, Index(self, f))
        if reset:
            self.commit()

def create(self, *fields, **kw):

"""

Create a new base with specified field names.

Args:

- *fields (str): The field names to create.

- mode (str): the mode used when creating the database.

- if mode = 'create' : create a new base (the default value)

- if mode = 'open' : open the existing base, ignore the fields

- if mode = 'override' : erase the existing base and create a

new one with the specified fields

Returns:

- the database (self).

"""

self.mode = kw.get("mode", 'create')

if self.save_to_file and os.path.exists(self.path):

if not os.path.isfile(self.path):

raise IOError("%s exists and is not a file" % self.path)

elif self.mode is 'create':

raise IOError("Base %s already exists" % self.path)

elif self.mode == "open":

return self.open()

elif self.mode == "override":

os.remove(self.path)

else:

raise ValueError("Invalid value given for 'open': '%s'" % open)

self.fields = []

self.default_values = {}

for field in fields:

if type(field) is dict:

self.fields.append(field["name"])

self.default_values[field["name"]] = field.get("default", None)

elif type(field) is tuple:

self.fields.append(field[0])

self.default_values[field[0]] = field[1]

else:

self.fields.append(field)

self.default_values[field] = None

self.records = {}

self.next_id = 0

self.indices = {}

self.commit()

return self

def create_index(self, *fields):

"""

Create an index on the specified field names

An index on a field is a mapping between the values taken by the field

and the sorted list of the ids of the records whose field is equal to

this value

For each indexed field, an attribute of self is created, an instance

of the class Index (see above). Its name it the field name, with the

prefix _ to avoid name conflicts

Args:

- fields (list): the fields to index

"""

reset = False

for f in fields:

if f not in self.fields:

raise NameError("%s is not a field name %s" % (f, self.fields))

# initialize the indices

if self.mode == "open" and f in self.indices:

continue

reset = True

self.indices[f] = {}

for _id, record in self.records.items():

# use bisect to quickly insert the id in the list

bisect.insort(self.indices[f].setdefault(record[f], []), _id)

# create a new attribute of self, used to find the records

# by this index

setattr(self, '_' + f, Index(self, f))

if reset:

self.commit()

可以看出，pydblite在記憶體中維護了一個名為records的字典變數，用來存放一條條的資料。它的key是內部維護的id，從0開始自增；而它的value則是使用者插入的資料，為了後續查詢和記錄的方便，這裡在每條資料中額外又加入了__id__和__version__。其次，內部維護的indices字典變數則是是個索引表，它的key是欄位名，而value則是這樣一個字典：其key是這個欄位所有已知的值，value是這個值所在的那條資料的id。

舉個例子，假設我們插入了“a=-1,b=0,c=1”和“a=0,b=1,c=2”兩條資料，那麼records和indices的內容會是這樣的：

# records
{0: {'__id__': 0, '__version__': 0, 'a': -1, 'b': 0, 'c': 1},
 1: {'__id__': 1, '__version__': 0, 'a': 0, 'b': 1, 'c': 2}}

# indices
{'a': {-1: [0], 0: [1]}, 'b': {0: [0], 1: [1]}}

# records

{0: {'__id__': 0, '__version__': 0, 'a': -1, 'b': 0, 'c': 1},

1: {'__id__': 1, '__version__': 0, 'a': 0, 'b': 1, 'c': 2}}

# indices

{'a': {-1: [0], 0: [1]}, 'b': {0: [0], 1: [1]}}

比方說現在我們想查詢a=0的資料，那麼就會在indices中找key為’a’的value，即{-1: set([0]), 0: set([1])}，然後在這裡面找key為0的value，即[1]，由此我們直到了我們想要的這條資料它的id是1（也可能會有多個）；假設我們對資料還有其他要求比如a=0,b=1，那麼它會繼續上述的查詢過程，找到a=0和b=1分別對應的ids，做交集，就得到了滿足這兩個條件的ids，然後再到records里根據ids找到所有對應的資料。

明白了原理，我們再看看有什麼可優化的地方：

資料結構，整體的records和indeices資料結構已經挺精簡了，暫時不需要優化。其中的__version__可以不要，因為我們並不關注這個資料被修改了幾次。其次是由於indices中最終的ids是個list，在查詢和插入的時候會比較慢，我們知道內部維護的id一定是唯一的，所以這裡改成set會好一些。

python語句，不難看出，整個_Base為了同時相容python2和python3，不得不使用了2和3都支援的語句，這就導致在部分語句上針對特定版本的python就會造成浪費或者說是效能開銷。比如說，d是個字典，那麼為了同事相容python2和3，作者使用了類似與for key in d.keys()這樣的語句，在python2中，d.keys()會首先產生一個list，用d.iterkeys是個更明智的方案。再如，作者會使用類似set(d.keys()) – set([1])這樣的語句，但是python2中，使用d.viewkeys() – set([1])效率將會更高，因為它不需要將list轉化成set。

對特定版本python的優化語句就不一一舉例，概括地說，從資料結構，python語句以及是否需要某些功能等方面可以對pydblite做進一步的優化。前面只是說了create和create_index兩個函式，包括insert和__call__的優化也十分類似。此外，用普通方法來代替魔法方法，也能稍微提升下效率，所以在後續的優化中將__call__改寫為了query。

優化後的程式碼，請見MemLite。

4 memlite、pydblite和sqlite的效能

讓我們在上文的測試程式碼中加入對memlite的測試：

@timeit
def test_query_method(mdb, des=''):
    for i in xrange(count):
        c = mdb.query(a=i-1, b=i)

print '-------memlite-------'
import memlite
db = memlite.Base()
db.create('a', 'b', 'c')
db.create_index('a', 'b')
test_insert(db, des='insert')
test_query_method(db, des='query, method call')

@timeit

def test_query_method(mdb, des=''):

for i in xrange(count):

c = mdb.query(a=i-1, b=i)

print '-------memlite-------'

import memlite

db = memlite.Base()

db.create('a', 'b', 'c')

db.create_index('a', 'b')

test_insert(db, des='insert')

test_query_method(db, des='query, method call')

在建立索引的情況下，10w次的插入和查詢的時間如下：

-------memlite-------
0.378000020981 insert
0.285000085831 query, method call
-------pydblite--------
1.3140001297 insert
0.309000015259 query, object call
-------sqlite3--------
0.414000034332 insert
0.3109998703 query

-------memlite-------

0.378000020981 insert

0.285000085831 query, method call

-------pydblite--------

1.3140001297 insert

0.309000015259 query, object call

-------sqlite3--------

0.414000034332 insert

0.3109998703 query

在未建立索引的情況（把建立索引的測試語句註釋掉）下，1w次的插入和查詢時間如下：

-------memlite-------
0.0179998874664 insert
5.90199995041 query, method call
-------pydblite--------
0.0980000495911 insert
4.87400007248 query, object call
-------sqlite3--------
0.0170001983643 insert
7.42399978638 query

-------memlite-------

0.0179998874664 insert

5.90199995041 query, method call

-------pydblite--------

0.0980000495911 insert

4.87400007248 query, object call

-------sqlite3--------

0.0170001983643 insert

7.42399978638 query

可以看出，在建立索引的情況下，memlite的插入和查詢效能在sqlite和pydblite之上；而在未建立索引的情況下，memlite的插入效能和sqlite一樣，好於pydblite，memlite的查詢效能比pydblite稍差，但好於sqlite。綜合來看，memlite即擁有pydblite的pythonic的使用方式，又擁有pydblite和sqlite中效能較高者的效率，符合預期的優化目標。

【記憶體資料庫】TimesTen
2018-04-28
記憶體資料庫
記憶體資料庫如何發揮記憶體優勢？
2023-02-21
記憶體資料庫
【大頁記憶體】Oracle資料庫配置大頁記憶體
2021-08-19
記憶體Oracle資料庫
坑！火山引擎雲資料庫 MySQL 版節點記憶體只增不減
2024-08-17
資料庫MySql記憶體
2G記憶體搞定一億資料的分析引擎
2023-11-20
記憶體
瀚高資料庫記憶體結構
2021-10-18
資料庫記憶體
Oracle - 資料庫的記憶體結構
2018-09-17
Oracle資料庫記憶體
Oracle - 資料庫的記憶體調整
2018-09-17
Oracle資料庫記憶體
資料庫新兵：分散式實時分析記憶體資料庫eSight
2018-12-17
資料庫分散式記憶體
磁碟資料庫與記憶體資料庫的特點比較
2018-03-06
資料庫記憶體
分析師解讀記憶體資料庫MemSQLSP
2022-03-21
記憶體資料庫SQL
資料庫實現原理#6（共享記憶體）
2020-04-20
資料庫記憶體
南大通用極速記憶體資料庫
2021-07-20
記憶體資料庫
PG資料庫記憶體告警了怎麼分析
2023-02-10
資料庫記憶體
從Oracle資料庫故障到AIX記憶體管理
2021-02-03
Oracle資料庫AI記憶體
如何檢視MySQL資料庫佔多大記憶體，佔用太多記憶體怎麼辦？
2022-01-12
MySql資料庫記憶體
記憶體資料庫適合多大規模的資料集？UY
2022-03-21
記憶體資料庫
iOS開發筆記— 資料庫、Crash、記憶體問題分析
2018-09-12
iOS筆記資料庫記憶體
解讀記憶體資料庫的儲存需求RC
2022-03-21
記憶體資料庫
非易失性記憶體技術及資料庫
2020-05-17
記憶體資料庫
記憶體洩漏引起的資料庫效能問題
2021-05-06
記憶體資料庫
Postgresql資料庫體系結構-程式和記憶體結構
2019-04-11
SQL資料庫記憶體
達夢資料庫基礎知識（三）達夢資料庫記憶體結構
2021-10-26
資料庫記憶體
【虹科分享】Redis 不僅僅是記憶體資料庫
2023-11-02
Redis記憶體資料庫
基於記憶體的關聯式資料庫memsql初探
2021-12-27
記憶體資料庫SQL
記憶體資料庫解析與主流產品對比（三）
2021-02-02
記憶體資料庫
記憶體資料庫解析與主流產品對比（二）
2021-01-29
記憶體資料庫
記憶體資料庫解析與主流產品對比（一）
2021-01-29
記憶體資料庫
【虹科乾貨】使用記憶體資料庫解決三個資料庫效能問題
2023-10-30
記憶體資料庫
Redis 記憶體優化神技，小記憶體儲存大資料
2022-07-13
Redis記憶體優化大資料
python的記憶體管理
2018-05-21
Python記憶體
圖資料庫 NebulaGraph 的記憶體管理實踐之 Memory Tracker
2023-05-17
資料庫記憶體
(二) MdbCluster分散式記憶體資料庫——分散式架構1
2023-02-15
分散式記憶體資料庫架構
成為MySQL DBA後，再看ORACLE資料庫（五、記憶體管理）
2024-06-02
MySqlOracle資料庫記憶體
使用記憶體資料庫可以最佳化伺服器效能
2020-06-08
記憶體資料庫伺服器
南大通用極速記憶體資料庫(GBase eXtreme Directory Memory)
2021-02-19
記憶體資料庫REM
Python深入 Python的記憶體管理
2020-03-28
Python記憶體
記憶體中的資料儲存
2019-10-03
記憶體
(一) MdbCluster分散式記憶體資料庫——基礎架構介紹
2023-02-10
分散式記憶體資料庫架構

Python記憶體資料庫/引擎

2 pydblite和sqlite的效能

3 優化

4 memlite、pydblite和sqlite的效能

相關文章