Large-Scale Asynchronous News Crawler: Making MySQL Database Operations Easier

Published by 王平 on 2018-12-03

Original URL: https://www.yuanrenxue.com/crawler/news-crawler-mysql-tool.html

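Little apes, do you remember the Baidu news crawler we built at the very beginning, the one with so much to poke fun at? Its last step was supposed to save the downloaded pages and URLs to a database, but we implemented that step as nothing more than a print statement.

There are plenty of databases to choose from these days: long-established relational databases such as MySQL (MariaDB) and PostgreSQL, the newer NoSQL databases, and NewSQL databases as well. The choice is almost too wide, but MySQL (MariaDB) has clear advantages in availability, ease of use, stability, and community activity, so whenever it is good enough for the job, we choose MySQL.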

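Wrapping MySQL in a Python package

Today we take the MySQL operations out on their own, discuss them, and implement a more convenient wrapper.

The two best Python modules for working with MySQL are: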

1. MySQLdb
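This is a long-established MySQL module that wraps the C API of the MySQL client, but it mainly supports Python 2.x. Someone later forked it to add Python 3 support and named the fork mysqlclient-python. Its PyPI package name is mysqlclient, so you install it with pip install mysqlclient.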

2. PyMySQL
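This is a MySQL client implemented in pure Python. Because it is pure Python, it combines well with asyncio, Python 3's asynchronous module, which gave rise to the aiomysql module. When we write the asynchronous crawler later, we can use it to operate on the database asynchronously.

Based on this brief comparison, we choose PyMySQL as our database client module.

This old ape has been working with MySQL from Python for more than ten years, and in my experience the torndb wrapper from tornado is still the most convenient to use. torndb appeared long ago in the Python 2.x era, when it was a wrapper around MySQLdb. Later, when I moved to Python 3 and PyMySQL, I wrapped PyMySQL myself, drawing on torndb and my own experience, and gave the result a rather unglamorous name: ezpymysql

Behind that unglamorous name, though, is a convenience that saves you trouble. I hope the little apes will forgive the name for the sake of how handy it is.

Enough talk, on to the code!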

1. Usage example

First, let's look at a usage example to see how convenient it is:

from ezpymysql import Connection

db = Connection(
    'localhost',
    'db_name',
    'user',
    'password'
)
# Fetch a single record
sql = 'select * from test_table where id=%s'
data = db.get(sql, 2)

# Fetch multiple records
sql = 'select * from test_table where id>%s'
data = db.query(sql, 2)

# Insert a record
sql = 'insert into test_table(title, url) values(%s, %s)'
last_id = db.execute(sql, 'test', 'http://a.com/')
# or
last_id = db.insert(sql, 'test', 'http://a.com/')


# Insert a record with the higher-level method
item = {
    'title': 'test',
    'url': 'http://a.com/',
}
last_id = db.table_insert('test_table', item)

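Using it takes two steps:
First, create a MySQL connection;
then, query or insert data with SQL statements.

Some little apes may ask: why not use an ORM such as SQLAlchemy? In short, because this is simpler. Almost all of our operations are queries and inserts, and plain SQL statements such as select and insert are the most convenient and simplest way to do them. An ORM first requires a mapping model for each table, and the query methods differ from one ORM to another; that much wrapping is a poor fit for crawler workloads. In fact, this old ape still writes SQL by hand even for web applications, and it feels wonderfully clean!

All right, no more suspense: here is the implementation of ezpymysql.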

2. Implementation

#File: ezpymysql.py
#Author: veelion

"""A lightweight wrapper around PyMySQL.
only for python3

"""

import time
import logging
import traceback
import pymysql.cursors

version = "0.7"
version_info = (0, 7, 0, 0)


class Connection(object):
    """A lightweight wrapper around PyMySQL.
    """
    def __init__(self, host, database, user=None, password=None,
                 port=0,
                 max_idle_time=7 * 3600, connect_timeout=10,
                 time_zone="+0:00", charset = "utf8mb4", sql_mode="TRADITIONAL"):
        self.host = host
        self.database = database
        self.max_idle_time = float(max_idle_time)

        args = dict(use_unicode=True, charset=charset,
                    database=database,
                    init_command=('SET time_zone = "%s"' % time_zone),
                    cursorclass=pymysql.cursors.DictCursor,
                    connect_timeout=connect_timeout, sql_mode=sql_mode)
        if user is not None:
            args["user"] = user
        if password is not None:
            args["passwd"] = password

        # We accept a path to a MySQL socket file or a host(:port) string
        if "/" in host:
            args["unix_socket"] = host
        else:
            self.socket = None
            pair = host.split(":")
            if len(pair) == 2:
                args["host"] = pair[0]
                args["port"] = int(pair[1])
            else:
                args["host"] = host
                args["port"] = 3306
        if port:
            args['port'] = port

        self._db = None
        self._db_args = args
        self._last_use_time = time.time()
        try:
            self.reconnect()
        except Exception:
            logging.error("Cannot connect to MySQL on %s", self.host,
                          exc_info=True)

    def _ensure_connected(self):
        # Mysql by default closes client connections that are idle for
        # 8 hours, but the client library does not report this fact until
        # you try to perform a query and it fails.  Protect against this
        # case by preemptively closing and reopening the connection
        # if it has been idle for too long (7 hours by default).
        if (self._db is None or
            (time.time() - self._last_use_time > self.max_idle_time)):
            self.reconnect()
        self._last_use_time = time.time()

    def _cursor(self):
        self._ensure_connected()
        return self._db.cursor()

    def __del__(self):
        self.close()

    def close(self):
        """Closes this database connection."""
        if getattr(self, "_db", None) is not None:
            self._db.close()
            self._db = None

    def reconnect(self):
        """Closes the existing database connection and re-opens it."""
        self.close()
        self._db = pymysql.connect(**self._db_args)
        self._db.autocommit(True)

    def query(self, query, *parameters, **kwparameters):
        """Returns a row list for the given query and parameters."""
        cursor = self._cursor()
        try:
            cursor.execute(query, kwparameters or parameters)
            result = cursor.fetchall()
            return result
        finally:
            cursor.close()

    def get(self, query, *parameters, **kwparameters):
        """Returns the (singular) row returned by the given query.
        """
        cursor = self._cursor()
        try:
            cursor.execute(query, kwparameters or parameters)
            return cursor.fetchone()
        finally:
            cursor.close()

    def execute(self, query, *parameters, **kwparameters):
        """Executes the given query, returning the lastrowid from the query."""
        cursor = self._cursor()
        try:
            cursor.execute(query, kwparameters or parameters)
            return cursor.lastrowid
        except Exception as e:
            if e.args[0] == 1062:
                pass
            else:
                traceback.print_exc()
                raise e
        finally:
            cursor.close()

    insert = execute

    ## =============== high level method for table ===================

    def table_has(self, table_name, field, value):
        # Pass the value as a query parameter so PyMySQL escapes it;
        # formatting encoded bytes into the SQL text would embed b'...'
        # in Python 3 and never match.
        sql = 'SELECT %s FROM %s WHERE %s=%%s' % (
            field,
            table_name,
            field)
        d = self.get(sql, value)
        return d

    def table_insert(self, table_name, item):
        '''item is a dict : key is mysql table field'''
        fields = list(item.keys())
        values = list(item.values())
        fieldstr = ','.join(fields)
        valstr = ','.join(['%s'] * len(item))
        for i in range(len(values)):
            if isinstance(values[i], str):
                values[i] = values[i].encode('utf8')
        sql = 'INSERT INTO %s (%s) VALUES(%s)' % (table_name, fieldstr, valstr)
        try:
            last_id = self.execute(sql, *values)
            return last_id
        except Exception as e:
            if e.args[0] == 1062:
                # just skip duplicated item
                pass
            else:
                traceback.print_exc()
                print('sql:', sql)
                print('item:')
                for i in range(len(fields)):
                    vs = str(values[i])
                    if len(vs) > 300:
                        print(fields[i], ' : ', len(vs), type(values[i]))
                    else:
                        print(fields[i], ' : ', vs, type(values[i]))
                raise e

    def table_update(self, table_name, updates,
                     field_where, value_where):
        '''updates is a dict of {field_update:value_update}'''
        upsets = []
        values = []
        for k, v in updates.items():
            s = '%s=%%s' % k
            upsets.append(s)
            values.append(v)
        upsets = ','.join(upsets)
        # Bind value_where as a parameter too, so it is escaped like the
        # updated values instead of being formatted into the SQL text.
        sql = 'UPDATE %s SET %s WHERE %s=%%s' % (
            table_name,
            upsets,
            field_where,
        )
        values.append(value_where)
        self.execute(sql, *(values))
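
The execute() method above quietly skips MySQL error 1062 (duplicate entry) and re-raises everything else, which suits a crawler that may try to store the same URL twice. The sketch below isolates that pattern so it can be run without a database. FakeMySQLError and the three insert_* functions are hypothetical stand-ins invented for this illustration; they are not part of ezpymysql or PyMySQL.

```python
DUPLICATE_ENTRY = 1062  # MySQL error code for a duplicate unique-key value


class FakeMySQLError(Exception):
    """Hypothetical stand-in for a driver error whose args[0] is the error code."""


def run_ignoring_duplicates(operation):
    """Run operation(); return its result, or None on a duplicate-entry error."""
    try:
        return operation()
    except Exception as e:
        if e.args and e.args[0] == DUPLICATE_ENTRY:
            return None  # the row already exists, so just skip it
        raise  # anything else is a real problem and must surface


def insert_new_row():
    return 42  # pretend this is the lastrowid of a successful insert


def insert_duplicate_row():
    raise FakeMySQLError(1062, "Duplicate entry 'http://a.com/' for key 'url'")


def insert_into_missing_table():
    raise FakeMySQLError(1146, "Table 'db_name.no_such_table' doesn't exist")


first = run_ignoring_duplicates(insert_new_row)         # 42
second = run_ignoring_duplicates(insert_duplicate_row)  # None
```

One consequence worth keeping in mind: a caller cannot tell a skipped duplicate from a statement that produced no id, because both come back as None.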

3. How to use it

This implementation is a simple wrapper around pymysql, but it provides some convenient operations:

1. Create a MySQL connection

db = Connection(
    'localhost',
    'db_name',
    'user',
    'password'
)

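Usually four arguments are enough to create a connection:

host: the database address, localhost in this section
database: the database name
user: the database user name
password: the database user's password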
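
The host argument is more flexible than the example suggests: the constructor accepts a plain host name, a host:port string, or a path to a MySQL socket file. The sketch below copies that branching out of __init__ into a standalone function so the rules are easy to see. parse_host is a name made up for this illustration and does not exist in ezpymysql.

```python
def parse_host(host, port=0):
    """Return the connection keyword arguments implied by a host string.

    Mirrors the branching in Connection.__init__; parse_host itself is a
    hypothetical helper used only for this illustration.
    """
    args = {}
    if "/" in host:
        # Anything containing a slash is treated as a Unix socket path.
        args["unix_socket"] = host
    else:
        pair = host.split(":")
        if len(pair) == 2:
            args["host"] = pair[0]
            args["port"] = int(pair[1])
        else:
            args["host"] = host
            args["port"] = 3306  # MySQL's default port
    if port:
        # An explicit port argument wins over one embedded in the host string.
        args["port"] = port
    return args


plain = parse_host("localhost")
with_port = parse_host("127.0.0.1:3307")
socket_path = parse_host("/var/run/mysqld/mysqld.sock")
explicit = parse_host("127.0.0.1:3307", port=3308)
```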

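A few more arguments are available when needed:

max_idle_time: by default the MySQL server drops client connections that have been idle for 8 hours; this argument tells the client how long it may sit idle before it reconnects;
time_zone: defaults to UTC (+0:00); set it to your own time zone if needed, for example +8:00 for UTC+8;
charset: defaults to utf8mb4, the UTF-8 variant that supports emoji characters;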
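
To make the max_idle_time behavior concrete, here is a minimal sketch of the timing check that _ensure_connected() performs before each query. IdleTracker and its injectable clock are assumptions added for this illustration so the logic can run without waiting for hours; the real class calls reconnect() where the sketch only counts, and it also reconnects when no connection exists yet.

```python
import time


class IdleTracker:
    """Hypothetical model of the idle-time check in _ensure_connected()."""

    def __init__(self, max_idle_time=7 * 3600, clock=time.time):
        self.max_idle_time = float(max_idle_time)
        self._clock = clock
        self._last_use_time = clock()
        self.reconnects = 0

    def touch(self):
        """Call before each query; 'reconnect' first if idle for too long."""
        now = self._clock()
        if now - self._last_use_time > self.max_idle_time:
            self.reconnects += 1  # the real class would call reconnect() here
        self._last_use_time = now


# Drive the tracker with a fake clock so no real waiting is needed.
fake_now = [0.0]
tracker = IdleTracker(max_idle_time=10, clock=lambda: fake_now[0])

fake_now[0] = 5.0
tracker.touch()  # idle for 5 seconds: still within the limit

fake_now[0] = 16.0
tracker.touch()  # idle for 11 seconds: over the limit, so reconnect

fake_now[0] = 20.0
tracker.touch()  # idle for 4 seconds since the last use: no reconnect
```

The default of 7 hours sits just under the server's 8-hour default, so the client reconnects on its own terms before the server closes the connection.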

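2. Database operations
Database operations fall into two categories: reads and writes.
Reads: get() fetches a single record and returns a dict whose keys are the table's columns (or None when no row matches); query() fetches a set of records and returns a list in which each item is a dict like the one get() returns.
Writes: use insert() or execute(); as the source code shows, insert is simply an alias for execute.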
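
All of these methods hand `kwparameters or parameters` to the cursor, so a query can use either positional %s placeholders or named %(name)s placeholders, both of which PyMySQL accepts. The sketch below shows only how that expression picks the arguments; choose_args is a hypothetical helper for this illustration, not a method of ezpymysql.

```python
def choose_args(*parameters, **kwparameters):
    """Return what get(), query() and execute() pass on to cursor.execute()."""
    return kwparameters or parameters


# Positional values become a tuple, for queries written with %s:
#   db.get('select * from test_table where id=%s', 2)
positional = choose_args(2)

# Keyword values become a dict, for queries written with %(name)s:
#   db.get('select * from test_table where id=%(id)s', id=2)
named = choose_args(id=2)

# If both styles are mixed, the keyword values win and the positional
# ones are silently dropped, so stick to one style per call.
mixed = choose_args(2, id=3)

# With no values at all, an empty tuple is passed.
empty = choose_args()
```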

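3. Advanced operations
The methods whose names start with table_:

table_has() checks whether a value exists in a table. The column being queried should have an index in MySQL; otherwise the lookup becomes slow once the table grows even moderately large.
table_insert() inserts a dict into a table. The dict's keys must be columns of the table.
table_update() updates a record in a table. field_where should be an indexed column; otherwise the update becomes slow once the table grows even moderately large.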
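
It can help to see the SQL these helpers generate from a dict. The sketch below reproduces just the statement-building step, with no database involved. build_insert and build_update are hypothetical names for this illustration; in this sketch the WHERE value is passed as a parameter alongside the updated values rather than formatted into the SQL text.

```python
def build_insert(table_name, item):
    """Build an INSERT statement and its values from a dict of column: value."""
    fields = list(item.keys())
    values = list(item.values())
    sql = 'INSERT INTO %s (%s) VALUES(%s)' % (
        table_name,
        ','.join(fields),
        ','.join(['%s'] * len(fields)),  # one placeholder per column
    )
    return sql, values


def build_update(table_name, updates, field_where, value_where):
    """Build an UPDATE statement and its values from a dict of changes."""
    assignments = ','.join('%s=%%s' % k for k in updates)
    sql = 'UPDATE %s SET %s WHERE %s=%%s' % (
        table_name, assignments, field_where)
    return sql, list(updates.values()) + [value_where]


insert_sql, insert_values = build_insert(
    'test_table', {'title': 'test', 'url': 'http://a.com/'})
update_sql, update_values = build_update(
    'test_table', {'title': 'new title'}, 'id', 2)
```

Only the values travel as parameters. Table and column names are formatted straight into the statement, so they should come from your own code and never from crawled or user-supplied data.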

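And that is our MySQL wrapper module. Its concise methods will speed up the crawlers we write from here on, and it is a handy everyday tool for storing crawler data, so go ahead and bookmark it.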

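Crawler knowledge points

1. The logging module
Python's module for writing logs. It can write to the screen (stdout, stderr) or to a file. A running crawler can hit all sorts of strange exceptions, and recording them all goes a long way toward improving the crawler.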
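
As a small illustration of recording an exception with logging, the sketch below logs a failure together with its traceback, the same exc_info=True technique the Connection constructor uses when it cannot connect. The logger name crawler_demo, the message text, and the in-memory stream are assumptions chosen so the example runs anywhere; a real crawler would more likely attach a logging.FileHandler.

```python
import io
import logging


def make_logger(stream):
    """Create a logger that writes 'LEVEL message' lines to the given stream."""
    logger = logging.getLogger("crawler_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = make_logger(buffer)

try:
    int("not a number")  # stands in for any failure inside a crawler
except ValueError:
    # exc_info=True appends the full traceback to the log record.
    log.error("failed to parse %s", "page-count", exc_info=True)

output = buffer.getvalue()
```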

2. pymysql
A MySQL client implemented in pure Python. In use, we wrap it as ezpymysql.

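The groundwork is now done. In the next article we implement:
a synchronous, targeted news crawler

I share more tips and experience on my WeChat official account, 猿人學 Python; you are welcome to follow it.

Copyright notice: unless otherwise stated, all articles are original works of 猿人學 yuanrenxue.com. Do not reproduce them in any form without authorization from 猿人學.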
