Python 3 crawler to database: saving scraped data into a database, with database-side deduplication

Posted by 碼農小石頭 on 2018-10-22

Original URL: https://juejin.im/post/5bcd260b51882577d04c6904

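This is the third article in the Python 3 hands-on beginner series. It builds on the first two, so it will be hard going if you have not read them.

Now let's save the news data scraped in the first article into a MySQL database.

I. First, connect to the database

We define a MySQLCommand class that holds the database connection parameters, plus a connectMysql method that opens the connection.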

# -*- coding: utf-8 -*-
# Author WeChat: 2501902696
import pymysql


# Helper class for working with the database
class MySQLCommand(object):
    # Set up the connection parameters
    def __init__(self):
        self.host = 'localhost'
        self.port = 3306  # port
        self.user = 'root'  # user name
        self.password = ""  # password
        self.db = "home"  # database
        self.table = "home_list"  # table

    # Connect to the database
    def connectMysql(self):
        try:
            self.conn = pymysql.connect(host=self.host, port=self.port, user=self.user,
                                        password=self.password, database=self.db, charset='utf8')
            self.cursor = self.conn.cursor()
        except pymysql.Error as e:
            # Report the cause and re-raise: without a connection nothing else can work
            print('connect mysql error:', e)
            raise
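Hard-coding the credentials in `__init__` is fine for a local demo, but a common alternative is to read them from environment variables. The sketch below shows one way to do that. The variable names (`MYSQL_HOST`, `MYSQL_PORT`, and so on) and the `load_db_config` helper are assumptions made for illustration and are not part of the original code; the defaults mirror the values used above.

```python
import os


def load_db_config(env=None):
    """Build pymysql.connect() keyword arguments from environment variables.

    The MYSQL_* names are hypothetical; the defaults match the article's values.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("MYSQL_HOST", "localhost"),
        "port": int(env.get("MYSQL_PORT", "3306")),
        "user": env.get("MYSQL_USER", "root"),
        "password": env.get("MYSQL_PASSWORD", ""),
        "database": env.get("MYSQL_DB", "home"),
        "charset": "utf8",
    }


# A plain dict stands in for os.environ so the example is reproducible
config = load_db_config({"MYSQL_HOST": "127.0.0.1", "MYSQL_PORT": "3307"})
print(config["host"], config["port"], config["database"])
```

The resulting dict could then be passed on as `pymysql.connect(**config)`.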

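II. Once connected, we can insert data

Before inserting, two questions need answers:

1. How do we avoid inserting duplicate rows?
2. Where should the primary key id of new rows start?

The partial code below shows one way to handle both.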

    # Insert a row. Query first to see whether it already exists, and skip the insert if it does
    def insertData(self, my_dict):
        # Parameterized query: the driver quotes and escapes the value itself,
        # so no quotes or padding spaces are needed around %s
        sqlExit = "SELECT url FROM " + self.table + " WHERE url = %s"
        res = self.cursor.execute(sqlExit, (my_dict['url'],))
        if res:  # res is the number of rows found; greater than 0 means the row already exists
            print("Row already exists", res)
            return 0
        # Only reached when the row does not exist yet
        cols = ', '.join(my_dict.keys())  # comma-separated column names (keys must come from trusted code)
        placeholders = ', '.join(['%s'] * len(my_dict))
        sql = "INSERT INTO %s (%s) VALUES (%s)" % (self.table, cols, placeholders)
        # The assembled SQL looks like this:
        # INSERT INTO home_list (id, title, url, img_path) VALUES (%s, %s, %s, %s)
        try:
            result = self.cursor.execute(sql, list(my_dict.values()))
            insert_id = self.conn.insert_id()  # id of the row just inserted
            self.conn.commit()
            # Check whether the insert succeeded
            if result:
                print("Insert succeeded", insert_id)
                return insert_id + 1
        except pymysql.Error as e:
            # Roll back on error
            self.conn.rollback()
            # 1062 is ER_DUP_ENTRY: the key must be unique, so the row cannot be inserted
            if e.args[0] == 1062:
                print("Row already exists, nothing inserted")
            else:
                print("Insert failed, reason %d: %s" % (e.args[0], e.args[1]))
        return 0

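With that code in hand, let's look at how the deduplication works.

Before every insert we query whether the row already exists, and skip the insert if it does. Our home_list table has the columns id, title, url, and img_path. Looking at the scraped data, title and img_path can both be empty, so we deduplicate on the url column. Once you know the idea, the code above should be easy to follow.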
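As a complement to the query-before-insert approach, the database itself can enforce the rule through a unique constraint on `url`. The sketch below uses the standard-library `sqlite3` module as a stand-in so that it runs without a MySQL server; that substitution is an assumption made for illustration, since the article uses MySQL through pymysql. In MySQL the rough equivalent would be a `UNIQUE` index on `url` combined with `INSERT IGNORE`. The `insert_unique` helper is hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption for this sketch)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE home_list ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT,"
    " url TEXT UNIQUE,"  # the unique constraint performs the deduplication
    " img_path TEXT)"
)


def insert_unique(conn, row):
    """Insert a row unless its url already exists; return True if a row was added."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO home_list (title, url, img_path) VALUES (?, ?, ?)",
        (row["title"], row["url"], row["img_path"]),
    )
    conn.commit()
    return cur.rowcount == 1


first = insert_unique(conn, {"title": "a", "url": "https://www.huxiu.com/1.html", "img_path": ""})
second = insert_unique(conn, {"title": "b", "url": "https://www.huxiu.com/1.html", "img_path": ""})
count = conn.execute("SELECT COUNT(*) FROM home_list").fetchone()[0]
print(first, second, count)
```

The second call tries to store the same url again and is silently skipped, so the check and the insert happen in a single statement rather than two.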

III. Query the id of the last row to decide where new ids start

The getLastId function below returns the id of the last row in the home_list table.

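    # Query the id of the last row
    def getLastId(self):
        sql = "SELECT max(id) FROM " + self.table
        try:
            self.cursor.execute(sql)
            row = self.cursor.fetchone()  # fetch the single result row
            if row[0]:
                return row[0]  # id of the last row
            else:
                return 0  # the table is empty, so return 0
        except pymysql.Error:
            print(sql + ' execute failed.')
            raise  # the caller needs a number, so do not silently return None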
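Computing the next id from `max(id)` works for a single script, but two crawlers running at the same time could pick the same value. One alternative, sketched here under the assumption that the `id` column can be declared auto-incrementing, is to leave `id` out of the insert and read the generated value back afterwards. The example uses `sqlite3` as a stand-in for MySQL so it can run anywhere; with pymysql the generated value would be read from `cursor.lastrowid`. The `insert_row` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# AUTOINCREMENT here plays the role of MySQL's AUTO_INCREMENT
conn.execute(
    "CREATE TABLE home_list ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " title TEXT, url TEXT, img_path TEXT)"
)


def insert_row(conn, title, url, img_path):
    """Insert without an explicit id and return the id the database assigned."""
    cur = conn.execute(
        "INSERT INTO home_list (title, url, img_path) VALUES (?, ?, ?)",
        (title, url, img_path),
    )
    conn.commit()
    return cur.lastrowid


ids = [insert_row(conn, "t%d" % i, "https://www.huxiu.com/%d.html" % i, "") for i in range(3)]
print(ids)
```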

Here is the complete code for the MySQLCommand database helper class.

# -*- coding: utf-8 -*-
# Author WeChat: 2501902696
import pymysql


# Helper class for working with the database
class MySQLCommand(object):
    # Set up the connection parameters
    def __init__(self):
        self.host = 'localhost'
        self.port = 3306  # port
        self.user = 'root'  # user name
        self.password = ""  # password
        self.db = "home"  # database
        self.table = "home_list"  # table

    # Connect to the database
    def connectMysql(self):
        try:
            self.conn = pymysql.connect(host=self.host, port=self.port, user=self.user,
                                        password=self.password, database=self.db, charset='utf8')
            self.cursor = self.conn.cursor()
        except pymysql.Error as e:
            # Report the cause and re-raise: without a connection nothing else can work
            print('connect mysql error:', e)
            raise

    # Insert a row. Query first to see whether it already exists, and skip the insert if it does
    def insertData(self, my_dict):
        # Parameterized query: the driver quotes and escapes the value itself,
        # so no quotes or padding spaces are needed around %s
        sqlExit = "SELECT url FROM " + self.table + " WHERE url = %s"
        res = self.cursor.execute(sqlExit, (my_dict['url'],))
        if res:  # res is the number of rows found; greater than 0 means the row already exists
            print("Row already exists", res)
            return 0
        # Only reached when the row does not exist yet
        cols = ', '.join(my_dict.keys())  # comma-separated column names (keys must come from trusted code)
        placeholders = ', '.join(['%s'] * len(my_dict))
        sql = "INSERT INTO %s (%s) VALUES (%s)" % (self.table, cols, placeholders)
        # The assembled SQL looks like this:
        # INSERT INTO home_list (id, title, url, img_path) VALUES (%s, %s, %s, %s)
        try:
            result = self.cursor.execute(sql, list(my_dict.values()))
            insert_id = self.conn.insert_id()  # id of the row just inserted
            self.conn.commit()
            # Check whether the insert succeeded
            if result:
                print("Insert succeeded", insert_id)
                return insert_id + 1
        except pymysql.Error as e:
            # Roll back on error
            self.conn.rollback()
            # 1062 is ER_DUP_ENTRY: the key must be unique, so the row cannot be inserted
            if e.args[0] == 1062:
                print("Row already exists, nothing inserted")
            else:
                print("Insert failed, reason %d: %s" % (e.args[0], e.args[1]))
        return 0

    # Query the id of the last row
    def getLastId(self):
        sql = "SELECT max(id) FROM " + self.table
        try:
            self.cursor.execute(sql)
            row = self.cursor.fetchone()  # fetch the single result row
            if row[0]:
                return row[0]  # id of the last row
            else:
                return 0  # the table is empty, so return 0
        except pymysql.Error:
            print(sql + ' execute failed.')
            raise  # the caller needs a number, so do not silently return None

    # Close the cursor and the connection
    def closeMysql(self):
        self.cursor.close()
        self.conn.close()
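Because `insertData` assembles its SQL from dictionary keys and values, it can help to isolate that step in a small pure function that can be checked without a database. The sketch below builds a parameterized statement, in which the values travel separately from the SQL text so that a title containing quotes cannot break it. `build_insert` and its identifier check are hypothetical helpers written for illustration, not part of the article's class.

```python
import re

# Table and column names cannot be passed as query parameters, so they are
# validated against a conservative pattern instead (an assumption of this sketch).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_insert(table, row):
    """Return (sql, params) for a parameterized INSERT built from a dict."""
    for name in [table] + list(row.keys()):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError("unsafe identifier: %r" % name)
    cols = ", ".join(row.keys())
    placeholders = ", ".join(["%s"] * len(row))
    sql = "INSERT INTO {} ({}) VALUES ({})".format(table, cols, placeholders)
    return sql, list(row.values())


sql, params = build_insert(
    "home_list",
    {"id": "12", "title": 'He said "hi"', "url": "https://www.huxiu.com/90.html", "img_path": ""},
)
print(sql)
print(params)
```

The returned pair would be handed to pymysql as `cursor.execute(sql, params)`, leaving the quoting of each value to the driver.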

And here is the code that inserts the scraped data into the database.

# -*- coding: utf-8 -*-
# Author WeChat: 2501902696
from bs4 import BeautifulSoup
from urllib import request
import chardet

from db.MySQLCommand import MySQLCommand

url = "https://www.huxiu.com"
response = request.urlopen(url)
html = response.read()
charset = chardet.detect(html)
# Decode the fetched HTML with the detected encoding; fall back to utf-8 if detection fails
html = html.decode(charset["encoding"] or "utf-8")

# Use html.parser as the parser
soup = BeautifulSoup(html, 'html.parser')
# Select every node with class=hot-article-img
allList = soup.select('.hot-article-img')

# Connect to the database
mysqlCommand = MySQLCommand()
mysqlCommand.connectMysql()
# Start from the id of the last row in the database; each successful insert advances the id by 1
dataCount = int(mysqlCommand.getLastId()) + 1
for news in allList:  # walk the list and pull out the useful fields
    aaa = news.select('a')
    # Only handle nodes that contain at least one link
    if len(aaa) > 0:
        # Article link
        try:  # an exception means the attribute is missing
            href = url + aaa[0]['href']
        except KeyError:
            href = ''
        # Article image URL
        try:
            imgUrl = aaa[0].select('img')[0]['src']
        except (IndexError, KeyError):
            imgUrl = ""
        # News title
        try:
            title = aaa[0]['title']
        except KeyError:
            title = ""

        # Combine the scraped fields into a dict for the database insert
        news_dict = {
            "id": str(dataCount),
            "title": title,
            "url": href,
            "img_path": imgUrl
        }
        try:
            # Insert the row; if it already exists it is not inserted again
            res = mysqlCommand.insertData(news_dict)
            if res:
                dataCount = res
        except Exception as e:
            print("Insert failed", str(e))  # print the error raised by the failed insert
mysqlCommand.closeMysql()  # always close the database connection at the end
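One detail in the crawler worth a closer look is `href = url + aaa[0]['href']`, which assumes every link on the page is site-relative. The standard library's `urllib.parse.urljoin` handles relative and already-absolute links alike. The sample paths below are made up for illustration and are not taken from the real page.

```python
from urllib.parse import urljoin

base = "https://www.huxiu.com"

# Hypothetical href values of the kinds a page might contain
samples = [
    "/article/90.html",                       # site-relative path
    "article/91.html",                        # relative path without a leading slash
    "https://www.huxiu.com/article/92.html",  # already absolute
]

resolved = [urljoin(base, href) for href in samples]
for link in resolved:
    print(link)
```

With plain concatenation the second sample would lose its separator and the third would have the site prefix prepended twice, whereas `urljoin` resolves all three to a well-formed absolute URL.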

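If anything in the code above is unclear, see the first article in this series: "Python 3 hands-on introduction, crawler edition: web page crawler, image crawler, article crawler, scraping news from a news site with Python".

That wraps up the Python 3 crawler plus Python 3 database installment. Here is what it looks like in action.

The GIF quality is not great, so please bear with it ☺

Written on day four of the Python from-zero hands-on beginner series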
