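Preface: When crawling, we often need to scrape the same website repeatedly. To keep duplicate records out of the database, this article implements incremental deduplication. It also adds a scheduled-crawl feature for sites whose data must be kept up to date in real time.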

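Approach:

1. Get the target URL
2. Parse the page
3. Store the results in the database (incremental deduplication)
4. Handle exceptions
5. Keep the data up to date (scheduled crawling)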
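
The core of step 3 can be sketched without a database: remember which issue numbers have already been stored and keep only the rows whose key is new. The function name `filter_new` and the sample rows below are illustrative assumptions, not part of the article's code.

```python
def filter_new(rows, seen):
    """Return the rows whose issue number is not in `seen`, and record them.

    `rows` is an iterable of (issue, time_str, num_code) tuples; `seen` is a
    set of issue numbers stored by earlier crawls (hypothetical data layout).
    """
    fresh = []
    for row in rows:
        issue = row[0]
        if issue in seen:
            continue  # already stored by a previous crawl
        seen.add(issue)
        fresh.append(row)
    return fresh


if __name__ == '__main__':
    seen = set()
    # Made-up sample rows: the second crawl overlaps the first on issue '002'.
    first_crawl = [('001', '10:00', '1,2,3'), ('002', '10:02', '4,5,6')]
    second_crawl = [('002', '10:02', '4,5,6'), ('003', '10:04', '7,8,9')]
    print(filter_new(first_crawl, seen))
    print(filter_new(second_crawl, seen))
```

An in-memory set is lost when the process exits, which is why the article keeps the record of what has been stored in the database itself.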

The database configuration, mysql_config.py:

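    import pymysql


    def insert_db(db_table, issue, time_str, num_code):
        """Insert one record into db_table."""
        host = '127.0.0.1'
        user = 'root'
        password = 'root'
        port = 3306
        db = 'lottery'
        data_base = pymysql.connect(host=host, user=user, password=password, port=port, database=db)
        cursor = data_base.cursor()
        try:
            # A table name cannot be a bound parameter, so it is formatted in;
            # the values are passed separately so the driver escapes them.
            sql = "INSERT INTO `%s` VALUES (%%s, %%s, %%s)" % db_table
            cursor.execute(sql, (issue, time_str, num_code))
            data_base.commit()
        except pymysql.MySQLError:
            # Database errors (such as a duplicate key) are MySQLError, not ValueError.
            data_base.rollback()
            raise  # let the caller report the failure
        finally:
            cursor.close()
            data_base.close()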

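    def select_db(issue, db_table):
        """Return issue if it is not yet stored in db_table, otherwise None."""
        host = '127.0.0.1'
        user = 'root'
        password = 'root'
        port = 3306
        db = 'lottery'
        data_base = pymysql.connect(host=host, user=user, password=password, port=port, database=db)
        cursor = data_base.cursor()
        try:
            # Assumption: the issue number is stored in a column named `issue`;
            # the article never shows the table schema.
            sql = "SELECT 1 FROM `%s` WHERE issue = %%s LIMIT 1" % db_table
            cursor.execute(sql, (issue,))
            # A read needs neither commit nor rollback.
            return None if cursor.fetchone() else issue
        finally:
            cursor.close()
            data_base.close()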
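
An alternative to checking before every insert is to let the database reject duplicates itself. The sketch below uses Python's built-in sqlite3 module as a stand-in so that it runs without a MySQL server. The table layout and the column names (`issue`, `time_str`, `num_code`) are assumptions, since the article does not show the schema, and `insert_if_new` is a hypothetical helper. With a primary key on the issue column, `INSERT OR IGNORE` (MySQL spells it `INSERT IGNORE`) skips rows that are already stored.

```python
import sqlite3


def create_table(conn):
    # Assumed schema: the issue number is the primary key, so it must be unique.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lottery_table ("
        "issue TEXT PRIMARY KEY, time_str TEXT, num_code TEXT)"
    )


def insert_if_new(conn, issue, time_str, num_code):
    """Insert the row unless its issue is already stored; return True if inserted."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO lottery_table VALUES (?, ?, ?)",
        (issue, time_str, num_code),
    )
    conn.commit()
    # rowcount is 1 for a new row and 0 when the insert was ignored.
    return cursor.rowcount == 1


if __name__ == '__main__':
    conn = sqlite3.connect(':memory:')
    create_table(conn)
    print(insert_if_new(conn, '001', '10:00', '1,2,3'))  # first time: inserted
    print(insert_if_new(conn, '001', '10:00', '1,2,3'))  # same issue again: ignored
    conn.close()
```

This replaces the separate lookup and insert with a single statement, which also avoids two round trips to the database per row.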

Next is the main code, test.py:

    # Parses the page with bs4
    # Implements incremental deduplication
    # Implements scheduled crawling
    import datetime
    import time

    from bs4 import BeautifulSoup
    import requests

    from mysql_config import insert_db
    from mysql_config import select_db


    def my_test():
        db_table = 'lottery_table'
        url = ''  # the target URL is missing from the source; fill it in before running
        res = requests.get(url, timeout=10)
        content = res.content
        soup = BeautifulSoup(content, 'html.parser', from_encoding='utf8')
        c_t = soup.select('#trend_table')[0]
        trs = c_t.contents[4:]
        for tr in trs:
            # Skip the newline text nodes between rows
            if tr == '\n':
                continue
            tds = tr.select('td')
            issue = tds[1].text
            time_str = tds[0].text
            # For example, '\n03\n07\n11\n' becomes '3,7,11'
            num_code = tr.table.text.replace('\n0', ',').replace('\n', ',').strip(',')
            print('Issue: %s\tTime: %s\tNumbers: %s' % (str(issue), str(time_str), str(num_code)))
            issue_db = select_db(issue, db_table)
            try:
                # select_db hands back the issue when it still needs to be stored
                if issue_db == issue:
                    insert_db(db_table, issue_db, time_str, num_code)
                    print('Added %s to %s' % (issue_db, db_table))
                else:
                    print('%s already exists, skipped' % issue)
            except Exception as e:
                print('Could not add %s; it may already exist!' % issue)
                print(e)

    if __name__ == '__main__':
        # First run 3 seconds after start, then one run every 2 minutes
        sched_time = datetime.datetime.now() + datetime.timedelta(seconds=3)
        while True:
            now = datetime.datetime.now()
            if sched_time < now:
                print(now)
                my_test()
                # Move the schedule forward; otherwise the crawl repeats immediately
                sched_time = now + datetime.timedelta(minutes=2)
            else:
                # Sleep instead of spinning while waiting for the next run
                time.sleep(1)
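
The scheduling loop can also be written as a small reusable function that sleeps between runs instead of polling the clock. This is a sketch under assumptions: `run_every`, its `max_runs` parameter, and the injectable `sleep` argument are hypothetical names introduced here so that the loop can be stopped and exercised without waiting.

```python
import time


def run_every(job, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call job() repeatedly, pausing interval_seconds between calls.

    max_runs=None loops forever; a number stops after that many calls.
    A failing job is reported and does not stop the schedule.
    Returns the number of calls made.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as e:
            print('job failed: %s' % e)
        runs += 1
        # Pause only if another run is coming
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


if __name__ == '__main__':
    # Three quick runs one second apart; a crawler would pass my_test and 120.
    run_every(lambda: print('crawl at', time.strftime('%H:%M:%S')), 1, max_runs=3)
```

Catching exceptions inside the loop means one failed request does not end the whole schedule, which matters for a crawler that is meant to run unattended.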
