Python Crawler: An Example of Incremental Deduplication and Scheduled Crawling

Posted by ckxllf on 2020-03-06

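  Preface: When crawling, we often need to fetch the same website repeatedly. To keep duplicate records out of the database, this article uses incremental deduplication, so that only records not yet stored are inserted. For sites whose data changes in real time, it also adds a scheduled crawl.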

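  Approach:

  1. Get the target URL

  2. Parse the web page

  3. Store the data in the database (incremental deduplication)

  4. Handle exceptions

  5. Keep the data up to date (scheduled crawling)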
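
  Step 3 is the core of the approach. As a minimal, hedged sketch of incremental deduplication, the following uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL, so it runs without a server. The column names (`issue`, `time_str`, `num_code`) and the helpers `save_if_new` and `make_demo_db` are assumptions made for illustration; they are not taken from the original code.

```python
import sqlite3


def save_if_new(conn, issue, time_str, num_code):
    """Insert the record only if its issue number is not stored yet.

    Returns True when a new row was inserted, False when it was a duplicate.
    """
    cur = conn.cursor()
    # Look the key up first; the placeholder keeps the value out of the SQL text.
    cur.execute("SELECT 1 FROM lottery_table WHERE issue = ?", (issue,))
    if cur.fetchone() is not None:
        return False
    cur.execute(
        "INSERT INTO lottery_table (issue, time_str, num_code) VALUES (?, ?, ?)",
        (issue, time_str, num_code),
    )
    conn.commit()
    return True


def make_demo_db():
    """Create an in-memory database with a hypothetical three-column table."""
    conn = sqlite3.connect(":memory:")
    # The PRIMARY KEY acts as a second line of defence against duplicates.
    conn.execute(
        "CREATE TABLE lottery_table ("
        "issue TEXT PRIMARY KEY, time_str TEXT, num_code TEXT)"
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    print(save_if_new(conn, "20200306-001", "10:00", "1,2,3"))  # first crawl
    print(save_if_new(conn, "20200306-001", "10:00", "1,2,3"))  # repeated crawl
```

  Crawling the same page twice therefore leaves a single row per issue number, which is the behaviour the rest of the article aims for with MySQL.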

  Below is the database configuration, mysql_config.py:

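    import pymysql


    def insert_db(db_table, issue, time_str, num_code):
        """Insert one record; database errors are re-raised to the caller."""
        host = '127.0.0.1'
        user = 'root'
        password = 'root'
        port = 3306
        db = 'lottery'
        data_base = pymysql.connect(host=host, user=user, password=password, port=port, db=db)
        cursor = data_base.cursor()
        try:
            # A table name cannot be a bound parameter, so it is formatted in;
            # the values are passed separately and escaped by the driver.
            sql = "INSERT INTO %s VALUES (%%s, %%s, %%s)" % db_table
            cursor.execute(sql, (issue, time_str, num_code))
            data_base.commit()
        except pymysql.MySQLError:
            # pymysql raises MySQLError subclasses, not ValueError, for example
            # IntegrityError when a unique key is violated.
            data_base.rollback()
            raise
        finally:
            cursor.close()
            data_base.close()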

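    def select_db(issue, db_table):
        """Return issue if it is not yet stored in db_table, otherwise None."""
        host = '127.0.0.1'
        user = 'root'
        password = 'root'
        port = 3306
        db = 'lottery'
        data_base = pymysql.connect(host=host, user=user, password=password, port=port, db=db)
        cursor = data_base.cursor()
        found = False
        try:
            # Assumption: the issue number is stored in a column named `issue`.
            # The original post does not show the schema, so adjust as needed.
            sql = "SELECT 1 FROM %s WHERE issue = %%s LIMIT 1" % db_table
            cursor.execute(sql, (issue,))
            found = cursor.fetchone() is not None
        except pymysql.MySQLError as e:
            # On a query error, report it and let the caller attempt the insert.
            print(e)
        finally:
            cursor.close()
            data_base.close()
        # The caller inserts only when the returned value equals issue.
        return None if found else issue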
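
  Both functions build their SQL by formatting the table name into the statement, because SQL placeholders can bind values but not identifiers. As a hedged sketch, a table name could be checked against a conservative pattern before it is spliced in. The helper names `safe_table_name` and `build_insert_sql` and the accepted pattern are assumptions for illustration; they are not part of the original mysql_config.py.

```python
import re

# Letters, digits and underscores only, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_table_name(name):
    """Return name unchanged if it looks like a plain identifier, else raise."""
    if not isinstance(name, str) or _IDENTIFIER.fullmatch(name) is None:
        raise ValueError("unsafe table name: %r" % (name,))
    return name


def build_insert_sql(db_table):
    """Build an INSERT whose three values are left as driver placeholders."""
    # %%s survives the formatting as %s, the placeholder style pymysql uses.
    return "INSERT INTO %s VALUES (%%s, %%s, %%s)" % safe_table_name(db_table)


if __name__ == "__main__":
    print(build_insert_sql("lottery_table"))
```

  The resulting string would then be passed to `cursor.execute` together with a tuple of values, so that only the validated table name is ever formatted into the SQL text.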

  Next is the main code, test.py:

    # Parses the page with bs4
    # Implements incremental deduplication
    # Implements scheduled crawling
    import datetime
    import time

    from bs4 import BeautifulSoup
    import requests

    from mysql_config import insert_db
    from mysql_config import select_db


    def my_test():
        db_table = 'lottery_table'
        url = ''  # the target URL is missing from the original post
        res = requests.get(url, timeout=10)
        content = res.content
        soup = BeautifulSoup(content, 'html.parser', from_encoding='utf8')
        c_t = soup.select('#trend_table')[0]
        # The first four child nodes are skipped (presumably header rows)
        trs = c_t.contents[4:]
        for tr in trs:
            # The children include bare newline strings between the rows
            if tr == '\n':
                continue
            tds = tr.select('td')
            issue = tds[1].text
            time_str = tds[0].text
            # Flatten the nested table's text into a comma-separated string
            num_code = tr.table.text.replace('\n0', ',').replace('\n', ',').strip(',')
            print('Issue: %s\tTime: %s\tNumbers: %s' % (str(issue), str(time_str), str(num_code)))
            issue_db = select_db(issue, db_table)
            try:
                if issue_db == issue:
                    insert_db(db_table, issue_db, time_str, num_code)
                    print('Added %s to %s' % (issue_db, db_table))
            except Exception as e:
                # A failed insert most likely means the record is already stored
                print('%s already exists!' % issue_db)
                print(e)
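
  The `num_code` line in `my_test()` flattens the text of the nested table into a comma-separated string. The sketch below replays the same chain of string calls on a made-up input. The sample text is an assumption about what `tr.table.text` might look like, since the original page is not shown, and `normalize_numbers` is a hypothetical helper name.

```python
def normalize_numbers(raw_text):
    """Apply the same replace/strip chain that my_test() uses for num_code."""
    # '\n0' -> ',' drops one leading zero; the remaining newlines become commas.
    return raw_text.replace('\n0', ',').replace('\n', ',').strip(',')


if __name__ == "__main__":
    sample = '\n03\n07\n12\n'  # hypothetical cell text, one number per line
    print(normalize_numbers(sample))
```

  Note that the first replace also strips the leading zero of each number, so `03` is stored as `3` while `12` is left unchanged.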

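    # Entry point of test.py: crawl once shortly after start-up, then every 2 minutes
    if __name__ == '__main__':
        now = datetime.datetime.now()
        # The first run is scheduled 3 seconds after start-up
        sched_time = now.replace(microsecond=0) + datetime.timedelta(seconds=3)
        while True:
            now = datetime.datetime.now()
            if sched_time <= now:
                print(now)
                my_test()
                # Schedule the next crawl 2 minutes after this one was due;
                # without this step the crawl would repeat on every pass
                sched_time = sched_time + datetime.timedelta(minutes=2)
            else:
                # Sleep briefly instead of spinning the CPU while waiting
                time.sleep(1)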
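
  The scheduling idea (run the job, then push the next run time forward by a fixed interval) can be sketched with an injectable clock so that it is easy to test. This is a minimal illustration under stated assumptions, not the author's code: `run_every`, `max_runs`, and the `now` and `sleep` parameters are hypothetical names introduced here.

```python
import datetime
import time


def run_every(job, interval, max_runs, now=datetime.datetime.now, sleep=time.sleep):
    """Call job() max_runs times, spaced interval apart, starting immediately."""
    next_run = now()
    runs = 0
    while runs < max_runs:
        current = now()
        if current >= next_run:
            job()
            runs += 1
            # Advance from the planned time, not the actual one, to avoid drift.
            next_run = next_run + interval
        else:
            # Wait only as long as needed instead of spinning in a busy loop.
            sleep((next_run - current).total_seconds())


if __name__ == "__main__":
    run_every(lambda: print("crawl at", datetime.datetime.now()),
              datetime.timedelta(seconds=1), max_runs=3)
```

  In the article's setting, `job` would be `my_test` and `interval` would be `datetime.timedelta(minutes=2)`. A real crawler would loop indefinitely, so `max_runs` exists only to keep the example finite.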


Source: ITPUB Blog, link: http://blog.itpub.net/69945560/viewspace-2678784/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
