This article was first published on Zhihu.
This article uses multithreading to build a simple crawler framework, so that we only have to worry about parsing the pages and not about setting up threads, queues, and so on ourselves. The way it is called is similar to scrapy, but many features are still missing, which is why I call it a simple crawler framework.
The framework provides a Spider class; by writing only the code below, we get a crawler that runs on multiple threads:
class DouBan(Spider):

    def __init__(self):
        super(DouBan, self).__init__()
        self.start_url = 'https://movie.douban.com/top250'
        self.filename = 'douban.json'  # override the default
        self.output_result = False
        self.thread_num = 10

    def start_requests(self):  # override the default method
        yield (self.start_url, self.parse_first)

    def parse_first(self, url):  # just yield the URLs to crawl and their callbacks
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')

        movies = soup.find_all('div', class_='info')[:5]
        for movie in movies:
            url = movie.find('div', class_='hd').a['href']
            yield (url, self.parse_second)

        nextpage = soup.find('span', class_='next').a
        if nextpage:
            nexturl = self.start_url + nextpage['href']
            yield (nexturl, self.parse_first)
        else:
            self.running = False  # once we reach this point, no new URLs will be added to the task queue

    def parse_second(self, url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        mydict = {}
        title = soup.find('span', property='v:itemreviewed')
        mydict['title'] = title.text if title else None
        duration = soup.find('span', property='v:runtime')
        mydict['duration'] = duration.text if duration else None
        time = soup.find('span', property='v:initialReleaseDate')
        mydict['time'] = time.text if time else None
        yield mydict


if __name__ == '__main__':
    douban = DouBan()
    douban.run()
As you can see, the usage is very similar to scrapy:

- Subclass the framework class and write only the parse functions (since this is a simple framework, you still have to issue the requests yourself); a minimal example subclass is sketched right after this list
- Use yield to return either data or a new request together with its callback
- Multithreading is automatic (scrapy uses asynchronous IO instead)
- Running it is the same: just call run()
- Whether to save results to a file, etc., is configurable, though extensibility (databases and the like) has not been considered
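To illustrate how little a subclass actually needs, here is a minimal sketch. It relies on the framework's default start_requests, which pairs start_url with a method named parse; the target site, tag names, and field name are assumptions made up for this example, not part of the article's code.

import requests
from bs4 import BeautifulSoup
# Spider is the framework class defined in the full listing later in this article

class QuotesSpider(Spider):

    def __init__(self):
        super(QuotesSpider, self).__init__()
        self.start_url = 'http://quotes.toscrape.com/'  # hypothetical target for illustration
        self.filename = 'quotes.json'

    def parse(self, url):  # picked up by the framework's default start_requests
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        for quote in soup.find_all('span', class_='text'):
            yield {'quote': quote.text}  # dicts are collected as data
        self.running = False             # only one page, so stop feeding the queue


if __name__ == '__main__':
    QuotesSpider().run()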
Now let's look at how it is implemented.
Compare the following two versions: one is the approach from the previous article, the other has been modified to abstract some functionality out so it is easier to extend.
Please follow the link to the previous article for the earlier version; the modified version is below.
import requests
import time
import threading
from queue import Queue, Empty
import json
from bs4 import BeautifulSoup


def run_time(func):
    def wrapper(*args, **kw):
        start = time.time()
        func(*args, **kw)
        end = time.time()
        print('running', end-start, 's')
    return wrapper


class Spider():

    def __init__(self):
        self.start_url = 'https://movie.douban.com/top250'
        self.qtasks = Queue()
        self.data = list()
        self.thread_num = 5
        self.running = True

    def start_requests(self):
        yield (self.start_url, self.parse_first)

    def parse_first(self, url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')

        movies = soup.find_all('div', class_='info')[:5]
        for movie in movies:
            url = movie.find('div', class_='hd').a['href']
            yield (url, self.parse_second)

        nextpage = soup.find('span', class_='next').a
        if nextpage:
            nexturl = self.start_url + nextpage['href']
            yield (nexturl, self.parse_first)
        else:
            self.running = False

    def parse_second(self, url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        mydict = {}
        title = soup.find('span', property='v:itemreviewed')
        mydict['title'] = title.text if title else None
        duration = soup.find('span', property='v:runtime')
        mydict['duration'] = duration.text if duration else None
        time = soup.find('span', property='v:initialReleaseDate')
        mydict['time'] = time.text if time else None
        yield mydict

    def start_req(self):
        for task in self.start_requests():
            self.qtasks.put(task)

    def parses(self):
        while self.running or not self.qtasks.empty():
            try:
                url, func = self.qtasks.get(timeout=3)
                print('crawling', url)
                for task in func(url):
                    if isinstance(task, tuple):
                        self.qtasks.put(task)
                    elif isinstance(task, dict):
                        self.data.append(task)
                    else:
                        raise TypeError('parse functions have to yield url-function tuple or data dict')
            except Empty:
                print('{}: Timeout occurred'.format(threading.current_thread().name))
        print(threading.current_thread().name, 'finished')

    @run_time
    def run(self, filename=False):
        ths = []
        th1 = threading.Thread(target=self.start_req)
        th1.start()
        ths.append(th1)
        for _ in range(self.thread_num):
            th = threading.Thread(target=self.parses)
            th.start()
            ths.append(th)
        for th in ths:
            th.join()

        if filename:
            s = json.dumps(self.data, ensure_ascii=False, indent=4)
            with open(filename, 'w', encoding='utf-8') as f:
                f.write(s)
        print('Data crawling is finished.')


if __name__ == '__main__':
    Spider().run(filename='frame.json')
The main ideas behind the changes are:
- When writing parse functions we want to yield, as in scrapy, the URLs to crawl together with their corresponding parse functions. So there is a queue of (URL, parse function) tuples; worker threads keep taking elements from this queue and parse each URL with its function, and this consumption runs on multiple threads.
- yield can return two kinds of values: a (URL, parse function) tuple, or a dict (the data we actually want); a type check decides where each goes. The tuple queue is continuously consumed and replenished, while the dicts only accumulate in the result list, which is written out to a file at the end.
- queue.get is called with a timeout argument and the Empty exception is handled, which guarantees that every thread can terminate; a small standalone sketch of this pattern follows the list.
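The third point is the one most likely to trip people up, so here is a tiny sketch of the pattern on its own, with no crawling involved (the tasks are just integers made up for the demo). Workers block on get for at most three seconds; once the producer is done and the queue drains, each worker falls out of the loop instead of waiting forever.

import threading
from queue import Queue, Empty

q = Queue()
running = True  # flipped to False once no more tasks will be produced

def worker():
    while running or not q.empty():
        try:
            task = q.get(timeout=3)  # without a timeout, get() would block forever
        except Empty:
            continue                 # woke up empty-handed; re-check the loop condition
        print(threading.current_thread().name, 'processed', task)
    print(threading.current_thread().name, 'finished')

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

for i in range(10):  # the producer side
    q.put(i)
running = False      # signal that production has stopped

for t in threads:
    t.join()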
There is no special trick here and not much more to explain; readers can copy both versions into a text file and compare them side by side.
The framework is then obtained from this second version by pulling out the generic machinery, leaving the user to define only the parts that are specific to each crawler. The complete code is below (the code at the beginning of this article is the second half of this listing).
import requests
import time
import threading
from queue import Queue, Empty
import json
from bs4 import BeautifulSoup


def run_time(func):
    def wrapper(*args, **kw):
        start = time.time()
        func(*args, **kw)
        end = time.time()
        print('running', end-start, 's')
    return wrapper


class Spider():

    def __init__(self):
        self.qtasks = Queue()
        self.data = list()
        self.thread_num = 5
        self.running = True
        self.filename = False
        self.output_result = True

    def start_requests(self):
        yield (self.start_url, self.parse)

    def start_req(self):
        for task in self.start_requests():
            self.qtasks.put(task)

    def parses(self):
        while self.running or not self.qtasks.empty():
            try:
                url, func = self.qtasks.get(timeout=3)
                print('crawling', url)
                for task in func(url):
                    if isinstance(task, tuple):
                        self.qtasks.put(task)
                    elif isinstance(task, dict):
                        if self.output_result:
                            print(task)
                        self.data.append(task)
                    else:
                        raise TypeError('parse functions have to yield url-function tuple or data dict')
            except Empty:
                print('{}: Timeout occurred'.format(threading.current_thread().name))
        print(threading.current_thread().name, 'finished')

    @run_time
    def run(self):
        ths = []
        th1 = threading.Thread(target=self.start_req)
        th1.start()
        ths.append(th1)
        for _ in range(self.thread_num):
            th = threading.Thread(target=self.parses)
            th.start()
            ths.append(th)
        for th in ths:
            th.join()

        if self.filename:
            s = json.dumps(self.data, ensure_ascii=False, indent=4)
            with open(self.filename, 'w', encoding='utf-8') as f:
                f.write(s)
        print('Data crawling is finished.')


class DouBan(Spider):

    def __init__(self):
        super(DouBan, self).__init__()
        self.start_url = 'https://movie.douban.com/top250'
        self.filename = 'douban.json'  # override the default
        self.output_result = False
        self.thread_num = 10

    def start_requests(self):  # override the default method
        yield (self.start_url, self.parse_first)

    def parse_first(self, url):  # just yield the URLs to crawl and their callbacks
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')

        movies = soup.find_all('div', class_='info')[:5]
        for movie in movies:
            url = movie.find('div', class_='hd').a['href']
            yield (url, self.parse_second)

        nextpage = soup.find('span', class_='next').a
        if nextpage:
            nexturl = self.start_url + nextpage['href']
            yield (nexturl, self.parse_first)
        else:
            self.running = False  # once we reach this point, no new URLs will be added to the task queue

    def parse_second(self, url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        mydict = {}
        title = soup.find('span', property='v:itemreviewed')
        mydict['title'] = title.text if title else None
        duration = soup.find('span', property='v:runtime')
        mydict['duration'] = duration.text if duration else None
        time = soup.find('span', property='v:initialReleaseDate')
        mydict['time'] = time.text if time else None
        yield mydict


if __name__ == '__main__':
    douban = DouBan()
    douban.run()
With the code split this way, we only have to write the second half, caring solely about parsing the pages and not about how the multithreading is implemented.
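As for the extensibility the framework does not yet address, one simple direction is to post-process self.data after run() returns, since every yielded dict ends up there. Below is a rough sketch, not part of the framework itself, assuming the standard-library sqlite3 module and the DouBan spider above; the database file, table name, and columns are made up for the example.

import sqlite3

douban = DouBan()
douban.run()

# after run() returns, douban.data holds all the dicts yielded by parse_second
conn = sqlite3.connect('douban.db')
conn.execute('CREATE TABLE IF NOT EXISTS movies (title TEXT, duration TEXT, time TEXT)')
conn.executemany(
    'INSERT INTO movies VALUES (?, ?, ?)',
    [(d.get('title'), d.get('duration'), d.get('time')) for d in douban.data]
)
conn.commit()
conn.close()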
You are welcome to follow my Zhihu column, Python Programming, where you can find the column's table of contents and notes on the software and package versions used.