1. The Scrapy Framework
Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it has a very wide range of uses.
That is the power of a framework: users only need to customize a few modules to easily build a spider that scrapes web page content and all kinds of images.
Scrapy uses the Twisted asynchronous networking framework to handle network communication, which speeds up downloads without requiring us to implement the asynchronous machinery ourselves, and it provides a variety of middleware interfaces so that all kinds of requirements can be met flexibly.
Scrapy Engine: responsible for the communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
Scheduler: accepts Request objects sent over by the Engine, arranges and enqueues them in a certain order, and returns them to the Engine when the Engine asks for them.
Downloader: downloads all Requests sent by the Scrapy Engine and hands the Responses it obtains back to the Engine, which passes them to the Spider for processing.
Spider: processes all Responses, parses and extracts data from them to fill the fields an Item needs, and submits any URLs that need to be followed back to the Engine, where they enter the Scheduler again.
Item Pipeline: the place where Items extracted by the Spider are post-processed (detailed analysis, filtering, storage, and so on).
Downloader Middlewares: can be regarded as a component for customizing and extending the download functionality.
Spider Middlewares: can be understood as a component for customizing and extending the communication between the Engine and the Spider (for example, Responses entering the Spider and Requests leaving it).
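To make the data flow between these components concrete, here is a minimal spider sketch; the spider name, start URL, and CSS selectors are illustrative assumptions rather than part of the original article:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'                                  # illustrative name
    start_urls = ['http://quotes.toscrape.com/']     # public Scrapy demo site

    def parse(self, response):
        # the Engine hands each Response here; yielded dicts flow to the Item Pipeline
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # follow-up URLs go back to the Engine, which enqueues them in the Scheduler
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)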
2. Puppeteer Rendering
Puppeteer is a Node.js package released by the Chrome development team in 2017 for driving a real Chrome browser programmatically.
To crawl HTML pages rendered by JavaScript, we need a browser to execute the JS and produce the final HTML. In Scrapy, we can use pyppeteer, an unofficial Python port of Puppeteer, to do exactly that.
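Standalone, pyppeteer's core flow is just a few lines, sketched below under the assumption of a placeholder URL: launch a headless Chromium, navigate, and read back the HTML after the JavaScript has run:

import asyncio
import pyppeteer

async def fetch_rendered_html(url):
    browser = await pyppeteer.launch(headless=True)
    page = await browser.newPage()
    await page.goto(url)          # Chromium fetches the page and executes its JS
    html = await page.content()   # HTML after rendering, not the raw source
    await browser.close()
    return html

# placeholder URL, for illustration only
html = asyncio.get_event_loop().run_until_complete(
    fetch_rendered_html('http://example.com'))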
Full code: scrapy-pyppeteer.zip
First, create a middlewares.py file in the new project (./project_name/middlewares.py):
import websockets
from scrapy.http import HtmlResponse
from logging import getLogger
import asyncio
import pyppeteer
import logging
from concurrent.futures._base import TimeoutError
# pyppeteer raises its own TimeoutError on goto() timeouts; catch it as well
from pyppeteer.errors import TimeoutError as PyppeteerTimeoutError
import base64
import sys
import random

# quiet the very chatty websockets/pyppeteer loggers
pyppeteer_level = logging.WARNING
logging.getLogger('websockets.protocol').setLevel(pyppeteer_level)
logging.getLogger('pyppeteer').setLevel(pyppeteer_level)

PY3 = sys.version_info[0] >= 3
def base64ify(bytes_or_str):
    # encode to base64; return str on Python 3, bytes on Python 2
    if PY3 and isinstance(bytes_or_str, str):
        input_bytes = bytes_or_str.encode('utf8')
    else:
        input_bytes = bytes_or_str
    output_bytes = base64.urlsafe_b64encode(input_bytes)
    if PY3:
        return output_bytes.decode('ascii')
    else:
        return output_bytes
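# illustrative check (not in the original article): base64ify('username:password')
# returns 'dXNlcm5hbWU6cGFzc3dvcmQ=', the value that follows 'Basic ' in the
# Proxy-Authorization header built below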
class ProxyMiddleware(object):
    # one User-Agent per line; strip newlines so header values stay clean
    USER_AGENT = [ua.strip() for ua in open('useragents.txt') if ua.strip()]

    def process_request(self, request, spider):
        # proxy server
        proxyHost = "t.16yun.cn"
        proxyPort = "31111"
        # proxy tunnel credentials
        proxyUser = "username"
        proxyPass = "password"
        request.meta['proxy'] = "http://{0}:{1}".format(proxyHost, proxyPort)
        # add the authentication header
        encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
        # set the IP-switching header (adjust to your needs)
        tunnel = random.randint(1, 10000)
        request.headers['Proxy-Tunnel'] = str(tunnel)
        request.headers['User-Agent'] = random.choice(self.USER_AGENT)
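# note (assumption, not from the original article): useragents.txt is expected to
# contain one complete User-Agent string per line, for example:
#   Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36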
class PyppeteerMiddleware(object):
    def __init__(self, **args):
        """
        init logger, loop, browser
        :param args:
        """
        self.logger = getLogger(__name__)
        self.loop = asyncio.get_event_loop()
        self.browser = self.loop.run_until_complete(
            pyppeteer.launch(headless=True))
        self.args = args

    def __del__(self):
        """
        close loop
        :return:
        """
        self.loop.close()
    def render(self, url, retries=1, script=None, wait=0.3, scrolldown=False, sleep=0,
               timeout=8.0, keep_page=False):
        """
        render page with pyppeteer
        :param url: page url
        :param retries: max retry times
        :param script: js script to evaluate
        :param wait: number of seconds to wait before loading the page, preventing timeouts
        :param scrolldown: how many times to page down
        :param sleep: how many seconds to sleep after the initial render
        :param timeout: the longest wait time, otherwise raise a timeout error
        :param keep_page: if True, do not close the page after rendering
        :return: content, result, status
        """
        # define async render
        async def async_render(url, script, scrolldown, sleep, wait, timeout, keep_page):
            page = None  # guard so the finally block works even if newPage() fails
            try:
                # basic render
                page = await self.browser.newPage()
                await asyncio.sleep(wait)
                response = await page.goto(url, options={'timeout': int(timeout * 1000)})
                if response.status != 200:
                    return None, None, response.status
                result = None
                # evaluate with script
                if script:
                    result = await page.evaluate(script)
                # scroll down for {scrolldown} times
                if scrolldown:
                    for _ in range(scrolldown):
                        await page._keyboard.down('PageDown')
                        await asyncio.sleep(sleep)
                else:
                    await asyncio.sleep(sleep)
                if scrolldown:
                    await page._keyboard.up('PageDown')
                # get html of page
                content = await page.content()
                return content, result, response.status
            except (TimeoutError, PyppeteerTimeoutError):
                return None, None, 500
            finally:
                # if keep_page is set, leave the page open for reuse
                if not keep_page and page is not None:
                    await page.close()
        content, result, status = [None] * 3
        # retry for {retries} times
        for i in range(retries):
            if not content:
                content, result, status = self.loop.run_until_complete(
                    async_render(url=url, script=script, sleep=sleep, wait=wait,
                                 scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
            else:
                break
        # return the html, the js evaluation result, and the http status
        return content, result, status
    def process_request(self, request, spider):
        """
        render the page with pyppeteer when the request carries meta['render']
        :param request: request object
        :param spider: spider object
        :return: HtmlResponse
        """
        if request.meta.get('render'):
            try:
                self.logger.debug('rendering %s', request.url)
                html, result, status = self.render(request.url)
                return HtmlResponse(url=request.url, body=html, request=request,
                                    encoding='utf-8', status=status)
            except websockets.exceptions.ConnectionClosed:
                pass

    @classmethod
    def from_crawler(cls, crawler):
        # read optional keyword arguments from the PYPPETEER_ARGS setting
        return cls(**crawler.settings.get('PYPPETEER_ARGS', {}))
Then modify the project settings file (./project_name/settings.py) to register both middlewares:
DOWNLOADER_MIDDLEWARES = {
    'scrapypyppeteer.middlewares.PyppeteerMiddleware': 543,
    'scrapypyppeteer.middlewares.ProxyMiddleware': 100,
}
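PyppeteerMiddleware only renders requests whose meta carries render=True, so the spider has to set that flag explicitly. A minimal sketch (spider name and URL are illustrative assumptions):

import scrapy

class RenderSpider(scrapy.Spider):
    name = 'render_demo'   # illustrative name

    def start_requests(self):
        # meta={'render': True} routes this request through PyppeteerMiddleware
        yield scrapy.Request('http://quotes.toscrape.com/js/',
                             meta={'render': True}, callback=self.parse)

    def parse(self, response):
        # response.body now holds the post-JavaScript html
        self.logger.info('rendered %s: %d bytes', response.url, len(response.body))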
Then we run the crawler.
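For example, with the illustrative spider name from the sketch above:

scrapy crawl render_demo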
This work is published under a CC license; when reposting, you must credit the author and link to the original article.