Scrapy + Chromium + Proxy + Selenium

Posted by liyinanCoder on 2019-02-16

Original URL: https://flycode.co/archives/79390

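Last week I covered the basics of Scrapy. This week I'll write about the pitfalls I ran into with proxies and JS rendering.

JS rendering

JS is one of the more troublesome parts of writing a crawler. The usual approach is to capture the traffic, inspect the request details, and then grab the data returned by the AJAX calls.
But when a page's JS rendering is especially complex, that approach becomes very tedious. So we use the selenium package to drive chromium and let it do the JS rendering.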

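Installation

Install selenium
Install chromium
Install the matching chromedriver (the chromium-chromedriver package in the Dockerfile below)

Tip: why chromium rather than chrome? I used to install chrome. But chrome also needs chromedriver, and many Linux distributions' package managers don't ship ready-made chrome or chromedriver packages. If you hunt them down yourself, it's easy to end up with chromedriver and chrome versions that don't match and won't work together.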

To spare ourselves the headaches of setting up the environment, we use Docker here.
Dockerfile

FROM alpine:3.8
COPY requirements.txt /tmp
RUN apk update \
    && apk add --no-cache xvfb python3 python3-dev curl libxml2-dev libxslt-dev libffi-dev gcc musl-dev \
    && apk add --no-cache libgcc openssl-dev chromium=68.0.3440.75-r0 libexif udev chromium-chromedriver=68.0.3440.75-r0 \
    && curl https://bootstrap.pypa.io/get-pip.py | python3 \
    && adduser -g chromegroup -D chrome \
    && pip3 install -r /tmp/requirements.txt && rm /tmp/requirements.txt
USER chrome

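Tip: there's another pitfall here. Neither chrome nor chromium will run as root by default, and doing so isn't safe anyway. So it's best to create a dedicated user to run it.

When using Docker, you need to add the --privileged flag to docker run.

If you need to know how to run chrome as root, see this blog post:
"Installing the Chrome browser on Ubuntu 16.04 and fixing the problem of root being unable to launch it"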

requirements.txt

Scrapy
selenium
Twisted
PyMysql
pyvirtualdisplay

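Put requirements.txt in the same directory as the Dockerfile,
then build the image from that directory with docker build -t "chromium-scrapy-image" .

As for why xvfb and pyvirtualdisplay are installed: chromium's headless mode can't handle proxies that require a username and password. More on that shortly.

On Red Hat and Debian, you can look in the package repositories for the latest chromium and the matching chromedriver, then download and install them. The versions must match! Here we use chromium=68.0.3440.75-r0 and chromium-chromedriver=68.0.3440.75-r0.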
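
Since a chromium/chromedriver mismatch is the failure this setup tries to avoid, a small check along these lines could be run before starting a crawl. This is only a sketch: the helper names major_version and versions_match are hypothetical, it assumes version strings shaped like the Alpine package versions above, and comparing only the major version is an assumption (a stricter setup may want an exact match).

```python
import re


def major_version(version: str) -> int:
    """Return the leading number of a version string such as '68.0.3440.75-r0'."""
    match = re.match(r"\s*(\d+)", version)
    if match is None:
        raise ValueError("no leading version number in %r" % version)
    return int(match.group(1))


def versions_match(browser_version: str, driver_version: str) -> bool:
    """True when both version strings share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


if __name__ == "__main__":
    print(versions_match("68.0.3440.75-r0", "68.0.3440.75-r0"))
```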

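Modifying `Scrapy`'s `Middleware`

With chromium in place, we make some changes in middlewares.py. The idea is to let chromium take over the requests Scrapy would otherwise make itself, so we modify the DownloaderMiddleware.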

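# DownloaderMiddleware
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class DemoDownloaderMiddleware(object):
    def __init__(self):
        chrome_options = webdriver.ChromeOptions()
        # Enable headless mode
        chrome_options.add_argument('--headless')
        # Disable the GPU
        chrome_options.add_argument('--disable-gpu')
        # Disable image loading
        chrome_options.add_argument('--blink-settings=imagesEnabled=false')
        # `options=` replaces the deprecated `chrome_options=` keyword (Selenium 3.8+)
        self.driver = webdriver.Chrome(options=chrome_options)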
        
    def __del__(self):
        self.driver.quit()
        
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
        
    def process_request(self, request, spider):
        # Let chromium handle the request
        # ...
        return HtmlResponse(url=request.url,
                            body=self.driver.page_source,
                            request=request,
                            encoding='utf-8',
                            status=200)
        
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Tip: we only have this one middleware handling requests, so all of the logic has to pass through it. That's why it returns the response directly.
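
The step elided in process_request presumably loads the page in the browser before page_source is read. A minimal sketch of that step, assuming the driver is a selenium WebDriver (only its get method and page_source attribute are used). The names fetch_rendered_html and FakeDriver are hypothetical; FakeDriver is just a stand-in so the snippet runs without a browser.

```python
def fetch_rendered_html(driver, url):
    """Load `url` in the browser and return the HTML of the resulting DOM.

    `driver` is expected to behave like a selenium WebDriver.
    """
    driver.get(url)            # navigate; JS on the page runs in the browser
    return driver.page_source  # serialized HTML of the current DOM


class FakeDriver:
    """Stand-in for a real WebDriver so the sketch runs without a browser."""

    def __init__(self):
        self.page_source = ""
        self.visited = []

    def get(self, url):
        self.visited.append(url)
        self.page_source = "<html><body>rendered %s</body></html>" % url


if __name__ == "__main__":
    print(fetch_rendered_html(FakeDriver(), "https://example.com"))
```

Content that arrives after the initial page load may need an explicit wait (for example selenium's WebDriverWait) before page_source is read.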

That takes care of installing selenium and chromium.

When `chromium` doesn't support `headless`

If your chromium version is too old to support headless, don't worry. This is where the xvfb and pyvirtualdisplay we installed earlier come in.

from pyvirtualdisplay import Display
...
# Before:
chrome_options.add_argument('--headless')

# After:
# chrome_options.add_argument('--headless')
self.display = Display(visible=0, size=(800, 800))
self.display.start()
...

# Before:
self.driver.quit()

# After:
self.driver.quit()
self.display.stop()
...

We've now simulated a display, so chromium can run on our server whether or not headless is enabled.

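Proxy

Since chromium has replaced Scrapy's own requests, the proxy can no longer be handled inside Scrapy.
We need chromium itself to handle the IP proxy.

This is how the proxy was set before we switched to chromium: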

import base64


class DemoProxyMiddleware(object):
    # overwrite process request

    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "https://proxy.com:8080"

        # Use the following lines if your proxy requires authentication

        proxy_user_pass = "username:password"
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode('utf-8'))

        # setup basic authentication for the proxy
        request.headers['Proxy-Authorization'] = 'Basic ' + str(encoded_user_pass, encoding="utf-8")

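If your proxy doesn't require a username and password, just delete the last three lines of code (the credentials, the encoding, and the header).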
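
The header value built in that middleware is plain HTTP Basic authentication: the word Basic followed by the base64 encoding of user:password. A standalone sketch of just that step (the function name basic_proxy_auth is hypothetical):

```python
import base64


def basic_proxy_auth(username: str, password: str) -> str:
    """Build the value for a Proxy-Authorization header (HTTP Basic scheme)."""
    raw = "%s:%s" % (username, password)
    token = base64.b64encode(raw.encode("utf-8")).decode("ascii")
    return "Basic " + token


if __name__ == "__main__":
    print(basic_proxy_auth("username", "password"))
```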

From the code above, it's not hard to guess how chromium handles a proxy.

chrome_options.add_argument('--proxy-server=http://proxy.com:8080')

Just add one more argument.
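
To keep the browser switches in one place, the argument list could be assembled by a small helper. This is a sketch under assumptions: build_chrome_args is a hypothetical helper, the first three switches are the ones used in the middleware earlier, and --proxy-server is Chromium's switch for a fixed proxy without credentials.

```python
def build_chrome_args(proxy=None, headless=True):
    """Return a list of Chromium command-line switches.

    `proxy` is a 'host:port' string for a proxy that needs no credentials.
    """
    args = ["--disable-gpu", "--blink-settings=imagesEnabled=false"]
    if headless:
        args.insert(0, "--headless")
    if proxy:
        args.append("--proxy-server=%s" % proxy)
    return args


if __name__ == "__main__":
    for arg in build_chrome_args(proxy="proxy.com:8080"):
        print(arg)
```

Each string in the returned list would then be passed to chrome_options.add_argument.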

What about a proxy that requires a username and password?

Handling authenticated proxies in `chromium`

First create a py file:

import string
import zipfile


def create_proxyauth_extension(proxy_host, proxy_port,
                               proxy_username, proxy_password,
                               scheme='http', plugin_path=None):
    """Build a Chrome extension that handles proxy authentication.

    args:
        proxy_host (str): proxy address or domain name
        proxy_port (int): proxy port number
        proxy_username (str): username
        proxy_password (str): password
    kwargs:
        scheme (str): proxy scheme, defaults to http
        plugin_path (str): absolute path of the extension file

    return str -> plugin_path
    """

    if plugin_path is None:
        plugin_path = 'vimm_chrome_proxyauth_plugin.zip'

    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Chrome Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {
            "scripts": ["background.js"]
        },
        "minimum_chrome_version":"22.0.0"
    }
    """

    background_js = string.Template(
        """
        var config = {
                mode: "fixed_servers",
                rules: {
                  singleProxy: {
                    scheme: "${scheme}",
                    host: "${host}",
                    port: parseInt(${port})
                  },
                  bypassList: ["foobar.com"]
                }
              };
    
        chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
    
        function callbackFn(details) {
            return {
                authCredentials: {
                    username: "${username}",
                    password: "${password}"
                }
            };
        }
    
        chrome.webRequest.onAuthRequired.addListener(
                    callbackFn,
                    {urls: ["<all_urls>"]},
                    ['blocking']
        );
        """
    ).substitute(
        host=proxy_host,
        port=proxy_port,
        username=proxy_username,
        password=proxy_password,
        scheme=scheme,
    )
    with zipfile.ZipFile(plugin_path, 'w') as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)

    return plugin_path

Usage

    proxyauth_plugin_path = create_proxyauth_extension(
        proxy_host="host",
        proxy_port=port,
        proxy_username="user",
        proxy_password="pwd")
    chrome_options.add_extension(proxyauth_plugin_path)

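That completes the proxy setup for chromium. However, if headless mode is on, this approach fails with an error, so the fix is to turn headless mode off.
As for running chromium without a GUI, as mentioned earlier, xvfb and pyvirtualdisplay take care of that.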
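
One way to sanity-check this kind of extension is to build the archive in memory and read it back. The sketch below uses the same two standard-library pieces as the function above (string.Template and zipfile) with a deliberately shortened background script; render_background_js and pack_extension are hypothetical names. string.Template is convenient here because the literal braces in the JavaScript don't clash with its ${...} placeholders.

```python
import io
import string
import zipfile

BACKGROUND_TEMPLATE = string.Template(
    'var config = {mode: "fixed_servers", rules: {singleProxy: '
    '{scheme: "${scheme}", host: "${host}", port: parseInt(${port})}}};'
)


def render_background_js(host, port, scheme="http"):
    """Fill the ${...} placeholders; literal JS braces are left untouched."""
    return BACKGROUND_TEMPLATE.substitute(host=host, port=port, scheme=scheme)


def pack_extension(files):
    """Zip a {filename: text} mapping and return the archive as bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()


if __name__ == "__main__":
    data = pack_extension({"background.js": render_background_js("proxy.com", 8080)})
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        print(archive.namelist())
        print(archive.read("background.js").decode("utf-8"))
```

Reading the archive back this way confirms that the host, port, and scheme were substituted before the file is handed to the browser.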
