Python web crawler -- hands-on project -- embedding Selenium in Scrapy: scraping comments from Xinpianchang (6)

Posted by 太原浪子 on 2020-10-23

1. Goal

Scrape the comments on a Xinpianchang film detail page.

2. Analysis

2.1 Page analysis

Inspection shows that the comments on this page are loaded dynamically by JavaScript, so a plain Scrapy download will not contain them. We therefore use Selenium to render the page. This installment only extracts the data and does not store it.
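To see why a plain HTTP download is not enough, compare a server response with the DOM after the page's JavaScript has run: the comment text only exists in the latter. The two HTML snippets below are illustrative stand-ins, not the site's real markup:

```python
# Illustrative only: the raw HTML a plain HTTP request would receive.
# The comment list is an empty placeholder that JavaScript fills in later.
static_html = """
<ul class="comment-list"></ul>
<script src="/js/load_comments.js"></script>
"""

# Illustrative only: the DOM after the browser has executed the page's
# JavaScript -- roughly what Selenium's driver.page_source would return.
rendered_html = """
<ul class="comment-list">
  <li><div><div><i class="text">Great short film!</i></div></div></li>
</ul>
"""

print("comment in static HTML:  ", "Great short film!" in static_html)
print("comment in rendered HTML:", "Great short film!" in rendered_html)
```

This is exactly the gap Selenium closes: the spider's XPath can only match nodes that are present in the response body it receives.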

3. Complete code

xpc.py

import scrapy


class XpcSpider(scrapy.Spider):
    name = 'xpc'
    allowed_domains = ['www.xinpianchang.com']
    start_urls = ['https://www.xinpianchang.com/a10975710?from=ArticleList']

    def parse(self, response):
        # Extract the text of every comment in the rendered comment list
        results = response.xpath("//ul[contains(@class, 'comment-list')]/li/div/div/i[@class='text']/text()").extract()
        print(results)

middlewares.py

In this file, only the process_request method needs to be changed (plus adding the imports it relies on).

from time import sleep

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.webdriver.chrome.webdriver import WebDriver

# Adjust this import path to match your project's package layout
from .spiders.xpc import XpcSpider


class ScrapyadvancedDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        if isinstance(spider, XpcSpider):
            # This is also a convenient place to add a random UA, cookies, or a proxy
            print("Intercepted request:", request.url)

            # Fetch the page with Chrome so the JavaScript-rendered comments load
            driver = WebDriver()
            driver.get(request.url)
            sleep(2)  # crude wait; give the comments time to render
            # Grab the rendered DOM, then release the browser
            content = driver.page_source
            driver.quit()

            # Wrap the rendered HTML in a Response and short-circuit the download
            return HtmlResponse(request.url, body=content, encoding="utf-8")

        return None
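The control flow of process_request can be exercised without a browser by stubbing out the driver. In the sketch below, FakeDriver, DummyResponse, and fetch_rendered are all hypothetical stand-ins for Selenium's WebDriver, Scrapy's HtmlResponse, and the middleware body, used only to show the drive-wait-wrap shape:

```python
from time import sleep


class FakeDriver:
    """Stand-in for Selenium's WebDriver (hypothetical, for illustration)."""

    def get(self, url):
        self.url = url

    @property
    def page_source(self):
        # Pretend the browser rendered the page's JavaScript
        return f"<html><body>rendered {self.url}</body></html>"

    def quit(self):
        pass


class DummyResponse:
    """Stand-in for scrapy.http.HtmlResponse."""

    def __init__(self, url, body):
        self.url, self.body = url, body


def fetch_rendered(url):
    # Same shape as the middleware: drive the browser, wait, wrap the DOM
    driver = FakeDriver()
    driver.get(url)
    sleep(0)  # placeholder for the real 2-second wait
    content = driver.page_source
    driver.quit()
    return DummyResponse(url, content.encode("utf-8"))


resp = fetch_rendered("https://www.xinpianchang.com/a10975710")
print(resp.body)
```

Returning a Response from process_request short-circuits Scrapy's downloader entirely, which is why the spider's parse method receives the Selenium-rendered HTML instead of the raw server response.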
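One step the post does not show: the middleware only runs if it is enabled in settings.py. Assuming the project package is named scrapyadvanced (inferred from the class name ScrapyadvancedDownloaderMiddleware; adjust the dotted path to your project), the entry would look roughly like this:

```python
# settings.py -- path assumes a project package named "scrapyadvanced"
DOWNLOADER_MIDDLEWARES = {
    "scrapyadvanced.middlewares.ScrapyadvancedDownloaderMiddleware": 543,
}
```

The number is the middleware's priority; 543 is the value Scrapy's project template uses by default for the generated downloader middleware.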
