Notes on scrapy-redis: overriding make_request_from_data and make_requests_from_url

Published by python實驗室 on 2020-12-27


This started when I recently crawled products from an e-commerce site. I used scrapy-redis so the crawl could be stopped and resumed at any time. The stand-alone version of Scrapy drives the crawl with the start_requests method, but my start_url is stored on a Redis server and the first URL has to be received from Redis, so start_requests is not a good fit.

After some searching and drawing on others' experience, I overrode make_request_from_data and make_requests_from_url to receive the start_url from Redis.

First, a look at the machines I used (a Raspberry Pi and a PC). Redis turned out to be a pleasure to work with: I no longer worry about headaches like having to re-crawl from scratch. I can disconnect while the Raspberry Pi keeps running and saves items locally; once the crawl finishes, I move the data from Redis into MySQL.
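As a concrete illustration, here is how a start URL could be seeded into Redis before launching the spider. The key name matches the spider's redis_key below; the sample URL and the JSON payload shape are assumptions, based on the json.loads call in the spider's make_request_from_data.

```python
import json

# Hypothetical seed URL; the spider in this post expects a JSON message
# with a "url" field (it calls json.loads on each message from Redis).
payload = json.dumps({"url": "https://example.com/list?page=1"})
print(payload)

# Pushing it onto the spider's redis_key requires a running Redis server,
# e.g. with the redis-py package:
#   import redis
#   redis.Redis(host="localhost", port=6379).lpush("computer:start_urls", payload)
```

The same thing can be done from the command line: `redis-cli lpush computer:start_urls '{"url": "https://example.com/list?page=1"}'`.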


Key scrapy-redis source code

First, your spider class must inherit from RedisSpider.

```python
import copy
import json

import scrapy
from scrapy_redis.spiders import RedisSpider


class YourSpider(RedisSpider):  # substitute your own spider class name
    redis_key = "computer:start_urls"
    # ...the code is not complete; fill in the rest yourself

    def make_request_from_data(self, data):
        data = json.loads(data)
        url = data.get('url')
        print(url)
        return self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        '''Prepare to crawl the first listing page.
        '''
        # Page number; 30 items per page
        page = 1
        # Crawl ranked by sales volume
        keyword = ['聯想(Lenovo)']
        meta = {"keyword": keyword[0], "page": page}
        req_headers = copy.deepcopy(self.headers)
        req_headers["Referer"] = url
        return scrapy.Request(url, headers=req_headers, callback=self.pagination_parse,
                              meta=meta, dont_filter=True)
```
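For the stop-and-resume behavior described above to work, the project's settings.py also needs scrapy-redis wired in. A minimal sketch, assuming a local Redis server; these are typical scrapy-redis settings, not taken from the original post:

```python
# settings.py fragment (assumed typical values; adjust REDIS_URL as needed)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"          # queue requests in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # dedupe in Redis
SCHEDULER_PERSIST = True   # keep the queue and dupefilter across restarts
REDIS_URL = "redis://127.0.0.1:6379"
```

With SCHEDULER_PERSIST enabled, stopping the spider leaves the pending requests and the seen-request fingerprints in Redis, so restarting picks up where the crawl left off.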

Now let's look at the key scrapy-redis source to see how the URL is fetched from Redis. The RedisSpider we inherited above itself inherits from two classes, RedisMixin and Spider, and those are the only ones we need here.

RedisSpider itself contains just one method; the details are in RedisMixin and Spider.

```python
class RedisSpider(RedisMixin, Spider):
    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        obj = super(RedisSpider, self).from_crawler(crawler, *args, **kwargs)
        obj.setup_redis(crawler)
        return obj


# The Spider class
class Spider(object_ref):
    def start_requests(self):
        cls = self.__class__
        if not self.start_urls and hasattr(self, 'start_url'):
            raise AttributeError(
                "Crawling could not start: 'start_urls' not found "
                "or empty (but found 'start_url' attribute instead, "
                "did you miss an 's'?)")
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

    # Base your own start_requests on this method; the Request and its
    # arguments are entirely up to you
    def make_requests_from_url(self, url):
        """ This method is deprecated. """
        return Request(url, dont_filter=True)


# RedisMixin
class RedisMixin(object):
    def start_requests(self):
        # Simply returns a batch of start requests here
        """Returns a batch of start requests from redis."""
        return self.next_requests()

    def next_requests(self):
        """Returns a request to be scheduled or none."""
        use_set = self.settings.getbool('REDIS_START_URLS_AS_SET',
                                        defaults.START_URLS_AS_SET)
        fetch_one = self.server.spop if use_set else self.server.lpop
        # XXX: Do we need to use a timeout here?
        found = 0
        # TODO: Use redis pipeline execution.
        while found < self.redis_batch_size:
            # The key part: redis_key holds the start_url that was pushed
            # onto the Redis server from outside
            data = fetch_one(self.redis_key)
            if not data:
                # Queue empty.
                break
            # The crucial call: this is where the overridden
            # make_request_from_data kicks in; data contains the start_url,
            # and the resulting request is returned directly
            req = self.make_request_from_data(data)
            if req:
                yield req
                found += 1
            else:
                self.logger.debug("Request not made from data: %r", data)
        if found:
            self.logger.debug("Read %s requests from '%s'", found, self.redis_key)

    def make_request_from_data(self, data):
        """Returns a Request instance from data coming from Redis.

        By default, ``data`` is an encoded URL. You can override this method to
        provide your own message decoding.

        Parameters
        ----------
        data : bytes
            Message from redis.
        """
        # The final piece: just decode the start_url here.
        url = bytes_to_str(data, self.redis_encoding)
        return self.make_requests_from_url(url)
```
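To see the fetch-and-decode flow of next_requests and the overridden make_request_from_data without a live Redis server, here is a minimal sketch in which a plain Python list stands in for the Redis list; all names and the sample URL are illustrative, not scrapy-redis API:

```python
import json

# A plain list standing in for the Redis list behind redis_key.
queue = [b'{"url": "https://example.com/list?page=1"}']

def fetch_one(queue):
    """Mimics server.lpop: pop the next message, or return None when empty."""
    return queue.pop(0) if queue else None

def make_request_from_data(data):
    """Mimics the overridden method: decode the JSON message, extract the URL."""
    obj = json.loads(data)
    return obj.get("url")

data = fetch_one(queue)          # next_requests() pops one message...
url = make_request_from_data(data)  # ...and hands it to make_request_from_data
print(url)  # the URL that would then be wrapped in a scrapy.Request
```

In the real spider the returned value is a scrapy.Request (built by make_requests_from_url) rather than a bare URL, and fetch_one is the bound Redis lpop or spop, but the control flow is the same.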
