"揭秘CentosChina爬蟲專案:掌握Scrapy框架的必備技巧與資料庫設計"

Posted by 存子 on 2024-08-08

Centoschina

Project Requirements

Crawl all questions on centoschina_cn, including each article's title and content.

Database Table Design

Table schema:

(screenshot: table schema)

Data preview:

(screenshot: sample of the scraped data)
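
Only as a rough sketch (the real design is in the screenshots above, and every table and column name below is an assumption, not the project's actual schema): a single MySQL table holding the question title and body, created here with pymysql, would be enough to satisfy the requirement.

    import pymysql

    # Hypothetical schema sketch -- not the project's actual table design
    DDL = """
    CREATE TABLE IF NOT EXISTS questions (
        id      INT AUTO_INCREMENT PRIMARY KEY,
        title   VARCHAR(255) NOT NULL,
        content TEXT
    ) DEFAULT CHARSET = utf8mb4;
    """

    # Connection values taken from the spider settings shown later in this post
    conn = pymysql.connect(host='localhost', user='root',
                           password='123456', database='questions')
    with conn.cursor() as cursor:
        cursor.execute(DDL)
    conn.commit()
    conn.close()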

Project Highlights

  • Low coupling, high cohesion.

    Spider-specific settings:

    # In the spider class: apply these settings to this spider only
    custom_settings = custom_settings_for_centoschina_cn

    # The spider-specific settings themselves
    custom_settings_for_centoschina_cn = {
        'MYSQL_USER': 'root',
        'MYSQL_PWD': '123456',
        'MYSQL_DB': 'questions',
    }
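
    Put together, a minimal sketch of how the pieces connect (the spider and pipeline names are illustrative; MysqlPipeline in particular is an assumed name, not necessarily the project's actual class):

    import scrapy

    class CentoschinaCnSpider(scrapy.Spider):
        name = 'centoschina_cn'
        # Only this spider runs with the settings above; other spiders in the
        # project keep whatever settings.py defines globally.
        custom_settings = custom_settings_for_centoschina_cn

    class MysqlPipeline:
        # Hypothetical pipeline: it reads the DB settings of whichever spider
        # is running, so it never hard-codes any single spider's credentials.
        def __init__(self, user, pwd, db):
            self.user, self.pwd, self.db = user, pwd, db

        @classmethod
        def from_crawler(cls, crawler):
            # crawler.settings already includes the spider's custom_settings
            return cls(
                user=crawler.settings.get('MYSQL_USER'),
                pwd=crawler.settings.get('MYSQL_PWD'),
                db=crawler.settings.get('MYSQL_DB'),
            )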
    
  • Using DownloaderMiddleware

    from scrapy import signals

    class CentoschinaDownloaderMiddleware:
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        # Handle outgoing requests
        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.

            # Must either:
            # - return None: continue processing this request (the default;
            #   the request is passed on unmodified)
            # - or return a Response object: hand a response straight back
            #   without hitting the downloader, e.g. when the page is fetched
            #   by an external renderer such as pyppeteer (pyppeteer has a
            #   Scrapy plugin and generally works alongside it; selenium does not)
            # - or return a Request object: push the request back onto the
            #   scheduler queue to be revisited later
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None

        # Handle responses
        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.

            # Must either:
            # - return a Response object: pass the response on
            # - return a Request object: when the result looks wrong (usually
            #   judged by status code and content size), push the request back
            #   onto the scheduler queue to be revisited later
            # - or raise IgnoreRequest
            return response

        # Handle exceptions such as timeouts
        def process_exception(self, request, exception, spider):
            # Called when a download handler or a process_request()
            # (from other downloader middleware) raises an exception.

            # Must either:
            # - return None: continue processing this exception (nothing
            #   special to do here)
            # - return a Response object: stops the process_exception() chain;
            #   the process_response() methods of the installed middlewares are
            #   called instead, and no other middleware's process_exception() runs
            # - return a Request object: stops the process_exception() chain and
            #   pushes the request back onto the scheduler queue to be revisited later
            pass

        def spider_opened(self, spider):
            spider.logger.info("Spider opened: %s" % spider.name)
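
    For any of the hooks above to run, the middleware must be enabled in settings.py. The module path below assumes the project package is named centoschina, and 543 is just the template's default priority:

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'centoschina.middlewares.CentoschinaDownloaderMiddleware': 543,
    }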
    
  • Dropping a request inside a DownloaderMiddleware

    • Use case: a request keeps failing and you want to switch proxies, rotate cookies, and so on
    # from scrapy.exceptions import IgnoreRequest
    # raise IgnoreRequest(f'Failed to retrieve {request.url} after {max_retries} retries')
    

    Example: handle a download exception and retry the request

    import logging
    from scrapy.exceptions import IgnoreRequest

    class RetryExceptionMiddleware:
        def __init__(self):
            self.logger = logging.getLogger(__name__)

        def process_exception(self, request, exception, spider):
            # Log the exception
            self.logger.warning(f'Exception {exception} occurred while processing {request.url}')

            # Check whether the retry limit has been reached
            max_retries = 3
            retries = request.meta.get('retry_times', 0) + 1

            if retries <= max_retries:
                self.logger.info(f'Retrying {request.url} (retry {retries}/{max_retries})')
                # Record the retry count and skip the dupefilter so the
                # rescheduled request is not dropped as a duplicate
                request.meta['retry_times'] = retries
                request.dont_filter = True
                return request
            else:
                self.logger.error(f'Failed to retrieve {request.url} after {max_retries} retries')
                raise IgnoreRequest(f'Failed to retrieve {request.url} after {max_retries} retries')
    
    

    Example: switch to a different proxy

    import logging
    import random

    class SwitchProxyMiddleware:
        def __init__(self, proxy_list):
            self.proxy_list = proxy_list
            self.logger = logging.getLogger(__name__)

        @classmethod
        def from_crawler(cls, crawler):
            proxy_list = crawler.settings.get('PROXY_LIST')
            return cls(proxy_list)

        def process_exception(self, request, exception, spider):
            self.logger.warning(f'Exception {exception} occurred while processing {request.url}')

            # Switch to a randomly chosen proxy from PROXY_LIST
            proxy = random.choice(self.proxy_list)
            self.logger.info(f'Switching proxy to {proxy}')
            request.meta['proxy'] = proxy

            # Retry the request (dont_filter keeps the dupefilter from dropping it)
            request.dont_filter = True
            return request
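
    PROXY_LIST is not a built-in Scrapy setting, so it has to be defined in settings.py (the addresses below are placeholders), and both middlewares above must be registered in DOWNLOADER_MIDDLEWARES just like the earlier one:

    # settings.py -- placeholder proxies, replace with real ones
    PROXY_LIST = [
        'http://127.0.0.1:7890',
        'http://127.0.0.1:7891',
    ]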
    
    
  • Dropping an item inside a pipeline

    • Use cases: data cleaning, deduplication, validation, and so on; a full example follows below
    # from scrapy.exceptions import DropItem
    # raise DropItem("Duplicate item found: %s" % item)
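
    For example, a minimal deduplication pipeline (the item field name title is an assumption here) could look like this; remember to enable it in ITEM_PIPELINES:

    from scrapy.exceptions import DropItem

    class DuplicatesPipeline:
        def __init__(self):
            self.seen_titles = set()

        def process_item(self, item, spider):
            # Drop any item whose title was already collected in this crawl
            if item['title'] in self.seen_titles:
                raise DropItem("Duplicate item found: %s" % item)
            self.seen_titles.add(item['title'])
            return item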
    
  • Saving results to a file (via a command)

    from scrapy.cmdline import execute
    execute(['scrapy', 'crawl', 'centoschina_cn', '-o', 'questions.csv'])
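
    This is equivalent to running scrapy crawl centoschina_cn -o questions.csv from the shell (-o appends to the file; -O overwrites it). The same export can also be configured declaratively through the FEEDS setting, e.g.:

    # settings.py or the spider's custom_settings -- the output path is arbitrary
    FEEDS = {
        'questions.csv': {'format': 'csv', 'encoding': 'utf8'},
    }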
    
