Centoschina
Project requirements
Crawl all of the questions on centoschina_cn, including each article's title and content.
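A minimal sketch of what such a spider could look like. The start URL, CSS selectors, and field names below are illustrative assumptions, not the project's actual code; only the spider name `centoschina_cn` comes from the project itself.

```python
import scrapy


class CentoschinaCnSpider(scrapy.Spider):
    name = 'centoschina_cn'
    # Hypothetical listing page; the real site may be organized or paginated differently.
    start_urls = ['https://www.centoschina.cn/question']

    def parse(self, response):
        # Follow every question link found on the listing page (selector is assumed).
        for href in response.css('a.question-title::attr(href)').getall():
            yield response.follow(href, callback=self.parse_question)

    def parse_question(self, response):
        # Extract the question's title and body (selectors are assumed).
        yield {
            'title': response.css('h1::text').get(default='').strip(),
            'content': ''.join(response.css('div.content *::text').getall()).strip(),
        }
```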
Database table design
Table schema:
Sample data:
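A plausible sketch of the questions table in MySQL, assuming a single table with title and content columns; the table name, column names, and types are assumptions, while the credentials and database name match the spider settings shown below.

```python
import pymysql

# Sketch only: table name, column names and types are assumptions.
conn = pymysql.connect(host='localhost', user='root', password='123456', database='questions')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS questions (
            id      INT AUTO_INCREMENT PRIMARY KEY,
            title   VARCHAR(255) NOT NULL,
            content TEXT
        ) DEFAULT CHARSET = utf8mb4
    """)
conn.commit()
conn.close()
```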
Project highlights
- Low coupling, high cohesion. Each spider carries its own dedicated settings:

```python
custom_settings = custom_settings_for_centoschina_cn
```

```python
custom_settings_for_centoschina_cn = {
    'MYSQL_USER': 'root',
    'MYSQL_PWD': '123456',
    'MYSQL_DB': 'questions',
}
```
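Because `custom_settings` is merged into the crawler settings for that spider only, a pipeline can pick these values up in `from_crawler`. A minimal sketch, assuming a hypothetical `MysqlPipeline` backed by pymysql rather than the project's actual pipeline:

```python
import pymysql


class MysqlPipeline:
    def __init__(self, user, password, db):
        self.user = user
        self.password = password
        self.db = db

    @classmethod
    def from_crawler(cls, crawler):
        # The spider-specific custom_settings are visible through crawler.settings.
        return cls(
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PWD'),
            db=crawler.settings.get('MYSQL_DB'),
        )

    def open_spider(self, spider):
        self.conn = pymysql.connect(host='localhost', user=self.user,
                                    password=self.password, database=self.db)

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # Column names follow the assumed questions table above.
        with self.conn.cursor() as cursor:
            cursor.execute('INSERT INTO questions (title, content) VALUES (%s, %s)',
                           (item.get('title'), item.get('content')))
        self.conn.commit()
        return item
```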
- Using a DownloaderMiddleware:

```python
from scrapy import signals


class CentoschinaDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    # Handle outgoing requests
    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        # Must either:
        # - return None: continue processing this request; this is the default,
        #   and the request moves on to the next step untouched
        # - return a Response object: return a response directly without hitting the
        #   network, e.g. when pairing Scrapy with pyppeteer (pyppeteer has a plugin
        #   and generally works alongside Scrapy; Selenium has no such plugin)
        # - return a Request object: send the request back to the scheduler's queue
        #   to be revisited later
        # - raise IgnoreRequest: the process_exception() methods of the installed
        #   downloader middlewares will be called
        return None

    # Handle responses
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object: pass the response on as the result
        # - return a Request object: when the result looks wrong (usually judged by
        #   status code and content size), return a Request to send it back to the
        #   scheduler's queue to be revisited later
        # - raise IgnoreRequest
        return response

    # Handle exceptions, e.g. timeouts
    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception and move on to the next step
        # - return a Response object: stops the process_exception() chain; the
        #   process_response() methods of the installed middleware chain are called,
        #   and Scrapy will not call any other middleware's process_exception()
        # - return a Request object: stops the process_exception() chain; the request
        #   goes back to the scheduler's queue to be revisited later
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
```
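The middleware only takes effect once it is enabled. A minimal sketch of how that could look, assuming the class lives in `centoschina.middlewares`; the module path and the priority value 543 (the Scrapy template default) are assumptions:

```python
# settings.py (or the spider's custom_settings); module path and priority are assumed.
DOWNLOADER_MIDDLEWARES = {
    'centoschina.middlewares.CentoschinaDownloaderMiddleware': 543,
}
```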
- How to discard a request inside a DownloaderMiddleware.
  - Typical scenarios: a request fails and you want to switch the proxy, the cookies, and so on.

```python
# from scrapy.exceptions import IgnoreRequest
# raise IgnoreRequest(f'Failed to retrieve {request.url} after {max_retries} retries')
```
Example: handle a download exception and retry the request.

```python
import logging

from scrapy.exceptions import IgnoreRequest


class RetryExceptionMiddleware:
    def __init__(self):
        self.logger = logging.getLogger(__name__)

    def process_exception(self, request, exception, spider):
        # Log the exception
        self.logger.warning(f'Exception {exception} occurred while processing {request.url}')
        # Check whether the retry limit has been reached
        max_retries = 3
        retries = request.meta.get('retry_times', 0) + 1
        if retries <= max_retries:
            self.logger.info(f'Retrying {request.url} (retry {retries}/{max_retries})')
            # Bump the retry counter
            request.meta['retry_times'] = retries
            return request
        else:
            self.logger.error(f'Failed to retrieve {request.url} after {max_retries} retries')
            raise IgnoreRequest(f'Failed to retrieve {request.url} after {max_retries} retries')
```
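Note that Scrapy ships a similar mechanism out of the box: the built-in RetryMiddleware also tracks `retry_times` in `request.meta` and is configured through the `RETRY_TIMES` and `RETRY_HTTP_CODES` settings, so a custom middleware like this is mainly useful when you need extra behaviour on failure, such as switching proxies.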
Example: switch proxies.

```python
import logging
import random


class SwitchProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list
        self.logger = logging.getLogger(__name__)

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.get('PROXY_LIST')
        return cls(proxy_list)

    def process_exception(self, request, exception, spider):
        self.logger.warning(f'Exception {exception} occurred while processing {request.url}')
        # Switch to a randomly chosen proxy
        proxy = random.choice(self.proxy_list)
        self.logger.info(f'Switching proxy to {proxy}')
        request.meta['proxy'] = proxy
        # Retry the request
        return request
```
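This middleware expects a `PROXY_LIST` setting and, like any other middleware, has to be enabled. A minimal sketch, where the proxy addresses, the module path, and the priority value are placeholders:

```python
# settings.py; proxy addresses, module path and priority are placeholders.
PROXY_LIST = [
    'http://127.0.0.1:8888',
    'http://127.0.0.1:8889',
]
DOWNLOADER_MIDDLEWARES = {
    'centoschina.middlewares.SwitchProxyMiddleware': 544,
}
```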
- How to drop an item in a pipeline.
  - Typical scenarios: data cleaning, deduplication, validation, and so on.

```python
# from scrapy.exceptions import DropItem
# raise DropItem("Duplicate item found: %s" % item)
```
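A minimal deduplication pipeline along these lines; keying on the title field is an assumption about the item structure, and the pipeline would still need to be registered in `ITEM_PIPELINES`:

```python
from scrapy.exceptions import DropItem


class DuplicatesPipeline:
    """Sketch of a pipeline that drops questions whose title was already seen."""

    def __init__(self):
        self.seen_titles = set()

    def process_item(self, item, spider):
        title = item.get('title')
        if title in self.seen_titles:
            raise DropItem("Duplicate item found: %s" % item)
        self.seen_titles.add(title)
        return item
```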
- Saving results to a file (via a command):

```python
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'centoschina_cn', '-o', 'questions.csv'])
```
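Running this script from the project root is equivalent to running `scrapy crawl centoschina_cn -o questions.csv` on the command line; Scrapy infers the feed format from the file extension, so swapping the name for `questions.json` or `questions.jsonl` exports the same items in those formats instead.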