In the previous article we took a quick look at the Scrapy framework, its installation, and the project directory layout. In this chapter we will use Scrapy to crawl the blog list on the cnblogs homepage and process the scraped data.
First, create a Scrapy project in a directory of your choice from the command line:
scrapy startproject scrapyCnblogs
This generates the following directory structure:
Then enter the following in the terminal:
scrapy genspider cnblogs cnblogs.com
This generates a cnblogs.py file under scrapyCnblogs/spiders with the following code:
# -*- coding: utf-8 -*-
import scrapy


class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['cnblogs.com']
    start_urls = ['http://cnblogs.com/']

    def parse(self, response):
        pass
In the code above, allowed_domains restricts the scope of the crawler, start_urls is the list of URLs the crawl starts from, and the downloaded result is handed to the parse method for processing.
Our example will crawl the blog list on the cnblogs homepage at https://www.cnblogs.com/, which looks like this:
This time we will extract only three pieces of information from each entry in the blog list in the middle of the page: the blog title, its link, and the author, defined as title, link, and author respectively.
To pick information out of the page I use xpath, which I am most comfortable with; Scrapy has selector support for it built in, and it is very convenient to use. The xpath rules are covered here: https://www.cnblogs.com/weijiutao/p/10879871.html
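The xpath patterns used later in the spider can be tried on a toy snippet first. Below is a minimal sketch using the standard library's xml.etree.ElementTree, which supports the limited XPath subset we need (the real page is HTML and Scrapy uses its own selector engine; the snippet here is a hand-written imitation of the cnblogs entry structure, not the actual page):

```python
import xml.etree.ElementTree as ET

# Hand-written imitation of one entry in the cnblogs homepage list.
snippet = """
<div class="post_item_body">
  <h3><a class="titlelnk" href="https://example.com/post/1">Sample title</a></h3>
  <div class="post_item_foot">
    <a>Sample author</a>
  </div>
</div>
"""

root = ET.fromstring(snippet)
# The predicates mirror the expressions used in the spider below.
title_el = root.find("./h3/a[@class='titlelnk']")
author_el = root.find("./div[@class='post_item_foot']/a")

print(title_el.text)         # Sample title
print(title_el.get('href'))  # https://example.com/post/1
print(author_el.text)        # Sample author
```

The `.` at the start of each path means "relative to the current node", which is exactly how the per-entry expressions behave inside the spider's loop.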
First we use the browser's developer console to locate the fields we want to extract:
Based on the xpath expressions found above, change cnblogs.py to the following:
# -*- coding: utf-8 -*-
import scrapy


# Define a spider class
class CnblogsSpider(scrapy.Spider):
    # spider name
    name = 'cnblogs'
    # domains the spider is allowed to crawl
    allowed_domains = ['cnblogs.com']
    # starting url
    start_urls = ['https://www.cnblogs.com']

    def parse(self, response):
        # use Scrapy's built-in xpath to match the root node of every blog entry
        post_list = response.xpath("//div[@class='post_item_body']")
        # iterate over the matched nodes
        for post in post_list:
            # extract() converts the matched objects to Unicode strings;
            # without extract() the result is a list of selector objects
            # title
            title = post.xpath("./h3/a[@class='titlelnk']/text()").extract()[0]
            # link
            link = post.xpath("./h3/a[@class='titlelnk']/@href").extract()[0]
            # author
            author = post.xpath("./div[@class='post_item_foot']/a/text()").extract()[0]
            print(title + link + author)
In the code above we only had to fill in the allowed_domains and start_urls fields; Scrapy performs the crawl for us automatically and passes each result to the parse() method as response, where we filter out the information we want with the xpath selectors Scrapy provides.
Then run in the terminal:
scrapy crawl cnblogs
Here cnblogs is the value of the spider name we defined in the code above; the command starts that spider, and we can then see the printed results in the console:
This code is already far simpler than the crawlers we wrote in much earlier articles. Next, let's wire the other Scrapy files into the project.
Write the following code in scrapyCnblogs/items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapycnblogsItem(scrapy.Item):
    # define the fields for your item here like:
    # blog title
    title = scrapy.Field()
    # blog link
    link = scrapy.Field()
    # blog author
    author = scrapy.Field()
This file defines the structure of the information we want to extract: a ScrapycnblogsItem class with the three fields title, link, and author.
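A scrapy.Item behaves like a dict whose keys are restricted to the declared Fields: assigning an undeclared key raises a KeyError. A rough standard-library sketch of that behaviour (an illustration only, not Scrapy's actual implementation):

```python
class SketchItem(dict):
    """Rough imitation of scrapy.Item: only declared fields may be set."""
    fields = {'title', 'link', 'author'}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

item = SketchItem()
item['title'] = 'hello'
print(dict(item))      # {'title': 'hello'}

try:
    item['tags'] = []  # undeclared field
except KeyError as e:
    print('rejected:', e)
```

This restriction is useful in practice: a typo in a field name fails loudly in the spider instead of silently producing items with missing data.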
Next, change the cnblogs.py we just wrote to the following:
# -*- coding: utf-8 -*-
import scrapy
# import the ScrapycnblogsItem class
from scrapyCnblogs.items import ScrapycnblogsItem


# Define a spider class
class CnblogsSpider(scrapy.Spider):
    # spider name
    name = 'cnblogs'
    # domains the spider is allowed to crawl
    allowed_domains = ['cnblogs.com']
    # starting url
    start_urls = ['https://www.cnblogs.com']

    def parse(self, response):
        # use Scrapy's built-in xpath to match the root node of every blog entry
        post_list = response.xpath("//div[@class='post_item_body']")
        # iterate over the matched nodes
        for post in post_list:
            # extract() converts the matched objects to Unicode strings;
            # without extract() the result is a list of selector objects
            # title
            title = post.xpath("./h3/a[@class='titlelnk']/text()").extract()[0]
            # link
            link = post.xpath("./h3/a[@class='titlelnk']/@href").extract()[0]
            # author
            author = post.xpath("./div[@class='post_item_foot']/a/text()").extract()[0]

            # pack the extracted data into a ScrapycnblogsItem object
            item = ScrapycnblogsItem()
            item['title'] = title
            item['link'] = link
            item['author'] = author

            # hand the item over to the pipelines
            yield item
In the code above we import the ScrapycnblogsItem class we just defined, assign the extracted fields to an item, and finally yield it. This hands our data over to scrapyCnblogs/pipelines.py, so all that remains is to process the data in that file.
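Because parse() uses yield, it is a generator: Scrapy iterates over it and passes each yielded item through the enabled pipelines. The hand-off can be sketched like this, with plain dicts standing in for ScrapycnblogsItem (an illustration of the flow, not Scrapy's internals):

```python
def parse_sketch():
    # stand-in for the spider's parse(): yield one "item" per blog entry
    for n in (1, 2):
        yield {'title': f'post {n}', 'link': f'/p/{n}', 'author': 'someone'}

def process_item(item):
    # stand-in for a pipeline's process_item(): must return the item
    item['seen'] = True
    return item

# Scrapy plays the role of this loop: drain the generator, feed each
# item through the pipeline chain.
collected = [process_item(item) for item in parse_sketch()]
print(len(collected))  # 2
```

Nothing in parse_sketch() runs until the loop asks for the next item, which is why a spider can yield thousands of items without holding them all in memory at once.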
The code for pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import json


class ScrapycnblogsPipeline(object):
    # __init__ is optional; it initializes the pipeline
    def __init__(self):
        # open in text mode with an explicit encoding; in Python 3,
        # writing bytes to a text-mode file raises TypeError
        self.filename = open('cnblogs.json', 'w', encoding='utf-8')

    # process_item is required; it handles each item
    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.filename.write(text)
        return item

    # close_spider is optional; it is called when the spider finishes
    def close_spider(self, spider):
        self.filename.close()
In the code above, the process_item() method of the ScrapycnblogsPipeline class receives each item returned by cnblogs.py, and we write the received items to a cnblogs.json file.
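One thing to note about this output format: because each item is written as one json.dumps(...) line followed by ',\n', the resulting cnblogs.json is not a single valid JSON document; each line (minus its trailing comma) is. A small sketch of writing the file in the pipeline's format and reading it back line by line (file name and sample data are made up for illustration):

```python
import json
import os
import tempfile

items = [{'title': 't1', 'link': '/p/1', 'author': 'a1'},
         {'title': 't2', 'link': '/p/2', 'author': 'a2'}]

path = os.path.join(tempfile.mkdtemp(), 'cnblogs.json')

# write: the same format the pipeline above produces
with open(path, 'w', encoding='utf-8') as f:
    for item in items:
        f.write(json.dumps(item, ensure_ascii=False) + ',\n')

# read back: parse each line after stripping the trailing comma
with open(path, encoding='utf-8') as f:
    loaded = [json.loads(line.rstrip().rstrip(',')) for line in f if line.strip()]

print(loaded == items)  # True
```

If you want a file other tools can consume directly, writing one plain JSON object per line (JSON Lines, i.e. dropping the comma) is the more conventional choice.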
The last step is to enable the pipeline we defined in scrapyCnblogs/settings.py.
The settings.py code:
# -*- coding: utf-8 -*-

# Scrapy settings for scrapyCnblogs project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapyCnblogs'

SPIDER_MODULES = ['scrapyCnblogs.spiders']
NEWSPIDER_MODULE = 'scrapyCnblogs.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'scrapyCnblogs (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# wait 3 seconds between requests
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# define the request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     'scrapyCnblogs.middlewares.ScrapycnblogsSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'scrapyCnblogs.middlewares.ScrapycnblogsDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# item pipelines
ITEM_PIPELINES = {
    'scrapyCnblogs.pipelines.ScrapycnblogsPipeline': 300,  # order value; lower runs earlier
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
In the ITEM_PIPELINES setting above we enabled the ScrapycnblogsPipeline class, and we also set a download delay and default request headers. Without a delay, frequent requests can get us noticed and our IP banned; when crawling multiple pages it also gives each response time to be processed before the next request is made, helping avoid incomplete data. The headers make our requests look like they come from a real browser. Both settings improve the odds that our crawl succeeds.
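The 300 attached to the pipeline in ITEM_PIPELINES is its order value: Scrapy runs enabled pipelines in ascending order of this number (the conventional range is 0 to 1000), so a lower value runs earlier. With several pipelines, the execution order is simply a sort on those values. In the sketch below, only ScrapycnblogsPipeline exists in this project; the other two names are made up for illustration:

```python
# Hypothetical ITEM_PIPELINES with several entries.
ITEM_PIPELINES = {
    'scrapyCnblogs.pipelines.ScrapycnblogsPipeline': 300,
    'scrapyCnblogs.pipelines.HypotheticalCleanupPipeline': 100,
    'scrapyCnblogs.pipelines.HypotheticalExportPipeline': 800,
}

# Scrapy applies pipelines in ascending order of their value.
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for name in order:
    print(name.rsplit('.', 1)[-1])
# HypotheticalCleanupPipeline
# ScrapycnblogsPipeline
# HypotheticalExportPipeline
```

This is why cleanup or validation pipelines are typically given low numbers and export pipelines high ones: each pipeline's process_item() receives the item as the previous one returned it.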
Finally, run in the terminal:
scrapy crawl cnblogs
A cnblogs.json file is generated in our project directory, like this:
With that, we have completed a reasonably full Scrapy crawler for the blog list on the cnblogs homepage!