Overview
In day-to-day crawling we often need to scrape the sub-pages (detail pages) linked from the current page. An ordinary single-page spider is no longer enough for this, so today let's try crawling sub-pages.
Getting started
1. Create the project:
scrapy startproject pqejym
2. Generate the spider:
cd pqejym
scrapy genspider btdy www.btbtdy.net
3. Open PyCharm
Open the project directory in PyCharm.
4. Edit the settings.py file
ROBOTSTXT_OBEY = False
##This toggles robots.txt compliance; we set it to False so the spider is not restricted by the site's robots rules and can crawl more pages
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36'
##This is the USER_AGENT request header; a value like this can be copied from your browser's developer tools
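If you would rather not change the project-wide settings.py, the same two options can also be scoped to one spider through Scrapy's custom_settings class attribute. A minimal sketch (reusing the spider name generated above):

```python
import scrapy


class BtdySpider(scrapy.Spider):
    name = 'btdy'
    # Per-spider overrides; these take precedence over settings.py
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 6.1; WOW64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/44.0.2403.89 Safari/537.36'),
    }
```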
5. Write the spider
Analysis: since we are crawling sub-pages, our start URL is the listing page. To reach the sub-pages we first need to collect their links, and only then can we request and parse each sub-page's content. Let's start with a function that gathers the sub-page links:
def parse(self, response):
    links = response.xpath('//div[@class="cts_ms"]/p/a/@href')
    for link in links:
        print(link.extract())
        yield response.follow(link, self.parse_content)
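The href values here are site-relative paths (e.g. /btdy/dy10862.html). response.follow() accepts them directly because it resolves each one against the current page's URL before building the request, similar in spirit to urllib.parse.urljoin. A plain-Python sketch using one of the paths from the output below:

```python
from urllib.parse import urljoin

# response.follow(link, ...) resolves relative hrefs against the page URL;
# urljoin performs the same resolution for one of the scraped paths.
page_url = "http://www.btbtdy.net/"
href = "/btdy/dy10862.html"
absolute = urljoin(page_url, href)
print(absolute)  # http://www.btbtdy.net/btdy/dy10862.html
```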
We use XPath to pull out the link nodes and loop over them; follow() is a built-in Scrapy method. A note on how Scrapy handles a parse() method that uses yield:
First, Scrapy calls a parse() method containing the yield keyword iteratively, roughly equivalent to:
for n in parse(self, response):
    pass
Second, Python treats parse() as a generator: its body only starts executing on the first iteration, and each subsequent iteration (the for loop above) resumes the code at the yield to produce the next value. Let's try running it:
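This lazy behaviour is plain Python, not Scrapy-specific. A minimal sketch (the list of paths is just illustrative data):

```python
def parse_like(links):
    # Stand-in for parse(): any function containing yield is a generator function.
    for link in links:
        yield "followed:" + link

gen = parse_like(["/btdy/dy1.html", "/btdy/dy2.html"])
# Calling the function ran none of the body yet; iteration drives it,
# resuming at the yield each time -- exactly how Scrapy consumes parse().
results = list(gen)
print(results)
```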
scrapy crawl btdy
Partial output:
/btdy/dy10862.html
/btdy/dy10598.html
/btdy/dy10186.html
/btdy/dy10216.html
/btdy/dy9749.html
/btdy/dy8611.html
/btdy/dy11748.html
/btdy/dy6403.html
/btdy/dy5165.html
/btdy/dy6219.html
/btdy/dy5164.html
/btdy/dy4356.html
/btdy/dy1670.html
/btdy/dy1669.html
/btdy/dy1668.html
Next, let's write a function to parse each sub-page's content:
def parse_content(self, response):
    print(response.xpath('//title'))
    movie = PqejymItem()
    title = response.xpath('//h1/text()').extract()
    content = response.xpath('//div[@class="c05"]/span/text()').extract()
    magnet = response.xpath('//*[@id="nucms_downlist"]/div[2]/ul/li/span/a/@href').extract()
    movie['title'] = title
    movie['content'] = content
    movie['magnet'] = magnet
    yield movie
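Note in the run log further down that 'content' and 'magnet' come back as empty lists whenever an XPath matches nothing, because extract() always returns a list. A tiny helper mimicking the behaviour of Scrapy's extract_first()/get() makes the empty case explicit (a plain-Python sketch; the helper name is mine):

```python
def first_or_default(values, default=None):
    # extract() yields a list of strings; take the first or fall back.
    return values[0] if values else default

print(first_or_default(["愛情進化論"]))   # a matched title
print(first_or_default([], default=""))  # selector matched nothing
```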
And define the fields in items.py:
import scrapy


class PqejymItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    magnet = scrapy.Field()
The full btdy spider:
import scrapy
from pqejym.items import PqejymItem


class BtdySpider(scrapy.Spider):
    name = 'btdy'
    allowed_domains = ['www.btbtdy.net']
    start_urls = ['http://www.btbtdy.net/']

    def parse(self, response):
        links = response.xpath('//div[@class="cts_ms"]/p/a/@href')
        for link in links:
            print(link.extract())
            yield response.follow(link, self.parse_content)

    def parse_content(self, response):
        print(response.xpath('//title'))
        movie = PqejymItem()
        title = response.xpath('//h1/text()').extract()
        content = response.xpath('//div[@class="c05"]/span/text()').extract()
        magnet = response.xpath('//*[@id="nucms_downlist"]/div[2]/ul/li/span/a/@href').extract()
        movie['title'] = title
        movie['content'] = content
        movie['magnet'] = magnet
        yield movie
6. Run the spider
scrapy crawl btdy
Partial output:
2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy.net/btdy/dy13375.html>
{'content': [], 'magnet': [], 'title': ['那些年,我們正年輕']}
2018-08-24 17:53:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.btbtdy.net/btdy/dy13350.html> (referer: http://www.btbtdy.net/)
[<Selector xpath='//title' data='<title>愛情進化論全集-高清BT種子下載_迅雷下載-BT電影天堂</tit'>]
2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy.net/btdy/dy13315.html>
{'content': [], 'magnet': [], 'title': ['愛情進化論']}
2018-08-24 17:53:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.btbtdy.net/btdy/dy13379.html> (referer: http://www.btbtdy.net/)
[<Selector xpath='//title' data='<title>天盛長歌全集-高清BT種子下載_迅雷下載-BT電影天堂</titl'>]
2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy.net/btdy/dy13350.html>
{'content': [], 'magnet': [], 'title': ['天盛長歌']}
[<Selector xpath='//title' data='<title>夜天子全集-高清BT種子下載_迅雷下載-BT電影天堂</title'>]
2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy.net/btdy/dy13379.html>
{'content': [], 'magnet': [], 'title': ['夜天子']}
2018-08-24 17:53:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.btbtdy.net/btdy/dy13243.html> (referer: http://www.btbtdy.net/)
[<Selector xpath='//title' data='<title>進擊的巨人 第三季全集-高清BT種子下載_迅雷下載-BT電影天堂<'>]
2018-08-24 17:53:44 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy.net/btdy/dy13243.html>
{'content': [], 'magnet': [], 'title': ['進擊的巨人 第三季']}
2018-08-24 17:53:44 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-24 17:53:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 45111,