Crawling Sub-pages

Posted by 小杰哥001 on 2018-08-24

Overview

In everyday crawling we often need to follow links from the current page into its sub-pages. An ordinary single-page spider can't handle this, so today we'll try crawling sub-pages.

Getting Started

1. Create the project:

scrapy startproject pqejym

2. Create the spider:

cd pqejym
scrapy genspider btdy www.btbtdy.net

3. Open PyCharm

Open the project directory in PyCharm.

4. Configure the settings.py file

ROBOTSTXT_OBEY = False
## This is the robots.txt rule; we set it to False (don't obey) so the spider can fetch more pages
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36'
## This is the USER_AGENT request header; the value can be obtained from your browser's developer tools
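Taken together, the relevant excerpt of settings.py looks like this; note that the DOWNLOAD_DELAY line is my own optional addition (not part of the original setup) to throttle requests politely:

```python
# pqejym/settings.py (excerpt)

# Don't obey robots.txt so more pages can be fetched (use responsibly)
ROBOTSTXT_OBEY = False

# Identify as a regular browser; value copied from browser developer tools
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36')

# Optional (my assumption, not in the original): pause between requests
DOWNLOAD_DELAY = 0.5
```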

5. Write the spider

Analysis: since we are crawling sub-pages, our start URL is the listing page. To reach the sub-pages we first have to collect their links, then parse each sub-page for its content. Let's start with a method that collects the sub-page links:

def parse(self, response):
    links = response.xpath('//div[@class="cts_ms"]/p/a/@href')
    for link in links:
        print(link.extract())
        yield response.follow(link, self.parse_content)

We parse with XPath to get the link selectors, then loop over them; follow() is a built-in Scrapy method that builds a Request from each (possibly relative) link. A quick look at how Scrapy handles the yield loop over page URLs:
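One convenience of response.follow over building a scrapy.Request by hand is that it resolves relative links such as /btdy/dy10862.html against the current page URL. The resolution itself is ordinary URL joining, which the standard library can illustrate (a rough sketch of the idea, not Scrapy's actual code):

```python
from urllib.parse import urljoin

# response.follow resolves relative links against the page URL,
# much like urljoin does:
base = 'http://www.btbtdy.net/'
relative = '/btdy/dy10862.html'

absolute = urljoin(base, relative)
print(absolute)  # http://www.btbtdy.net/btdy/dy10862.html
```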

First, the Scrapy framework calls a parse() method containing the yield keyword iteratively, roughly equivalent to:

for n in parse(self, response):
    pass

Second, Python treats parse() as a generator: calling it returns a generator without running its body, and the code only executes as each iteration (the for loop above) advances to the next yield, producing one value at a time. Let's try running it:
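This lazy behaviour is plain Python, independent of Scrapy; a minimal generator shows it:

```python
def gen_links(links):
    # This body does not run until the generator is iterated
    print('started')
    for link in links:
        yield link

g = gen_links(['/btdy/dy10862.html', '/btdy/dy10598.html'])
# Nothing printed yet: calling the function only created a generator object
first = next(g)    # now 'started' is printed and the first link is yielded
print(first)       # /btdy/dy10862.html
```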

 scrapy crawl btdy

Partial output:

/btdy/dy10862.html
/btdy/dy10598.html
/btdy/dy10186.html
/btdy/dy10216.html
/btdy/dy9749.html
/btdy/dy8611.html
/btdy/dy11748.html
/btdy/dy6403.html
/btdy/dy5165.html
/btdy/dy6219.html
/btdy/dy5164.html
/btdy/dy4356.html
/btdy/dy1670.html
/btdy/dy1669.html
/btdy/dy1668.html

Next, a method to parse the content of each sub-page:

def parse_content(self, response):
    print(response.xpath('//title'))
    movie = PqejymItem()
    title = response.xpath('//h1/text()').extract()
    content = response.xpath('//div[@class="c05"]/span/text()').extract()
    magnet = response.xpath('//*[@id="nucms_downlist"]/div[2]/ul/li/span/a/@href').extract()
    movie['title'] = title
    movie['content'] = content
    movie['magnet'] = magnet
    yield movie


We define the corresponding fields in items.py:

import scrapy


class PqejymItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    magnet = scrapy.Field()


The complete btdy spider:

import scrapy
from pqejym.items import PqejymItem


class BtdySpider(scrapy.Spider):
    name = 'btdy'
    allowed_domains = ['www.btbtdy.net']
    start_urls = ['http://www.btbtdy.net/']

    def parse(self, response):
        links = response.xpath('//div[@class="cts_ms"]/p/a/@href')
        for link in links:
            print(link.extract())
            yield response.follow(link, self.parse_content)

    def parse_content(self, response):
        print(response.xpath('//title'))
        movie = PqejymItem()
        title = response.xpath('//h1/text()').extract()
        content = response.xpath('//div[@class="c05"]/span/text()').extract()
        magnet = response.xpath('//*[@id="nucms_downlist"]/div[2]/ul/li/span/a/@href').extract()
        movie['title'] = title
        movie['content'] = content
        movie['magnet'] = magnet
        yield movie

6. Run the spider

scrapy crawl btdy

Partial output:

2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy.net/btdy/dy13375.html>
{'content': [], 'magnet': [], 'title': ['那些年,我們正年輕']}
2018-08-24 17:53:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.btbtdy.net/btdy/dy13350.html> (referer: http://www.btbtdy.net/)
[<Selector xpath='//title' data='<title>愛情進化論全集-高清BT種子下載_迅雷下載-BT電影天堂</tit'>]
2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy.net/btdy/dy13315.html>
{'content': [], 'magnet': [], 'title': ['愛情進化論']}
2018-08-24 17:53:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.btbtdy.net/btdy/dy13379.html> (referer: http://www.btbtdy.net/)
[<Selector xpath='//title' data='<title>天盛長歌全集-高清BT種子下載_迅雷下載-BT電影天堂</titl'>]
2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy.net/btdy/dy13350.html>
{'content': [], 'magnet': [], 'title': ['天盛長歌']}
[<Selector xpath='//title' data='<title>夜天子全集-高清BT種子下載_迅雷下載-BT電影天堂</title'>]
2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy.net/btdy/dy13379.html>
{'content': [], 'magnet': [], 'title': ['夜天子']}
2018-08-24 17:53:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.btbtdy.net/btdy/dy13243.html> (referer: http://www.btbtdy.net/)
[<Selector xpath='//title' data='<title>進擊的巨人 第三季全集-高清BT種子下載_迅雷下載-BT電影天堂<'>]
2018-08-24 17:53:44 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy.net/btdy/dy13243.html>
{'content': [], 'magnet': [], 'title': ['進擊的巨人 第三季']}
2018-08-24 17:53:44 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-24 17:53:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 45111,

