Class Notes

Posted by hcxuke on 2020-11-11


Basic workflow of the Scrapy framework

  • Create a project: scrapy startproject dushu

  • Create the spider: cd dushu; scrapy genspider guoxue "www.dushu.com"

  • Open guoxue.py and start writing the code.

    import scrapy

    from dushu.items import DushuItem


    class GuoxueSpider(scrapy.Spider):
        name = 'guoxue'
        allowed_domains = ['www.dushu.com']
        # Start URL; this usually needs to be changed.
        start_urls = ['https://www.dushu.com/book/1617.html']

        def parse(self, response):
            # Collect the links to the detail pages.
            detail_url_list = response.xpath('//div[@class="book-info"]//h3/a/@href')
            for detail_url in detail_url_list.getall():
                detail = 'https://www.dushu.com' + detail_url
                yield scrapy.Request(url=detail, callback=self.detail_parse)

            # Follow the next pages of the listing.
            for i in range(2, 11):
                next_page = 'https://www.dushu.com/book/1617_%d.html' % i
                yield scrapy.Request(url=next_page, callback=self.parse)

        # Parse the content of a detail page.
        def detail_parse(self, response):
            book_title = response.xpath('string(//div[@class="book-title"])').extract_first()
            book_img = response.xpath('//div[@class="book-pic"]//img/@src').extract_first()
            price = response.xpath('//p[@class="price"]/span/text()').extract_first()
            author = response.xpath('string(//div[@class="book-details-left"]//table/tbody/tr[1]/td[2])').extract_first()
            book_brief, author_brief = response.xpath('//div[contains(@class, "txtsummary")]/text()')[:2].extract()
            book_brief, author_brief = book_brief.strip(), author_brief.strip()
            item = DushuItem()
            item['book_title'] = book_title
            item['book_img'] = book_img
            item['price'] = price
            item['author'] = author
            item['book_brief'] = book_brief
            item['author_brief'] = author_brief
            yield item
    
  • scrapy shell: use this shell to debug extraction code interactively (e.g. scrapy shell "https://www.dushu.com/book/1617.html" drops you into an interpreter with a live response object to try XPath expressions against).




- CrawlSpider (Scrapy's rule-based spider class)
