Getting Started with Scrapy: Your First Crawler Project

Posted by blue_zy on 2018-07-23

Original URL: https://blog.csdn.net/blue_zy/article/details/81161068

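Preface

After all the trouble of installing Scrapy, of course I wanted to try it right away. For installation details, see: Installing Scrapy on Mac and Pitfalls Encountered

This article builds a complete crawler step by step, covering:

Creating a project with Scrapy
Crawling an entire web page with Scrapy
Extracting the required elements with Scrapy
Saving data to a JSON file with Scrapy

This corresponds to the basics part of a Scrapy beginner tutorial. If you want to learn Scrapy, a powerful crawling framework, a little Python syntax is all you need to follow along.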

Creating the Project

A single command creates a Scrapy project named tutorial:

scrapy startproject tutorial

Running tree tutorial from the same directory shows the structure of the whole project:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

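That completes the first step. Next, run

cd tutorial
scrapy genspider QuoteSpider quotes.toscrape.com

to create the spider template file. QuoteSpider is the file name and quotes.toscrape.com is the domain to crawl, which genspider requires as its second argument. When run inside a project, the command generates QuoteSpider.py in the project's spiders directory.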

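Then open the tutorial project in PyCharm.

Project structure:
(screenshot: project structure in PyCharm)

We write the crawler code in QuoteSpider.py. The other files each have their own purpose, but this article does not need them yet, so they are not covered here. Readers who want the details can refer to this Zhihu article: Scrapy Crawler Framework Tutorial (1): Getting Started with Scrapy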

Crawling an Entire Page

Edit QuoteSpider.py:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

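name is the spider's name; it serves as a unique identifier and must not be duplicated within a project
parse() handles page parsing; here we simply save each page as an HTML file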
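
To make the file naming in parse() concrete, the sketch below repeats the same string operations outside Scrapy. The helper name page_filename is hypothetical and exists only for illustration.

```python
def page_filename(url):
    """Build the output file name the same way parse() does."""
    # For '.../page/1/' the split ends with ['page', '1', ''],
    # so index -2 is the page number.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


start_urls = [
    "http://quotes.toscrape.com/page/1/",
    "http://quotes.toscrape.com/page/2/",
]

filenames = [page_filename(url) for url in start_urls]
print(filenames)
```

Note that this relies on the trailing slash in the URL; without it, index -2 would pick up 'page' rather than the number.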

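With the code written, how do we run it?

In the terminal, enter:

scrapy crawl quotes

Two new HTML files, quotes-1.html and quotes-2.html, then appear in the current directory:
(screenshot: the two generated HTML files)

That completes our simplest possible crawler. To recap:

scrapy startproject name creates the project
scrapy genspider spider_name domain creates the spider file
Write the code
scrapy crawl spider_name runs the spider (spider_name is the spider's name attribute, quotes in this example)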

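Extracting the Required Elements

By crawling whole pages we have covered the most basic crawler, but in practice we usually need only the part of a page that is useful to us, which means filtering and selecting elements. In Scrapy, we can use the shell to try out element selection interactively.

For example:

scrapy shell 'http://quotes.toscrape.com/page/1/'

You will see output like the following: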

2018-07-22 23:59:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x103186400>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x10400f7f0>
[s]   spider     <DefaultSpider 'default' at 0x1042f6dd8>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>

Here we can use CSS selectors to pick out the elements we need. For example:

>>> response.css('title::text').extract_first()
'Quotes to Scrape'

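title: a tag in the page, i.e. the element we want
extract_first(): returns the first matched value for title, because css() returns a list of matches
As an experiment, write response.css('title::text').extract() instead
The output is ['Quotes to Scrape'], which shows that the return value is a list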

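Every other element can be obtained the same way, so next we put this into practice with a complete example.

Extracting the Required Elements and Saving Them to a JSON File

Goal: extract the text, author, and tags elements from the page and save them

Step 1: extract the required elements (with CSS selectors)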

        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

The hierarchy of these tags can be found by viewing the source of the saved HTML files.

Step 2: the complete QuoteSpider.py code

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

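Step 3: save the data to a JSON file

Use the following command:

scrapy crawl quotes -o quotes.json

-o: the output option
quotes.json: the name of the JSON file to write

The final saved result:
(screenshot: contents of quotes.json)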
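
Once quotes.json exists, it can be read back with the standard library. The sketch below assumes the file holds a JSON array with one object per yielded item; sample_items is made-up data with the same keys as the spider's output, not real crawl results, and load_quotes and quotes_by_author are hypothetical helper names.

```python
import json
import os
import tempfile

# Made-up records with the same keys the spider yields; not real crawl output.
sample_items = [
    {"text": "A sample quote.", "author": "Sample Author", "tags": ["one", "two"]},
    {"text": "Another sample quote.", "author": "Other Author", "tags": []},
]


def load_quotes(path):
    """Load the list of item dicts from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def quotes_by_author(items):
    """Group quote texts under their author."""
    grouped = {}
    for item in items:
        grouped.setdefault(item["author"], []).append(item["text"])
    return grouped


# Write the sample to a temporary file so the sketch runs without a crawl.
path = os.path.join(tempfile.mkdtemp(), "quotes.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample_items, f)

loaded = load_quotes(path)
by_author = quotes_by_author(loaded)
print(by_author)
```

To use it on real output, pass the path of the quotes.json written by scrapy crawl to load_quotes instead of the temporary file.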

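Summary

Judging by the demo above, Scrapy is fairly easy to get started with: it needs no extra configuration and no complicated templates, so it works more or less out of the box, and writing the code is simple too. Reading the official documentation while trying things out, I got this far in about two hours one evening, which I'm quite happy with (a cheer for Scrapy!). The more advanced features will take more practice and more pitfalls to learn from. Let's keep at it together!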
