There are four main steps:
1. Run scrapy startproject project_name to create the project skeleton, then run scrapy genspider spider_name 'domain.com' to generate a spider template file (the generated layout is sketched below);
2. Edit the items.py file to declare the data fields you want to extract;
3. Write the spider under the spiders/ directory;
4. Write the pipelines.py file that stores the data, and remember to enable the ITEM_PIPELINES setting in settings.py.
Finally, start crawling:
scrapy crawl spider_name
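For reference, the project generated by startproject and genspider looks roughly like this (file names can vary slightly between Scrapy versions):

project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider_name.py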
Note: settings.py configuration
ROBOTSTXT_OBEY = False  # do not obey the robots.txt protocol
USER_AGENT = 'Mozilla/5.0 (iPhone 6s; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 MQQBrowser/7.6.0 Mobile/14E304 Safari/8536.25 MttCustomUA/2 QBWebViewType/1 WKType/1'  # user agent setting
The following example crawls data from the Douyu API.
Data API of the Douyu video site:
http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset=10
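Before writing the spider, it helps to sanity-check what the API returns. A minimal sketch using only the standard library, assuming the endpoint returns JSON with a top-level 'data' list whose entries carry the 'nickname' and 'room_src' fields used by the spider below:

import json
import urllib.request

# Fetch one page of the Douyu API and print a few fields to inspect the payload.
url = 'http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset=0'
with urllib.request.urlopen(url) as resp:
    payload = json.loads(resp.read().decode('utf-8'))

for room in payload.get('data', [])[:3]:
    print(room.get('nickname'), room.get('room_src'))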
items.py file:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class DouyuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    nickname = scrapy.Field()  # streamer nickname
    link = scrapy.Field()      # room cover image URL (room_src)
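scrapy.Item instances behave like dicts, which is how the spider below fills them in; a quick illustration with made-up values:

item = DouyuItem()
item['nickname'] = 'some_streamer'              # hypothetical value
item['link'] = 'https://example.com/cover.jpg'  # hypothetical value
print(dict(item))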
douyu_spider.py file:
# -*- coding: utf-8 -*-
import scrapy
import json
from douyu.items import DouyuItem


class DouyuSpiderSpider(scrapy.Spider):
    # spider name used by "scrapy crawl"
    name = 'douyu_spider'
    # allowed domains (optional)
    allowed_domains = ['douyucdn.cn']
    # pieces used to assemble the request url
    base_url = 'http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset='
    offset = 0
    # urls to start crawling from
    start_urls = [base_url + str(offset)]

    # parse callback
    def parse(self, response):
        # response.body is raw bytes, so decode it to utf-8 before parsing the JSON
        results = json.loads(response.body.decode('utf-8'))['data']
        if len(results) == 0:
            # no more data: stop scheduling further pages
            print('crawling over! spider stop!')
            return
        for li in results:
            item = DouyuItem()
            item['nickname'] = li['nickname']
            item['link'] = li['room_src']
            # yield the item to the pipelines
            yield item
        # follow the next page
        self.offset += 20
        href = self.base_url + str(self.offset)
        # note: yield a scrapy.Request() here so Scrapy schedules the next page
        yield scrapy.Request(href, callback=self.parse)
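The spider above simply returns when the API stops sending data. If you prefer to shut the spider down explicitly, Scrapy also offers the CloseSpider exception; a minimal sketch of that variant of the empty-result branch:

from scrapy.exceptions import CloseSpider

# inside parse(), instead of the plain return:
if len(results) == 0:
    raise CloseSpider('no more data returned by the Douyu API')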
imagepipelines.py file:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
import os
from scrapy.pipelines.images import ImagesPipeline
# used to read the constants defined in settings.py
from scrapy.utils.project import get_project_settings


class ImagePipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # note: yield a Request for each image URL to download
        yield scrapy.Request(item['link'])

    def item_completed(self, results, item, info):
        '''format of results:
        [(True, {'checksum': '1e2cc73f256eff17f2d6a69388421f97',
                 'path': 'full/d8de0a55eb41d9a4ac4a39a0d1fc008f360d8b98.jpg',
                 'url': 'https://rpic.douyucdn.cn/appCovers/2017/09/24/630154_20170924201445_small.jpg'})]
        '''
        # read the constants defined in settings.py
        settings = get_project_settings()
        images_folder = settings['IMAGES_STORE']
        # relative paths of the successfully downloaded images, e.g.
        # ['full/b008d9e23bdb013f672ca7527d704e049bf0c87c.jpg']
        data = [x['path'] for ok, x in results if ok]
        if not data:
            # download failed, keep the item unchanged
            return item
        # rename the downloaded image after the streamer's nickname
        image_ext = data[0].split('.')[-1]
        os.rename(images_folder + data[0], images_folder + item['nickname'] + '.' + image_ext)
        os.rmdir(images_folder + 'full')
        # return the item so it is passed on to the next pipeline
        return item
Note: this file subclasses Scrapy's image-download pipeline ImagesPipeline and overrides get_media_requests() and item_completed(). ImagesPipeline relies on the Pillow library, so make sure it is installed.
After writing the imagepipelines.py file, configure settings.py:
# directory where downloaded images are stored
IMAGES_STORE = './images/'

USER_AGENT = 'Mozilla/5.0 (iPhone 6s; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 MQQBrowser/7.6.0 Mobile/14E304 Safari/8536.25 MttCustomUA/2 QBWebViewType/1 WKType/1'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'douyu.pipelines.DouyuPipeline': 300,
    'douyu.imagepipelines.ImagePipeline': 100,
}
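ITEM_PIPELINES above also registers douyu.pipelines.DouyuPipeline, whose code is not listed in this post. A minimal sketch of what such a data-saving pipeline in pipelines.py could look like (the JSON-lines output file name is an assumption):

# -*- coding: utf-8 -*-
import json


class DouyuPipeline(object):
    # hypothetical example: append each item to a JSON-lines file

    def open_spider(self, spider):
        self.file = open('douyu_rooms.json', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()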