[Python] Scrapy Crawler

Posted by Geeks_Chen on 2020-11-29

Original URL: https://blog.csdn.net/qq_31942007/article/details/110288194


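Introduction

Scrapy is an application framework for crawling websites or API data and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.

This article uses Scrapy to crawl https://www.yelp.com/ and download the menu images of each restaurant in its [Restaurants] section.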

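1. Installing the tools

1.1 Install the Python environment
Macs generally ship with Python 2.7, so no separate installation is needed. (Scrapy 2.0 and later require Python 3.)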

1.2 Install pip

sudo easy_install pip

Then install Scrapy itself with pip:

pip install scrapy

1.3 Install PyCharm
https://www.jetbrains.com/pycharm/


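2. Create the project

The project name must match the package used by the imports and settings in the following sections (imageDemo):

scrapy startproject imageDemo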

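3. Start crawling

3.1 Define the Item

import scrapy


class MenuImageItem(scrapy.Item):
    # Fields consumed by the image pipeline
    image_urls = scrapy.Field()  # list of image URLs to download
    image_name = scrapy.Field()  # folder name used when saving the images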

3.2 Configure the pipeline

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class TestspiderPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Issue one download request per image URL, passing the item and
        # the image's position in the list along in meta.
        for index, image_url in enumerate(item['image_urls']):
            yield Request(image_url, meta={'item': item, 'index': index})

    def file_path(self, request, response=None, info=None, item=None):
        # Newer Scrapy versions pass the item directly; otherwise fall back
        # to the one stored in meta by get_media_requests.
        if item is None:
            item = request.meta['item']
        # Use the last URL segment (e.g. xxx.jpg) as the file name
        image_guid = request.url.split('/')[-1]
        # Path relative to IMAGES_STORE: source/<image_name>/images/<file name>
        img_file_name = u'source/{0}/images/{1}'.format(item['image_name'], image_guid)

        return img_file_name
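
To see what path `file_path` produces without running a crawl, the same string logic can be tried in isolation. This is only a minimal sketch: `build_image_path` is a hypothetical helper, not part of Scrapy, and the URLs are made-up placeholders. The optional `index` argument illustrates one way the index passed through `meta` could keep files apart when several URLs end in the same file name.

```python
def build_image_path(url, image_name, index=None):
    """Mirror the path logic of file_path above (hypothetical helper)."""
    # Last URL segment, e.g. 'o.jpg'
    image_guid = url.split('/')[-1]
    if index is not None:
        # Optional: prefix the position in image_urls to avoid name clashes
        image_guid = '{0}_{1}'.format(index, image_guid)
    # The result is relative to IMAGES_STORE
    return u'source/{0}/images/{1}'.format(image_name, image_guid)


# Placeholder URLs, assumed for illustration only
urls = [
    'https://example.com/photos/aaa/o.jpg',
    'https://example.com/photos/bbb/o.jpg',
]
paths = [build_image_path(u, 'peter-luger') for u in urls]
indexed = [build_image_path(u, 'peter-luger', i) for i, u in enumerate(urls)]
print(paths)    # both entries collide on the same file name
print(indexed)  # the index prefix keeps them distinct
```

If the real image URLs do share a final segment, the indexed variant is one possible way to avoid later downloads overwriting earlier ones.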

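3.3 Configure settings

ITEM_PIPELINES = {
    # 'imageDemo.pipelines.ImagedemoPipeline': 300,
    # 'imageDemo.pipelines.MenuImagePipeline': 300,
    # Must match the pipeline class defined in pipelines.py
    'imageDemo.pipelines.TestspiderPipeline': 300,
}
# Root directory under which the downloaded images are saved
IMAGES_STORE = '/Users/chenxiao/Desktop/'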
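
The `IMAGES_STORE` path above is specific to the author's machine. As a small sketch, the same location can be derived from the current user's home directory instead. It assumes a `Desktop` folder is the desired target, as in the original setting.

```python
import os

# Assumption: images should land on the current user's Desktop, as in the
# original setting, but without hard-coding the user name.
IMAGES_STORE = os.path.join(os.path.expanduser('~'), 'Desktop')

print(IMAGES_STORE)
```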

3.4 Write the spider


import scrapy
from imageDemo.items import MenuImageItem
# from imageDemo.items import ImagedemoItem  # only used by the commented-out example below

# The Python 2 only reload(sys) / sys.setdefaultencoding('utf-8') hack is omitted:
# this spider does not need it, and it fails on Python 3.


class ImageSpider(scrapy.Spider):
    name = "imageSpider"
    allowed_domains = ["yelp.com"]
    start_urls = [
        # "https://www.yelp.com/search?cflt=restaurants&find_loc=San%20Francisco%2C%20CA&start=0",
        # "https://www.yelp.com/biz_photos/peoples-bistro-san-francisco-4?tab=menu",
        "https://www.yelp.com/biz_photos/peter-luger-brooklyn-2?tab=menu",
    ]

    def parse(self, response):
        # Collect the src attribute of every menu photo on the page
        menu_list = response.xpath('//*[@id="super-container"]/div[2]/div/div[2]/div[2]/ul/li/div/img/@src').extract()
        print(menu_list)
        yield MenuImageItem(image_urls=menu_list, image_name='peter-luger')

        # Earlier experiment: extract restaurant names from the search results page
        # names = response.xpath('//*[@id="wrap"]/div[3]/div[2]/div[2]/div/div[1]/div[1]/div/ul/li/div/div/div/div/div[2]/div[1]/div[1]/div[1]/div[1]/h3/a[@class="lemon--a__373c0__IEZFH link__373c0__29943 link-color--blue-dark__373c0__1mhJo link-size--inherit__373c0__2JXk5"]/text()').extract()
        # for name in names:
        #     print(name)
        #     yield ImagedemoItem(name=name)
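
The spider hard-codes `image_name='peter-luger'`, so switching to one of the other start URLs would still save into the same folder. One possible refinement, sketched below, derives the folder name from the page URL. `business_slug` is a hypothetical helper, and it assumes the URL path ends with the business identifier, as it does in the `start_urls` above.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def business_slug(url):
    """Return the last path segment of a biz_photos URL (hypothetical helper)."""
    path = urlparse(url).path  # query string such as ?tab=menu is excluded
    return path.rstrip('/').split('/')[-1]


slug = business_slug("https://www.yelp.com/biz_photos/peter-luger-brooklyn-2?tab=menu")
print(slug)  # peter-luger-brooklyn-2
```

Inside `parse`, this could be used as `image_name=business_slug(response.url)`.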

4. Download the resources

scrapy crawl imageSpider
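
Once the crawl finishes, the images should sit under `source/<image_name>/images/` inside `IMAGES_STORE`, given the `file_path` logic in section 3.2. A quick way to check is sketched here. `list_downloaded` is a hypothetical helper, and the demo uses a temporary directory as a stand-in for the real `IMAGES_STORE`.

```python
import os
import tempfile


def list_downloaded(images_store, image_name):
    """List files under <images_store>/source/<image_name>/images (hypothetical helper)."""
    target = os.path.join(images_store, 'source', image_name, 'images')
    if not os.path.isdir(target):
        return []
    return sorted(os.listdir(target))


# Demo: a temporary directory stands in for IMAGES_STORE
store = tempfile.mkdtemp()
demo_dir = os.path.join(store, 'source', 'peter-luger', 'images')
os.makedirs(demo_dir)
open(os.path.join(demo_dir, 'menu1.jpg'), 'w').close()

found = list_downloaded(store, 'peter-luger')
missing = list_downloaded(store, 'no-such-restaurant')
print(found)
print(missing)
```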


5. Project source code

https://github.com/GeeksChen/ScrapyDemo

## Note:
1. command not found: scrapy
https://blog.csdn.net/EDDYCJY/article/details/77482228
Locate the scrapy executable under your Python bin directory, then create a symlink to it in a directory on your PATH:
ln -s /Users/xinyinhe/Library/Python/2.7/bin/scrapy /usr/local/bin/scrapy
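
If it is unclear where pip placed the scrapy script, the likely locations can be printed from Python. This is a rough sketch for macOS/Linux only: `candidate_scrapy_paths` is a hypothetical helper, and the `bin` layout it assumes may differ on other setups.

```python
import os
import site
import sys


def candidate_scrapy_paths():
    """Return places where a pip-installed scrapy script commonly lives
    on macOS/Linux (hypothetical helper; the layout is an assumption)."""
    return [
        # Scripts from `pip install --user` go under the user base directory
        os.path.join(site.getuserbase(), 'bin', 'scrapy'),
        # Scripts from a system-wide or virtualenv install
        os.path.join(sys.prefix, 'bin', 'scrapy'),
    ]


candidates = candidate_scrapy_paths()
for path in candidates:
    status = 'found' if os.path.exists(path) else 'missing'
    print('{0}: {1}'.format(path, status))
```

Whichever path is reported as found can then be used as the source of the `ln -s` command.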
