[Python] Scrapy Crawler

Posted by Geeks_Chen on 2020-11-29

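Introduction

Scrapy is an application framework written for crawling websites or APIs and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.

This article uses Scrapy to crawl https://www.yelp.com/ and download the menu images of each restaurant in the [Restaurants] section.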


1. Install the tools

1.1 Install Python
At the time of writing, macOS shipped with Python 2.7, so no separate installation was needed.

1.2 Install pip

sudo easy_install pip

Then install Scrapy itself with pip:

pip install scrapy

1.3 Install PyCharm
https://www.jetbrains.com/pycharm/


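2. Create the project

The project is named imageDemo because the imports and settings in the following sections refer to the imageDemo package.

scrapy startproject imageDemo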

3. Write the crawler

3.1 Define the Item

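import scrapy


class MenuImageItem(scrapy.Item):
    # URLs of the images to download (read by the images pipeline)
    image_urls = scrapy.Field()
    # Restaurant name, used as the folder name for the downloaded images
    image_name = scrapy.Field()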

3.2 Configure the pipeline

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class TestspiderPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Request every image URL and pass the item along in meta so that
        # file_path() can read it when the download is stored
        for index, image_url in enumerate(item['image_urls']):
            yield Request(image_url, meta={'item': item, 'index': index})

    def file_path(self, request, response=None, info=None, **kwargs):
        # **kwargs absorbs the extra item= keyword that newer Scrapy versions pass
        item = request.meta['item']  # the item passed through meta above
        index = request.meta['index']  # position of this image in image_urls (not used below)
        # File name: the last segment of the image URL, e.g. xxx.jpg
        image_guid = request.url.split('/')[-1]
        # Storage path relative to IMAGES_STORE: source/<image_name>/images/<file name>
        img_file_name = u'source/{0}/images/{1}'.format(item['image_name'], image_guid)

        return img_file_name
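
The path logic in file_path() can be tried out without running a crawl. The sketch below is a plain-Python stand-in and not part of Scrapy: the helper name build_image_path and the sample URLs are hypothetical. Prefixing the list index is one optional way to keep files apart when several URLs end in the same file name (the pipeline above stores the index in meta but does not use it).

```python
import posixpath
from urllib.parse import urlsplit


def build_image_path(image_name, url, index=None):
    """Return a storage path relative to IMAGES_STORE for one image URL."""
    # Take the last path segment of the URL, ignoring any query string.
    file_name = posixpath.basename(urlsplit(url).path)
    if index is not None:
        # Optional: prefix the position in image_urls to avoid name collisions.
        file_name = "{0}_{1}".format(index, file_name)
    return "source/{0}/images/{1}".format(image_name, file_name)


# Hypothetical URLs, used only for illustration.
urls = [
    "https://example.com/photos/abc/menu.jpg",
    "https://example.com/photos/def/menu.jpg?size=large",
]
paths = [build_image_path("peter-luger", u, i) for i, u in enumerate(urls)]
print(paths)
```

Without the index both URLs would map to the same menu.jpg and the second download would replace the first, so whether a prefix is needed depends on how the site names its image files.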

3.3 Configure settings

ITEM_PIPELINES = {
    # Must match the pipeline class defined in pipelines.py (section 3.2)
    'imageDemo.pipelines.TestspiderPipeline': 300,
}
# Root folder for downloaded images; file_path() returns paths relative to it
IMAGES_STORE = '/Users/chenxiao/Desktop/'
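
The absolute path in IMAGES_STORE only exists on the author's machine. Since settings.py is ordinary Python, one possible alternative is to derive the folder from the current user's home directory. This is a sketch under that assumption; the file name menu.jpg is hypothetical.

```python
import os

# Assumption: downloads should go to the Desktop folder of whoever runs the crawl.
IMAGES_STORE = os.path.join(os.path.expanduser("~"), "Desktop")

# file_path() in section 3.2 returns a path relative to IMAGES_STORE, so an image
# named menu.jpg (hypothetical) for 'peter-luger' would be saved here:
relative_path = "source/peter-luger/images/menu.jpg"
full_path = os.path.join(IMAGES_STORE, *relative_path.split("/"))
print(full_path)
```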

3.4 Write the spider


import scrapy
from imageDemo.items import MenuImageItem


class imageSpider(scrapy.Spider):
    name = "imageSpider"
    allowed_domains = ["yelp.com"]
    start_urls = [
        # "https://www.yelp.com/search?cflt=restaurants&find_loc=San%20Francisco%2C%20CA&start=0",
        # "https://www.yelp.com/biz_photos/peoples-bistro-san-francisco-4?tab=menu",
        "https://www.yelp.com/biz_photos/peter-luger-brooklyn-2?tab=menu",
    ]

    def parse(self, response):
        # Collect the src attribute of every menu image on the page
        menuList = response.xpath('//*[@id="super-container"]/div[2]/div/div[2]/div[2]/ul/li/div/img/@src').extract()
        print(menuList)
        # One item per restaurant; the pipeline downloads every URL in image_urls
        yield MenuImageItem(image_urls=menuList, image_name='peter-luger')

        # Alternative for the search results page (first commented URL above): extract the
        # restaurant names. Requires importing ImagedemoItem from imageDemo.items.
        # names = response.xpath('//*[@id="wrap"]/div[3]/div[2]/div[2]/div/div[1]/div[1]/div/ul/li/div/div/div/div/div[2]/div[1]/div[1]/div[1]/div[1]/h3/a[@class="lemon--a__373c0__IEZFH link__373c0__29943 link-color--blue-dark__373c0__1mhJo link-size--inherit__373c0__2JXk5"]/text()').extract()
        # for name in names:
        #     print(name)
        #     yield ImagedemoItem(name=name)
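
To see what the ul/li/div/img/@src tail of the XPath selects, the expression can be tried on a small hand-written snippet. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector (it supports only a subset of XPath and needs well-formed markup). The HTML and URLs are made up for illustration and do not reflect Yelp's real page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet that mimics a list of menu photos.
SAMPLE = """
<div id="super-container">
  <ul>
    <li><div><img src="https://example.com/menu-1.jpg" /></div></li>
    <li><div><img src="https://example.com/menu-2.jpg" /></div></li>
    <li><div><span>no image here</span></div></li>
  </ul>
</div>
""".strip()

root = ET.fromstring(SAMPLE)
# ElementTree has no @src step, so select the <img> elements and read the attribute.
menu_list = [img.get("src") for img in root.findall(".//ul/li/div/img")]
print(menu_list)
```

List entries without an img child are simply skipped, which is also how the spider's XPath behaves: only matching nodes contribute a URL.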

4. Download the images

scrapy crawl imageSpider
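
After the crawl finishes, one quick way to confirm the result is to count the files in each restaurant folder. This is a minimal sketch assuming the source/<image_name>/images/ layout produced by file_path() in section 3.2; the helper name count_downloaded_images is hypothetical.

```python
import os


def count_downloaded_images(images_store):
    """Map each restaurant folder under <images_store>/source to its image count."""
    counts = {}
    source_dir = os.path.join(images_store, "source")
    if not os.path.isdir(source_dir):
        return counts  # nothing has been downloaded yet
    for name in sorted(os.listdir(source_dir)):
        images_dir = os.path.join(source_dir, name, "images")
        if os.path.isdir(images_dir):
            files = [f for f in os.listdir(images_dir)
                     if os.path.isfile(os.path.join(images_dir, f))]
            counts[name] = len(files)
    return counts


if __name__ == "__main__":
    # Assumption: the same folder that IMAGES_STORE points to in settings.py.
    print(count_downloaded_images(os.path.join(os.path.expanduser("~"), "Desktop")))
```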


5. Project source code

https://github.com/GeeksChen/ScrapyDemo

##Note:
1. command not found: scrapy
https://blog.csdn.net/EDDYCJY/article/details/77482228
Locate the scrapy script under your Python bin directory and create a symlink to it in /usr/local/bin, for example:
ln -s /Users/xinyinhe/Library/Python/2.7/bin/scrapy /usr/local/bin/scrapy
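
When it is unclear where pip placed the scrapy script, the lookup can also be done from Python. The sketch below targets Python 3 (shutil.which does not exist in 2.7) and assumes a per-user install, where console scripts usually land in the bin folder under site.getuserbase(). The helper name locate_command is hypothetical.

```python
import os
import shutil
import site


def locate_command(name, extra_dirs=None):
    """Return the full path of an executable, searching PATH and then extra_dirs."""
    found = shutil.which(name)
    if found:
        return found
    if extra_dirs is None:
        # Typical location of scripts from a per-user pip install on macOS/Linux.
        extra_dirs = [os.path.join(site.getuserbase(), "bin")]
    for directory in extra_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    path = locate_command("scrapy")
    print(path or "scrapy not found; check that it is installed")
```

If a path is printed but the shell still reports command not found, that path is the one to use as the first argument of ln -s.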
