Digging into Sushi Delivery Data

Posted by 大志若魚吐吐泡泡 on 2019-02-26

A few months ago, with some time on my hands, I scraped delivery data on sushi (I will absolutely not admit that takeout was keeping me alive during that stretch). Now that I'm free again, I'm writing it up to share. This article is aimed at casual onlookers and anyone with a little bit of a foundation.

Tip: to speed up scraping, this crawler uses asynchronous coroutines. If you only need a small amount of data, friends, I don't recommend doing this, since you may get banned; you can rewrite it as ordinary synchronous code.
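
A minimal synchronous sketch of that alternative (gethtml_sync is my own name; pass it the same header dict defined in section 2.1 below):

import time
import requests

def gethtml_sync(url, header):
    # Same request as the async gethtml below, one URL at a time
    try:
        r = requests.get(url, headers=header, timeout=10)
        r.raise_for_status()
        time.sleep(0.5)  # be polite between requests to avoid a ban
        return r.json()
    except Exception as e:
        print(e)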

Following the ETL flow of data analysis, the little crawler breaks down as follows:

  1. First, prepare the following third-party Python packages:
import pandas as pd                     # data tables and analysis
import requests                         # synchronous HTTP (fallback)
import aiohttp                          # asynchronous HTTP client
import asyncio                          # event loop for the coroutines
from multiprocessing.pool import Pool   # process pool for parsing
from datetime import date
import pymysql                          # MySQL driver behind SQLAlchemy
from sqlalchemy import create_engine    # pandas-to-MySQL connection
import collections
  2. Next, pick a delivery platform to analyze. I went with ele.me, for one reason: it's simple! Simple! Simple!
    On ele.me you can find the data API directly with Chrome's web inspector; there aren't many anti-scraping tricks.
    Now, on to the main course~~~~
    2.1 First, define a function for requesting data:
async def gethtml(url):
    header = {
        'Accept': 'application/json, text/plain, */*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'www.ele.me',
        'Referer': 'https://www.ele.me/place/wsbrgts6d1ry?latitude=28.111704&longitude=113.011304',
        'x-shard': 'loc=113.011304,28.111704',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    }
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url=url, headers=header) as r:
                r.raise_for_status()  # raise on 4xx/5xx instead of silently parsing an error page
                data = await r.json()
                return data
    except Exception as e:
        print(e)

All subsequent data requests go through this function; since we're using asynchronous coroutines, it is defined with async.
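
As a quick sanity check, a single call can be driven through the event loop (the URL here is the first page of the sushi search defined later in the article):

url = ('https://www.ele.me/restapi/shopping/restaurants/search'
       '?extras%5B%5D=activity&keyword=%E5%AF%BF%E5%8F%B8'
       '&latitude=28.111704&limit=100&longitude=113.011304&offset=0&terminal=web')
loop = asyncio.get_event_loop()
data = loop.run_until_complete(gethtml(url))
print(type(data))  # dict parsed from JSON, or None if the request failed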

2.2 Next come the data-extraction functions:

def getshopid(html):
    shop_id = {i['restaurant']['id'] for i in html['restaurant_with_foods']}
    return shop_id


def geturl(ids):
    restaurant_url = {'https://www.ele.me/restapi/shopping/restaurant/%s?latitude=28.09515&longitude=113.012001&terminal=web' %
                      shop_id for shop_id in ids}
    foodurl = {'https://www.ele.me/restapi/shopping/v2/menu?restaurant_id=%s&terminal=web' %
               shop_id for shop_id in ids}
    return restaurant_url, foodurl

These two functions extract the shop IDs and build the shop-detail URLs. One thing to watch out for is deduplicating the extracted data; here a simple, brute-force set data structure does the job.
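
To see why a set is used: the same shop can appear on several result pages, and the set collapses the duplicates automatically. A toy illustration with made-up IDs:

pages = [
    {'restaurant_with_foods': [{'restaurant': {'id': 1001}},
                               {'restaurant': {'id': 1002}}]},
    {'restaurant_with_foods': [{'restaurant': {'id': 1002}}]},  # 1002 again
]
ids = set()
for page in pages:
    ids |= getshopid(page)
print(ids)  # {1001, 1002}: the duplicate appears only once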

2.3 With extraction done, the next step reloads the data with pandas for the final analysis:

def food_table(foodlists):
    foods = {(y['specfoods'][0]['restaurant_id'], y['name'], y['specfoods'][0]['price'],
              y['month_sales'], date.today().strftime('%Y-%m-%d'), date.today().strftime('%A'))
             for foodlist in foodlists for x in foodlist for y in x['foods']}
    return foods


def shop_table(shoplist):
    shop_detail = {(shop['id'], shop['name'], shop['distance'], shop['float_delivery_fee'],
                    shop['float_minimum_order_amount'], shop['rating'], shop['rating_count'])
                   for shop in shoplist}
    return shop_detail

These two functions build the food-detail table and the shop-detail table.

2.4 The last step is the analysis itself, handled with pandas. Here I use each shop's total monthly sales as a simple metric:

def join_table(shoptable, foodtable):
    shoptable = pd.DataFrame(list(shoptable), columns=['id', 'name', 'distance', 'delivery_fee', 'minimum_order_amount', 'rating', 'rating_count'])
    foodtable = pd.DataFrame(list(foodtable), columns=['id', 'fname', 'price', 'msale', 'date', 'weekday'])
    new = pd.merge(shoptable, foodtable, on='id')
    new['total'] = new['msale'] * new['price']
    group = new.groupby(['name', 'id'])
    return new, group.sum()
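
With the two tables joined, ranking shops by estimated monthly revenue is a one-liner (a sketch; shoptable and foodtable come from the pipeline above):

new, totals = join_table(shoptable, foodtable)
top10 = totals.sort_values('total', ascending=False).head(10)
print(top10['total'])  # ten highest-grossing shops for the month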

This step uses pandas in place of SQL for the processing; you could also store the data in MySQL and work on it there, with code like the following (detail and k come from the run loop in step 4):

connect = create_engine('mysql+pymysql://root:12345678@localhost:3306/waimai?charset=utf8')
pd.io.sql.to_sql(frame=detail, name=k, con=connect, if_exists='append')
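
Reading it back out of MySQL later is just as short (a sketch; the table name sushi matches one of the keys in lists from step 4):

connect = create_engine('mysql+pymysql://root:12345678@localhost:3306/waimai?charset=utf8')
df = pd.read_sql('SELECT * FROM sushi', con=connect)  # table named after the lists key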
  3. With all the processing functions defined, we can write the main function:
async def main(name):
    pool = Pool(8)
    htasks = [asyncio.ensure_future(gethtml(url)) for url in name]
    htmls = await asyncio.gather(*htasks)
    # Merge the shop IDs from every result page; the set union also deduplicates
    ids = set().union(*(getshopid(html) for html in htmls if html))
    restaurant_url, food_url = geturl(ids)
    print('async crawl...')
    shoptasks = [asyncio.ensure_future(gethtml(url)) for url in restaurant_url]
    foodtasks = [asyncio.ensure_future(gethtml(url)) for url in food_url]
    fdone, fpending = await asyncio.wait(foodtasks)
    sdone, spending = await asyncio.wait(shoptasks)
    shoplist = [task.result() for task in sdone]
    foodlist = [task.result() for task in fdone]
    print('distribute parse...')
    # Hand the CPU-bound parsing to a process pool
    sparse_jobs = [pool.apply_async(shop_table, args=(shoplist,))]
    fparse_jobs = [pool.apply_async(food_table, args=(foodlist,))]
    shoptable = [x.get() for x in sparse_jobs][0]
    foodtable = [x.get() for x in fparse_jobs][0]
    new, result = join_table(shoptable, foodtable)

    return new, result
  4. One final move: run the main function:
while len(lists) > 0:
    for k, v in list(lists.items()):
        try:
            loop = asyncio.get_event_loop()
            tasks = asyncio.ensure_future(main(v))
            loop.run_until_complete(tasks)
            detail, totals = tasks.result()

            lists.pop(k)
            print('done:{}'.format(k))
        except KeyError:
            print('fail:{}'.format(k))
        else:
            connect = create_engine('mysql+pymysql://root:12345678@localhost:3306/waimai?charset=utf8')
            pd.io.sql.to_sql(frame=detail, name=k, con=connect, if_exists='append')

Because the code is asynchronous, it has to run inside an event loop. The lists variable holds the delivery-shop searches for the areas you want to cover; a few example lists:

wuyisquare = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%92%96%E5%95%A1&latitude=28.19652&limit=100&longitude=112.977361&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
sushi = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%AF%BF%E5%8F%B8&latitude=28.111704&limit=100&longitude=113.011304&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
yangqi = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E8%8C%B6&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
tea = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%92%96%E5%95%A1&latitude=28.09515&limit=100&longitude=113.012001&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
fen = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%AD%92%E5%AD%90%E9%AA%A8%E7%B2%89&latitude=28.111704&limit=100&longitude=113.011304&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
gaosheng = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%B2%89&latitude=28.09515&limit=100&longitude=113.012001&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
fangcun = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E6%96%B9%E5%AF%B8%E5%AF%BF%E5%8F%B8&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
luoyide = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%BD%97%E4%B9%89%E5%BE%B7&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
lists = {'sushi': sushi, 'tea': tea, 'fen': fen, 'gaosheng': gaosheng, 'luoyide': luoyide, 'fangcun': fangcun}

To search your own area, just swap out keyword, latitude, and longitude in the URL. Coordinates can be obtained from any of the map APIs; I won't advertise one here.
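
If you rebuild these URLs often, a small helper keeps them readable. A sketch (make_search_urls is my own name; the offsets mirror the range(0, 120, 24) used above):

from urllib.parse import quote

def make_search_urls(keyword, latitude, longitude, pages=5, page_size=24):
    # One search URL per result page around the given coordinates
    base = ('https://www.ele.me/restapi/shopping/restaurants/search'
            '?extras%5B%5D=activity&keyword={kw}&latitude={lat}'
            '&limit=100&longitude={lng}&offset={off}&terminal=web')
    return [base.format(kw=quote(keyword), lat=latitude, lng=longitude, off=off)
            for off in range(0, pages * page_size, page_size)]

sushi = make_search_urls('寿司', 28.111704, 113.011304)  # same as the sushi list above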

This crawler exercises a few fundamentals: asynchronous requests, set-based deduplication, and writing to a database through pandas. It's good practice material, and the data itself is worth mining at your leisure; there are some interesting angles.
For example, how monthly sales relate to various dimensions, shown with scatter plots, bar charts, or calendar heatmaps:

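For instance, a scatter of price against monthly sales takes only a few lines of matplotlib (a sketch; new is the merged table returned by join_table):

import matplotlib.pyplot as plt

plt.scatter(new['price'], new['msale'], alpha=0.5)  # one point per dish
plt.xlabel('price')
plt.ylabel('monthly sales')
plt.title('Price vs. monthly sales')
plt.show()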

Next time I'll play around with WeChat and QQ bots. Stay tuned~~~~~

I write these articles as a reference for friends who are just getting started; stars and likes are very welcome. And a quick plug: my face-score calculator mini program, source code here.

