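Digging Into Sushi Takeout Data

Posted by 大志若魚吐吐泡泡 on 2019-02-26

Original URL: https://flycode.co/archives/266769

A few months ago, in my spare time, I scraped some sushi takeout data (I will not admit that I was living on takeout back then). Now that I have a moment, here is the write-up. This post is for curious onlookers and for readers with a little programming background.

Tip: to crawl faster, this crawler uses asynchronous coroutines. If you only need a small amount of data, I do not recommend this approach, because you will get blocked; rewrite it as ordinary synchronous code instead.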
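If you want to keep the coroutines but crawl at a gentler pace, one middle ground is to cap how many requests are in flight at once. The sketch below is an assumption rather than part of the original crawler: `bounded_gather` and its `limit` parameter are hypothetical names, and `fetch` stands for any coroutine function that takes a URL, such as the request function defined later in this post.

```python
import asyncio


async def bounded_gather(fetch, urls, limit=5):
    """Run fetch(url) for every url with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        # Wait for a free slot before starting the request.
        async with semaphore:
            return await fetch(url)

    # gather returns the results in the same order as urls.
    return await asyncio.gather(*(guarded(url) for url in urls))
```

Inside a coroutine this would be called as `await bounded_gather(gethtml, urls, limit=5)`. A suitable limit depends on how tolerant the site is, so treat 5 as a placeholder.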

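Following the ETL flow used in data analysis, here is a walkthrough of this small crawler.

First, import the packages below. pandas, requests, aiohttp, pymysql and sqlalchemy are third-party and need to be installed; the rest ship with Python: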

import pandas as pd
import requests
import aiohttp
import asyncio
from multiprocessing.pool import Pool
from datetime import date
import pymysql
from sqlalchemy import create_engine
import collections

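Next, pick a takeout platform to analyze. I chose ele.me, for one reason: it is simple! Simple! Simple! On ele.me you can find the data endpoints directly with Chrome's developer tools, and there is not much anti-scraping trickery. Now for the main course.

2.1 First, define a function that requests the data: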

async def gethtml(url):
    header = {
        'Accept': 'application/json, text/plain, */*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'www.ele.me',
        'Referer': 'https://www.ele.me/place/wsbrgts6d1ry?latitude=28.111704&longitude=113.011304',
        'x-shard': 'loc=113.011304,28.111704',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    }
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url=url, headers=header) as r:
                # Raise on 4xx/5xx responses so they are handled below.
                r.raise_for_status()
                return await r.json()
    except Exception as e:
        # Log the failure and return None so the caller can skip this URL.
        print(e)
        return None

All later data requests go through this function. It is defined with async because the crawler runs on asynchronous coroutines.
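For small jobs, the tip at the top recommends ordinary synchronous code. A minimal sketch of what that could look like with requests follows. The names `gethtml_sync` and `crawl_sync`, the `delay` and `get` parameters, and the 10-second timeout are assumptions, not part of the original crawler; `get` exists only so the functions can be tried with a stub instead of a live request.

```python
import time


def gethtml_sync(url, get=None, headers=None):
    """Blocking counterpart of gethtml: return parsed JSON, or None on failure."""
    if get is None:
        import requests  # imported lazily so a stub can be used without it
        get = requests.get
    try:
        r = get(url, headers=headers, timeout=10)
        r.raise_for_status()  # raise on 4xx/5xx responses
        return r.json()
    except Exception as e:
        print(e)
        return None


def crawl_sync(urls, delay=0.5, get=None, headers=None):
    """Fetch urls one at a time, pausing `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)  # be polite: no pause before the first request
        results.append(gethtml_sync(url, get=get, headers=headers))
    return results
```

The pause between requests is what keeps the load on the site low; half a second is a guess, not a measured safe value.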

2.2 Next come the data extraction functions:

def getshopid(html):
    # A set comprehension keeps each shop id once, even if the page repeats it.
    shop_id = {i['restaurant']['id'] for i in html['restaurant_with_foods']}
    return shop_id


def geturl(ids):
    # Build one shop-detail URL and one menu URL per shop id.
    restaurant_url = {'https://www.ele.me/restapi/shopping/restaurant/%s?latitude=28.09515&longitude=113.012001&terminal=web' %
                      shop_id for shop_id in ids}
    foodurl = {'https://www.ele.me/restapi/shopping/v2/menu?restaurant_id=%s&terminal=web' %
               shop_id for shop_id in ids}
    return restaurant_url, foodurl

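The first function collects the shop ids, and the second builds the shop-detail and menu URLs from them. The thing to watch when extracting is duplicates; here I remove them the simple, brute-force way, with a set.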
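To see the deduplication at work, here is a small self-contained sketch. The two payloads are made up and contain only the fields that getshopid reads, and `merge_shop_ids` is a hypothetical helper, not part of the original crawler. It treats a failed page (None) as empty and unions the per-page sets, so a shop that shows up on several pages is kept once.

```python
def getshopid(html):
    # Same extraction as in the post: a set comprehension drops repeated ids.
    return {i['restaurant']['id'] for i in html['restaurant_with_foods']}


def merge_shop_ids(pages):
    """Union the shop ids of every page that loaded (hypothetical helper)."""
    merged = set()
    for page in pages:
        if page:  # skip pages whose request failed
            merged |= getshopid(page)
    return merged


# Made-up search results; shop 102 appears on both pages.
page1 = {'restaurant_with_foods': [{'restaurant': {'id': 101}}, {'restaurant': {'id': 102}}]}
page2 = {'restaurant_with_foods': [{'restaurant': {'id': 102}}, {'restaurant': {'id': 103}}]}

print(sorted(merge_shop_ids([page1, None, page2])))
```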

2.3 With the data extracted, the next step is to reshape it into rows that pandas can load for the final analysis:

def food_table(foodlists):
    # One row per food: (shop id, food name, price, monthly sales, date, weekday).
    today = date.today()
    day, weekday = today.strftime('%Y-%m-%d'), today.strftime('%A')
    foods = {(y['specfoods'][0]['restaurant_id'], y['name'], y['specfoods'][0]['price'],
              y['month_sales'], day, weekday)
             for foodlist in foodlists for x in foodlist for y in x['foods']}
    return foods


def shop_table(shoplist):
    # One row per shop: (id, name, distance, delivery fee, minimum order, rating, rating count).
    shop_detail = {(shop['id'], shop['name'], shop['distance'], shop['float_delivery_fee'],
                    shop['float_minimum_order_amount'], shop['rating'], shop['rating_count'])
                   for shop in shoplist}
    return shop_detail

The two functions build the food detail table and the shop detail table, respectively.
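To illustrate the shape of the rows, the sketch below runs the same comprehension as food_table on a made-up menu. The payload is hypothetical and trimmed to the fields the function reads; a real response carries many more. The function is renamed `food_rows` and takes the date as a parameter, which is a change from the original made only so the output is reproducible.

```python
from datetime import date


def food_rows(foodlists, today):
    """Flatten menus into (shop id, food name, price, monthly sales, date, weekday) rows."""
    day, weekday = today.strftime('%Y-%m-%d'), today.strftime('%A')
    return {
        (food['specfoods'][0]['restaurant_id'], food['name'],
         food['specfoods'][0]['price'], food['month_sales'], day, weekday)
        for menu in foodlists          # one menu per shop
        for category in menu           # a menu is a list of categories
        for food in category['foods']  # each category lists its foods
    }


# Made-up menu for one shop with a single category.
menu = [{'foods': [
    {'name': 'salmon roll', 'month_sales': 30,
     'specfoods': [{'restaurant_id': 101, 'price': 18.0}]},
    {'name': 'tuna nigiri', 'month_sales': 12,
     'specfoods': [{'restaurant_id': 101, 'price': 25.0}]},
]}]

for row in sorted(food_rows([menu], date(2019, 2, 26))):
    print(row)
```

Because the rows go into a set, two identical rows collapse into one, which is the same deduplication trick used for the shop ids.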

2.4 The last step is the analysis, done with pandas. The metric here is simply each shop's total monthly sales revenue:

def join_table(shoptable, foodtable):
    shoptable = pd.DataFrame(list(shoptable), columns=['id', 'name', 'distance', 'delivery_fee', 'minimum_order_amount', 'rating', 'rating_count'])
    foodtable = pd.DataFrame(list(foodtable), columns=['id', 'fname', 'price', 'msale', 'date', 'weekday'])
    # Attach each food row to its shop, then compute that food's monthly revenue.
    new = pd.merge(shoptable, foodtable, on='id')
    new['total'] = new['msale'] * new['price']
    group = new.groupby(['name', 'id'])
    # Sum only the numeric columns; recent pandas would otherwise concatenate the text ones.
    return new, group.sum(numeric_only=True)

This step uses pandas in place of SQL. You can also store the data in MySQL and process it there, with code like this:

connect = create_engine('mysql+pymysql://root:12345678@localhost:3306/waimai?charset=utf8')
# detail is the merged DataFrame and k the keyword name; both come from the run loop below.
detail.to_sql(name=k, con=connect, if_exists='append')
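If you take the database route, the per-shop total becomes a JOIN plus a GROUP BY. The sketch below uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL, purely so that it runs without a server. The table names `shop` and `food` and the sample rows are made up, while the column names follow the DataFrames in join_table.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for the MySQL database
conn.execute('CREATE TABLE shop (id INTEGER, name TEXT)')
conn.execute('CREATE TABLE food (id INTEGER, fname TEXT, price REAL, msale INTEGER)')
conn.executemany('INSERT INTO shop VALUES (?, ?)', [(101, 'Shop A'), (102, 'Shop B')])
conn.executemany('INSERT INTO food VALUES (?, ?, ?, ?)', [
    (101, 'salmon roll', 18.0, 30),
    (101, 'tuna nigiri', 25.0, 12),
    (102, 'eel bowl', 32.0, 10),
])

# Monthly revenue per shop: sum of price * monthly sales over its foods.
query = '''
    SELECT s.name, s.id, SUM(f.price * f.msale) AS total
    FROM shop AS s JOIN food AS f ON s.id = f.id
    GROUP BY s.name, s.id
    ORDER BY total DESC
'''
totals = conn.execute(query).fetchall()
print(totals)
```

The query should carry over to MySQL largely unchanged, though that is untested here.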

With all the processing functions defined, we can write the main function:

async def main(name):
    # Fetch every search page for this keyword concurrently.
    htasks = [asyncio.ensure_future(gethtml(url)) for url in name]
    htmls = await asyncio.gather(*htasks)
    # Merge the shop ids from all pages that loaded; the set union removes duplicates.
    ids = set().union(*(getshopid(html) for html in htmls if html))
    restaurant_url, food_url = geturl(ids)
    print('async crawl...')
    # Start the detail and menu requests together, then collect both sets of results.
    shoptasks = [asyncio.ensure_future(gethtml(url)) for url in restaurant_url]
    foodtasks = [asyncio.ensure_future(gethtml(url)) for url in food_url]
    # Drop failed requests, for which gethtml returns None.
    shoplist = [r for r in await asyncio.gather(*shoptasks) if r]
    foodlist = [r for r in await asyncio.gather(*foodtasks) if r]
    print('distribute parse....')
    # Parse in worker processes; the with-block shuts the pool down afterwards.
    with Pool(8) as pool:
        shoptable = pool.apply_async(shop_table, args=(shoplist,)).get()
        foodtable = pool.apply_async(food_table, args=(foodlist,)).get()
    new, result = join_table(shoptable, foodtable)

    return new, result

The final move is to run the main function:

connect = create_engine('mysql+pymysql://root:12345678@localhost:3306/waimai?charset=utf8')
# Keep going until every keyword has been crawled. A KeyError means a response
# lacked an expected field, so that keyword stays in lists and is retried.
while len(lists) > 0:
    for k, v in list(lists.items()):
        try:
            # asyncio.run creates an event loop, runs main to completion, and closes the loop.
            detail, totals = asyncio.run(main(v))
        except KeyError:
            print('fail:{}'.format(k))
        else:
            lists.pop(k)
            print('done:{}'.format(k))
            detail.to_sql(name=k, con=connect, if_exists='append')

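Because the code is asynchronous, it has to run inside an event loop. The lists dict used there maps each name to the search URLs for the area and keyword you want to crawl. Here are a few example lists: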

wuyisquare=['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%92%96%E5%95%A1&latitude=28.19652&limit=100&longitude=112.977361&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
sushi = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%AF%BF%E5%8F%B8&latitude=28.111704&limit=100&longitude=113.011304&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
yangqi = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E8%8C%B6&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
tea = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%92%96%E5%95%A1&latitude=28.09515&limit=100&longitude=113.012001&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
fen = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%AD%92%E5%AD%90%E9%AA%A8%E7%B2%89&latitude=28.111704&limit=100&longitude=113.011304&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
gaosheng = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%B2%89&latitude=28.09515&limit=100&longitude=113.012001&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
fangcun = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E6%96%B9%E5%AF%B8%E5%AF%BF%E5%8F%B8&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
luoyide = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%BD%97%E4%B9%89%E5%BE%B7&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
lists={'sushi':sushi,'tea':tea,'fen':fen,'gaosheng':gaosheng,'luoyide':luoyide,'fangcun':fangcun}

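To search an area of your own, just replace keyword, latitude and longitude in the URL. Latitude and longitude can be obtained from any map API; I won't advertise one here.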
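Rather than editing the long URL strings by hand, the substitution can be wrapped in a small helper. This is a sketch under assumptions: `build_search_urls` and `SEARCH_ENDPOINT` are hypothetical names, while the endpoint, the parameter names and the offsets are copied from the example lists. The keyword in the example is the simplified-character form that the original sushi list encodes.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://www.ele.me/restapi/shopping/restaurants/search'


def build_search_urls(keyword, latitude, longitude, offsets=range(0, 120, 24)):
    """Return one search URL per offset for the given keyword and location."""
    urls = []
    for offset in offsets:
        params = [
            ('extras[]', 'activity'),
            ('keyword', keyword),      # urlencode percent-encodes it as UTF-8
            ('latitude', latitude),
            ('limit', 100),
            ('longitude', longitude),
            ('offset', offset),
            ('terminal', 'web'),
        ]
        urls.append(SEARCH_ENDPOINT + '?' + urlencode(params))
    return urls


sushi = build_search_urls('寿司', '28.111704', '113.011304')
print(sushi[0])
```

Passing the coordinates as strings keeps their digits exactly as typed, so the generated URLs match the handwritten ones character for character.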

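This crawler uses asynchronous requests, set-based deduplication, and pandas' synchronous database writes, all basic skills, so it makes good practice. As for the value of the data, that is yours to dig out, and it is rather fun: for example, how monthly sales relate to other dimensions, shown as scatter plots, bar charts, or calendar heat maps.

Next time I'll play with WeChat and QQ bots. Stay tuned!

I write these posts as a reference for friends who are just getting started. Stars and likes are welcome. And a quick plug: the looks-score calculator mini program, source code here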
