26 Comparing the Number and Calories of Each Beverage Category in a Coffee Shop

Posted by weixin_33913332 on 2018-07-24

Original URL: https://blog.csdn.net/weixin_33913332/article/details/88270971

Structure of Series data

DataFrame

Group-by operations
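
The three topics above can be sketched in a few lines. This is only a minimal illustration with made-up menu values (the names `calories`, `menu` and `mean_by_category` are hypothetical and do not come from the original data set):

```python
import pandas as pd

# A Series is a one-dimensional labeled array: values plus an index.
calories = pd.Series([5, 110, 250], index=["Coffee", "Latte", "Mocha"], name="Calories")

# A DataFrame is a two-dimensional table; each column is itself a Series.
menu = pd.DataFrame({
    "category": ["Coffee", "Espresso", "Espresso"],
    "Calories": [5, 110, 250],
})

# groupby splits the rows by key; the aggregation is then applied per group.
mean_by_category = menu.groupby("category")["Calories"].mean()

print(calories["Latte"])
print(type(menu["Calories"]).__name__)
print(mean_by_category)
```

Selecting one column of a DataFrame gives back a Series, which is why the script below can call `unique()` directly on `data_df['Beverage_category']`.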

Data source and format

The pandas data set can be downloaded from: https://video.mugglecode.com/data_pd

This analysis uses only columns A and D, that is, Beverage_category and Calories.
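
Since only two columns matter, one option is to load just those two. The sketch below assumes the column names `Beverage_category` and `Calories` reported by `info()` later in this post; the three CSV rows are invented stand-ins for the real file, and `sample_csv` and `subset_df` are hypothetical names:

```python
import io
import pandas as pd

# Invented three-row excerpt standing in for coffee_menu.csv.
sample_csv = io.StringIO(
    "Beverage_category,Beverage,Beverage_prep,Calories\n"
    "Coffee,Brewed Coffee,Short,3\n"
    "Coffee,Brewed Coffee,Tall,4\n"
    "Classic Espresso Drinks,Caffe Latte,Short Nonfat Milk,70\n"
)

# usecols restricts parsing to the named columns.
subset_df = pd.read_csv(sample_csv, usecols=["Beverage_category", "Calories"])

print(subset_df.columns.tolist())
print(subset_df.shape)
```

The script in this post reads the whole file instead, which is perfectly fine for 242 rows.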

Code

import matplotlib.pyplot as plt
import pandas as pd
import os

# Compare the calories of each beverage category on a coffee shop menu

file_path = '/Users/miraco/PycharmProjects/DataMining/data_pd/coffee_menu.csv'
outpath = './coffee_stat/output'

# Unlike os.mkdir, os.makedirs creates directories recursively, so any missing intermediate directories are created as well
if not os.path.exists(outpath):
    os.makedirs(outpath)

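def collect_data():
    data_df = pd.read_csv(file_path)  # read_csv returns a DataFrame (a two-dimensional table)
    return data_df

def inspect_data(data_df):
    # When a column contains noisy values, pandas plays it safe and reads it as the object dtype
    print(f'The data has {data_df.shape[0]} rows and {data_df.shape[1]} columns')
    print('-----------------------------------------------------')
    print('Data preview:')
    # To take a look without printing everything, data_df.head() shows only the first few rows
    print(data_df.head())
    print('-----------------------------------------------------')
    print('Basic information about the data:')
    # data_df.info() shows the dtypes: strings are stored as object, numbers are read as float or int automatically
    # info() prints its report itself and returns None, so the print() wrapper also emits a trailing 'None'
    print(data_df.info())
    print('-----------------------------------------------------')
    print('Summary statistics')
    # mean, max, min and so on
    print(data_df.describe())
    print('-----------------------------------------------------')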

def analyze_data(data_df):
    bever_cate_col = data_df['Beverage_category']  # a Series (one column), selected by its column name

    bever_cates = bever_cate_col.unique()  # deduplicate the column to get its unique values
    print('Beverage categories')
    print(bever_cates)
    print('-----------------------------------------------------')
    cate_grouped = data_df.groupby('Beverage_category')  # returns a DataFrameGroupBy object, not a DataFrame
    cate_count = cate_grouped['Calories'].count()
    cate_mean_cal = cate_grouped['Calories'].mean()

    return cate_count, cate_mean_cal


def save_and_show_results(cate_count,cate_mean_cal):
    cate_count.to_csv(os.path.join(outpath,'cate_count.csv'))
    cate_mean_cal.to_csv(os.path.join(outpath, 'cate_mean_cal.csv'))

    cate_count.plot(kind = 'bar')   # plots directly; no explicit plt.figure() call is needed
    plt.title('Category Count')
    plt.tight_layout()
    plt.savefig(os.path.join(outpath, 'category_count.png'))
    plt.show()

    cate_mean_cal.plot(kind = 'bar')
    plt.title('Category Average Cals')
    plt.tight_layout()
    plt.savefig(os.path.join(outpath, 'category_ave_cals.png'))
    plt.show()

def main():
    # Load the data
    data_df = collect_data()

    # Inspect the data
    inspect_data(data_df)

    # Analyze
    cate_count, cate_mean_cal = analyze_data(data_df)

    # Save and display the results
    save_and_show_results(cate_count, cate_mean_cal)

if __name__ == '__main__':
    main()
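
The script computes the count and the mean with two separate calls on the same grouping. As a possible variation (not what the original script does), `agg` can return both statistics in one table. The rows below are invented for illustration, and `cate_stats` is a hypothetical name:

```python
import pandas as pd

# Invented rows; the real script reads coffee_menu.csv instead.
data_df = pd.DataFrame({
    "Beverage_category": ["Coffee", "Coffee", "Smoothies", "Smoothies", "Smoothies"],
    "Calories": [3, 5, 260, 280, 300],
})

# One grouping, two aggregations side by side as columns.
cate_stats = data_df.groupby("Beverage_category")["Calories"].agg(["count", "mean"])

print(cate_stats)
```

The result is a DataFrame indexed by category with one column per statistic, so `cate_stats['count']` and `cate_stats['mean']` correspond to the two Series returned by `analyze_data`.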

Output

The data has 242 rows and 18 columns
-----------------------------------------------------
Data preview:
         Beverage_category      ...      Caffeine (mg)
0                   Coffee      ...                175
1                   Coffee      ...                260
2                   Coffee      ...                330
3                   Coffee      ...                410
4  Classic Espresso Drinks      ...                 75

[5 rows x 18 columns]
-----------------------------------------------------
Basic information about the data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242 entries, 0 to 241
Data columns (total 18 columns):
Beverage_category            242 non-null object
Beverage                     242 non-null object
Beverage_prep                242 non-null object
Calories                     242 non-null int64
 Total Fat (g)               242 non-null object
Trans Fat (g)                242 non-null float64
Saturated Fat (g)            242 non-null float64
 Sodium (mg)                 242 non-null int64
 Total Carbohydrates (g)     242 non-null int64
Cholesterol (mg)             242 non-null int64
 Dietary Fibre (g)           242 non-null int64
 Sugars (g)                  242 non-null int64
 Protein (g)                 242 non-null float64
Vitamin A (% DV)             242 non-null object
Vitamin C (% DV)             242 non-null object
 Calcium (% DV)              242 non-null object
Iron (% DV)                  242 non-null object
Caffeine (mg)                241 non-null object
dtypes: float64(3), int64(6), object(9)
memory usage: 34.1+ KB
None
-----------------------------------------------------
Summary statistics
         Calories  Trans Fat (g)       ...         Sugars (g)   Protein (g) 
count  242.000000      242.000000      ...         242.000000     242.000000
mean   193.871901        1.307025      ...          32.962810       6.978512
std    102.863303        1.640259      ...          19.730199       4.871659
min      0.000000        0.000000      ...           0.000000       0.000000
25%    120.000000        0.100000      ...          18.000000       3.000000
50%    185.000000        0.500000      ...          32.000000       6.000000
75%    260.000000        2.000000      ...          43.750000      10.000000
max    510.000000        9.000000      ...          84.000000      20.000000

[8 rows x 9 columns]
-----------------------------------------------------
Beverage categories
['Coffee' 'Classic Espresso Drinks' 'Signature Espresso Drinks'
 'Tazo® Tea Drinks' 'Shaken Iced Beverages' 'Smoothies'
 'Frappuccino® Blended Coffee' 'Frappuccino® Light Blended Coffee'
 'Frappuccino® Blended Crème']
-----------------------------------------------------

Surprisingly, the bars in the chart are colored automatically.
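
One more thing worth noticing in the `info()` output above: several numeric-looking columns, such as ` Total Fat (g)`, were read as object, and some column names start with a space. The analysis in this post does not need those columns, but if it did, a cleanup along these lines might help. This is a sketch under assumptions: the frame and the stray value `"3 2"` are invented, and `raw_df` is a hypothetical name.

```python
import pandas as pd

# Invented frame that mimics the symptoms: a padded column name and one noisy value.
raw_df = pd.DataFrame({
    " Total Fat (g)": ["0.1", "3 2", "6"],
    "Calories": [3, 70, 120],
})

# Strip surrounding whitespace from the column names.
raw_df.columns = raw_df.columns.str.strip()

# Convert to numbers; values that cannot be parsed become NaN instead of raising.
raw_df["Total Fat (g)"] = pd.to_numeric(raw_df["Total Fat (g)"], errors="coerce")

print(raw_df.dtypes)
print(raw_df)
```

With `errors="coerce"` the unparseable entry is turned into a missing value, so it is worth checking how many NaN values appear before trusting any statistics on the cleaned column.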

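Exercise

Visualize PM2.5 values with a bar chart

Requirements:
Use pandas to inspect the basic information of the data file
Use pandas to analyze and visualize the data
Data download URL: https://video.mugglecode.com/Beijing_PM.csv
Beijing_PM.csv contains hourly PM2.5 readings for Beijing from 2013 to 2015. Each row is one hour of data.
There are 7 columns:

year: year, 2013-2015
month: month, 1-12
day: day of the month, 1-31
hour: hour, 0-23
season: season, 1-4
PM_China: PM2.5 value measured by China's Ministry of Environmental Protection
PM_US: PM2.5 value measured by the US Embassy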
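
Because the date is spread over four columns, it can be handy to combine them into a single timestamp. The exercise itself does not require this; the sketch below uses a few invented rows in the layout described above, and `pm_df` and `timestamp` are hypothetical names:

```python
import pandas as pd

# Invented rows following the column layout of Beijing_PM.csv.
pm_df = pd.DataFrame({
    "year": [2013, 2013, 2014],
    "month": [1, 1, 6],
    "day": [1, 1, 15],
    "hour": [0, 1, 23],
    "PM_US": [31.0, 32.0, 58.0],
})

# pd.to_datetime can assemble timestamps from columns named year, month, day, hour.
pm_df["timestamp"] = pd.to_datetime(pm_df[["year", "month", "day", "hour"]])

print(pm_df[["timestamp", "PM_US"]])
```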

# -*- coding: utf-8 -*-

import pandas as pd
import matplotlib.pyplot as plt
import os

file_path = '/Users/miraco/PycharmProjects/DataMining/data_pd/Beijing_PM.csv'
outpath = './coffee_stat/output'

# Unlike os.mkdir, os.makedirs creates directories recursively, so any missing intermediate directories are created as well
if not os.path.exists(outpath):
    os.makedirs(outpath)

# 1. Read the CSV data file
data_df = pd.read_csv(file_path)

# 2. Inspect the basic information of the data file
print('Data preview:')
print(data_df.head())

print('Basic information about the data file:')
# info() prints its report itself and returns None, so print() also emits a trailing 'None'
print(data_df.info())

print('Summary statistics of the data:')
print(data_df.describe())

# 3. Group by the year column and aggregate
year_average_pm = data_df.groupby('year')['PM_China'].mean()

# 4. Save the result
year_average_pm.to_csv(os.path.join(outpath,'year_average_pm.csv'))

# 5. Draw the bar chart
year_average_pm.plot(kind='bar')
plt.tight_layout()
plt.show()

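Note that some columns contain NA values (PM_China has only 20508 non-null entries out of 26280 rows); these are skipped when the statistics are computed.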
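
To see what skipping NA means for a grouped mean, here is a small sketch with invented readings (`pm_df` and `grouped` are hypothetical names). It contrasts `mean()`, `count()` and `size()` on the same groups:

```python
import numpy as np
import pandas as pd

# Invented readings with one gap in 2013.
pm_df = pd.DataFrame({
    "year": [2013, 2013, 2014, 2014],
    "PM_China": [np.nan, 100.0, 50.0, 70.0],
})

grouped = pm_df.groupby("year")["PM_China"]

print(grouped.mean())   # NaN rows are left out of each group's mean
print(grouped.count())  # number of non-missing readings per year
print(grouped.size())   # number of rows per year, missing or not
```

Comparing `count()` with `size()` shows how many readings each yearly average is actually based on, which matters when one year has far more gaps than another.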

Output

Data preview:
   year  month  day  hour  season  PM_China  PM_US
0  2013      1    1     0       4       NaN   31.0
1  2013      1    1     1       4       NaN   32.0
2  2013      1    1     2       4       NaN   21.0
3  2013      1    1     3       4       NaN   16.0
4  2013      1    1     4       4       NaN   15.0
Basic information about the data file:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26280 entries, 0 to 26279
Data columns (total 7 columns):
year        26280 non-null int64
month       26280 non-null int64
day         26280 non-null int64
hour        26280 non-null int64
season      26280 non-null int64
PM_China    20508 non-null float64
PM_US       25970 non-null float64
dtypes: float64(2), int64(5)
memory usage: 1.4 MB
None
Summary statistics of the data:
               year         month      ...           PM_China         PM_US
count  26280.000000  26280.000000      ...       20508.000000  25970.000000
mean    2014.000000      6.526027      ...          92.560806     94.094686
std        0.816512      3.447917      ...          88.027434     93.806554
min     2013.000000      1.000000      ...           3.000000      1.000000
25%     2013.000000      4.000000      ...          28.000000     27.000000
50%     2014.000000      7.000000      ...          68.000000     66.000000
75%     2015.000000     10.000000      ...         127.000000    126.000000
max     2015.000000     12.000000      ...         672.000000    886.000000

[8 rows x 7 columns]

This chart is also colored automatically.

Key statements

# Read the CSV data file
data_df = pd.read_csv('/Users/miraco/PycharmProjects/DataMining/data_pd/Beijing_PM.csv')

# Group by the year column and aggregate
year_average_pm = data_df.groupby('year')['PM_China'].mean()

# Save the result
year_average_pm.to_csv(os.path.join(outpath,'year_average_pm.csv'))

# Draw the bar chart
year_average_pm.plot(kind='bar')
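
The same pattern extends to both measurement columns at once. As a possible follow-up (not part of the original exercise), selecting a list of columns after `groupby` gives one yearly average per column. The rows are invented and `year_average_both` is a hypothetical name:

```python
import numpy as np
import pandas as pd

# Invented rows in the layout of Beijing_PM.csv.
pm_df = pd.DataFrame({
    "year": [2013, 2013, 2014, 2014],
    "PM_China": [np.nan, 90.0, 60.0, 80.0],
    "PM_US": [31.0, 33.0, 64.0, 76.0],
})

# A list of column names after groupby yields a DataFrame with one mean per column.
year_average_both = pm_df.groupby("year")[["PM_China", "PM_US"]].mean()

print(year_average_both)
```

Calling `year_average_both.plot(kind='bar')` on such a table draws the two columns as side-by-side bars for each year, which makes the two sources easy to compare.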
