Introduction
To get more practice applying pandas to real data analysis, this article walks through an analysis of a US restaurant rating dataset.
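About the restaurant rating data
The data comes from the UCI ML Repository. It contains a little over a thousand records with 5 attributes:
userID: user ID
placeID: restaurant ID
rating: overall rating
food_rating: food rating
service_rating: service rating
We read the data with pandas: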
import pandas as pd
path = '../data/restaurant_rating_final.csv'
df = pd.read_csv(path)
df
| | userID | placeID | rating | food_rating | service_rating |
|---|---|---|---|---|---|
| 0 | U1077 | 135085 | 2 | 2 | 2 |
| 1 | U1077 | 135038 | 2 | 2 | 1 |
| 2 | U1077 | 132825 | 2 | 2 | 2 |
| 3 | U1077 | 135060 | 1 | 2 | 2 |
| 4 | U1068 | 135104 | 1 | 1 | 2 |
| ... | ... | ... | ... | ... | ... |
| 1156 | U1043 | 132630 | 1 | 1 | 1 |
| 1157 | U1011 | 132715 | 1 | 1 | 0 |
| 1158 | U1068 | 132733 | 1 | 1 | 0 |
| 1159 | U1068 | 132594 | 1 | 1 | 1 |
| 1160 | U1068 | 132660 | 0 | 0 | 0 |
1161 rows × 5 columns
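Before aggregating, it can help to confirm the range of the rating columns and how many distinct users and restaurants are present. The sketch below is only an illustration: it uses a small hand-made frame (five rows copied from the output above) instead of the CSV so that it runs on its own, and the names `sample`, `rating_cols`, and `value_range` are introduced here, not taken from the original.

```python
import pandas as pd

# A few rows copied from the output above; a stand-in for the full CSV.
sample = pd.DataFrame({
    "userID": ["U1077", "U1077", "U1077", "U1077", "U1068"],
    "placeID": [135085, 135038, 132825, 135060, 135104],
    "rating": [2, 2, 2, 1, 1],
    "food_rating": [2, 2, 2, 2, 1],
    "service_rating": [2, 1, 2, 2, 2],
})

rating_cols = ["rating", "food_rating", "service_rating"]

# Smallest and largest value seen in each rating column.
value_range = sample[rating_cols].agg(["min", "max"])
print(value_range)

# Number of distinct users and restaurants in the sample.
n_users = sample["userID"].nunique()
n_places = sample["placeID"].nunique()
print(n_users, n_places)
```

Running the same checks on the full `df` would show the actual rating scale and the number of restaurants covered.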
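Analyzing the rating data
If we care about the overall rating and the food rating of each restaurant, we can start by looking at the mean of these ratings per restaurant. Here we use the pivot_table method: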
mean_ratings = df.pivot_table(values=['rating','food_rating'], index='placeID',
aggfunc='mean')
mean_ratings[:5]
| placeID | food_rating | rating |
|---|---|---|
| 132560 | 1.00 | 0.50 |
| 132561 | 1.00 | 0.75 |
| 132564 | 1.25 | 1.25 |
| 132572 | 1.00 | 1.00 |
| 132583 | 1.00 | 1.00 |
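For a per-group mean like this, `pivot_table` and `groupby(...).mean()` should give the same numbers. The following is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical) that compares the two approaches column by column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two restaurants with a handful of ratings each.
toy = pd.DataFrame({
    "placeID": [1, 1, 2, 2, 2],
    "rating": [0, 1, 2, 2, 1],
    "food_rating": [1, 1, 2, 1, 2],
})

via_pivot = toy.pivot_table(values=["rating", "food_rating"],
                            index="placeID", aggfunc="mean")
via_groupby = toy.groupby("placeID")[["rating", "food_rating"]].mean()

# Compare with an explicit column order so the check does not depend
# on how either method orders its columns.
cols = ["food_rating", "rating"]
same = np.allclose(via_pivot[cols].to_numpy(), via_groupby[cols].to_numpy())
print(via_pivot)
print(same)
```

Which form to use is mostly a matter of taste; `pivot_table` reads well when you later want to add a `columns=` dimension.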
Next, count how many ratings each placeID received:
ratings_by_place = df.groupby('placeID').size()
ratings_by_place[:10]
placeID
132560 4
132561 4
132564 4
132572 15
132583 4
132584 6
132594 5
132608 6
132609 5
132613 6
dtype: int64
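As a cross-check, the same counts can be obtained with `value_counts`. This is a small sketch on hypothetical data (the `toy` frame is made up for illustration); `sort_index()` is applied because `value_counts` orders by count rather than by placeID.

```python
import pandas as pd

# Hypothetical toy data: three restaurants with 3, 1 and 2 ratings.
toy = pd.DataFrame({"placeID": [10, 10, 10, 20, 30, 30]})

by_size = toy.groupby("placeID").size()
by_counts = toy["placeID"].value_counts().sort_index()

print(by_size.to_dict())    # {10: 3, 20: 1, 30: 2}
print(by_counts.to_dict())  # {10: 3, 20: 1, 30: 2}
```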
If a restaurant has too few ratings, its averages are not reliable, so we keep only the restaurants with at least 4 ratings:
active_place = ratings_by_place.index[ratings_by_place >= 4]
active_place
Int64Index([132560, 132561, 132564, 132572, 132583, 132584, 132594, 132608,
132609, 132613,
...
135080, 135081, 135082, 135085, 135086, 135088, 135104, 135106,
135108, 135109],
dtype='int64', name='placeID', length=124)
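The filter above uses `>=`, so restaurants with exactly 4 ratings are kept. A minimal sketch on made-up counts shows how the choice between `>=` and `>` changes the result; `MIN_RATINGS` and the `counts` series are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical rating counts for three restaurants.
counts = pd.Series({101: 3, 102: 4, 103: 15})
counts.index.name = "placeID"

MIN_RATINGS = 4  # hypothetical name for the cutoff used in the text

at_least = counts.index[counts >= MIN_RATINGS]   # keeps exactly-4 cases
more_than = counts.index[counts > MIN_RATINGS]   # drops exactly-4 cases

print(list(at_least))   # [102, 103]
print(list(more_than))  # [103]
```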
Select the mean ratings for these restaurants:
mean_ratings = mean_ratings.loc[active_place]
mean_ratings
| placeID | food_rating | rating |
|---|---|---|
| 132560 | 1.000000 | 0.500000 |
| 132561 | 1.000000 | 0.750000 |
| 132564 | 1.250000 | 1.250000 |
| 132572 | 1.000000 | 1.000000 |
| 132583 | 1.000000 | 1.000000 |
| ... | ... | ... |
| 135088 | 1.166667 | 1.000000 |
| 135104 | 1.428571 | 0.857143 |
| 135106 | 1.200000 | 1.200000 |
| 135108 | 1.181818 | 1.181818 |
| 135109 | 1.250000 | 1.000000 |
124 rows × 2 columns
Sort by rating and take the 10 highest-rated restaurants:
top_ratings = mean_ratings.sort_values(by='rating', ascending=False)
top_ratings[:10]
| placeID | food_rating | rating |
|---|---|---|
| 132955 | 1.800000 | 2.000000 |
| 135034 | 2.000000 | 2.000000 |
| 134986 | 2.000000 | 2.000000 |
| 132922 | 1.500000 | 1.833333 |
| 132755 | 2.000000 | 1.800000 |
| 135074 | 1.750000 | 1.750000 |
| 135013 | 2.000000 | 1.750000 |
| 134976 | 1.750000 | 1.750000 |
| 135055 | 1.714286 | 1.714286 |
| 135075 | 1.692308 | 1.692308 |
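When only the top few rows are needed, `DataFrame.nlargest` is a possible alternative to sorting the whole frame and slicing. The sketch below uses a hypothetical `toy_means` frame with distinct ratings; with ties, the two approaches may order the tied rows differently.

```python
import pandas as pd

# Hypothetical per-restaurant means, indexed by placeID.
toy_means = pd.DataFrame(
    {"food_rating": [1.0, 2.0, 1.5, 1.8], "rating": [0.5, 2.0, 1.8, 1.2]},
    index=pd.Index([1, 2, 3, 4], name="placeID"),
)

# Full sort followed by a slice, as in the text.
top2_sorted = toy_means.sort_values(by="rating", ascending=False)[:2]

# Same two rows via nlargest, which avoids an explicit full sort.
top2_nlargest = toy_means.nlargest(2, "rating")

print(list(top2_sorted.index))    # [2, 3]
print(list(top2_nlargest.index))  # [2, 3]
```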
We can also compute the difference between the mean overall rating and the mean food rating, and store it in a new column named diff:
mean_ratings['diff'] = mean_ratings['rating'] - mean_ratings['food_rating']
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]
| placeID | food_rating | rating | diff |
|---|---|---|---|
| 132667 | 2.000000 | 1.250000 | -0.750000 |
| 132594 | 1.200000 | 0.600000 | -0.600000 |
| 132858 | 1.400000 | 0.800000 | -0.600000 |
| 135104 | 1.428571 | 0.857143 | -0.571429 |
| 132560 | 1.000000 | 0.500000 | -0.500000 |
| 135027 | 1.375000 | 0.875000 | -0.500000 |
| 132740 | 1.250000 | 0.750000 | -0.500000 |
| 134992 | 1.500000 | 1.000000 | -0.500000 |
| 132706 | 1.250000 | 0.750000 | -0.500000 |
| 132870 | 1.000000 | 0.600000 | -0.400000 |
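Reversing the sorted data gives the 10 restaurants with the largest positive difference, that is, where the overall rating most exceeds the food rating: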
sorted_by_diff[::-1][:10]
| placeID | food_rating | rating | diff |
|---|---|---|---|
| 134987 | 0.500000 | 1.000000 | 0.500000 |
| 132937 | 1.000000 | 1.500000 | 0.500000 |
| 135066 | 1.000000 | 1.500000 | 0.500000 |
| 132851 | 1.000000 | 1.428571 | 0.428571 |
| 135049 | 0.600000 | 1.000000 | 0.400000 |
| 132922 | 1.500000 | 1.833333 | 0.333333 |
| 135030 | 1.333333 | 1.583333 | 0.250000 |
| 135063 | 1.000000 | 1.250000 | 0.250000 |
| 132626 | 1.000000 | 1.250000 | 0.250000 |
| 135000 | 1.000000 | 1.250000 | 0.250000 |
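Because `diff` is `rating - food_rating`, the two ends of the sorted frame mean opposite things: the most negative values are restaurants whose food is rated above the overall experience, and the most positive values are the reverse. A small sketch on hypothetical means (the `toy_means` values are made up) illustrates this; note that `tail` keeps ascending order, unlike the `[::-1]` reversal used above.

```python
import pandas as pd

# Hypothetical per-restaurant means, indexed by placeID.
toy_means = pd.DataFrame(
    {"food_rating": [2.0, 1.0, 0.5], "rating": [1.25, 1.0, 1.0]},
    index=pd.Index([1, 2, 3], name="placeID"),
)

toy_means["diff"] = toy_means["rating"] - toy_means["food_rating"]
ordered = toy_means.sort_values(by="diff")

print(ordered.head(1))  # food rated well above overall (diff = -0.75)
print(ordered.tail(1))  # overall rated well above food (diff = 0.5)
```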
Compute the standard deviation of rating for each restaurant and take the 10 largest:
# Standard deviation of rating grouped by placeID
rating_std_by_place = df.groupby('placeID')['rating'].std()
# Filter down to active_place
rating_std_by_place = rating_std_by_place.loc[active_place]
# Order Series by value in descending order
rating_std_by_place.sort_values(ascending=False)[:10]
placeID
134987 1.154701
135049 1.000000
134983 1.000000
135053 0.991031
135027 0.991031
132847 0.983192
132767 0.983192
132884 0.983192
135082 0.971825
132706 0.957427
Name: rating, dtype: float64
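One detail worth keeping in mind when reading these numbers: pandas' `std` computes the sample standard deviation (`ddof=1`) by default, while NumPy's `np.std` defaults to the population form (`ddof=0`). The sketch below uses a hypothetical set of four ratings (not taken from the dataset) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for one restaurant: two low, two high.
ratings = pd.Series([0, 0, 2, 2])

sample_std = ratings.std()             # pandas default: ddof=1
population_std = ratings.std(ddof=0)   # matches np.std's default

print(round(sample_std, 6))  # 1.154701
print(population_std)        # 1.0
print(np.isclose(population_std, np.std(ratings.to_numpy())))
```

With only a handful of ratings per restaurant, the gap between the two definitions is noticeable, which is another reason to filter out restaurants with very few ratings before comparing spreads.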
This article is also published at http://www.flydean.com/02-pandas-restaurant/