In last week's article, "Extracting data from different sheets with Python," we learned how to retrieve and unify data from different sources (Google Sheets, CSV and Excel). This tutorial is independent of that one, so don't worry if you missed last week's post.
In this article you will learn:
how to preprocess and merge data,
how to explore and analyze the data,
how to build nice charts to visualize the results.
This article is for:
people who work with data on a regular basis,
people with a basic understanding of Python and Pandas.
Scenario overview:
Your task is to find ways to improve your sales team's performance. In our hypothetical scenario, potential customers have fairly spontaneous demand. When a customer expresses demand, your sales team logs an order lead in the system. A sales rep then arranges a meeting, which takes place around the time the order lead was discovered. The reps have an expense budget that covers the meeting and the cost of a meal. They pay these expenses themselves and hand the invoices to the accounting team. After the potential customer decides whether to accept your offer, your diligent sales reps track whether the order lead converted into a sale.
We will work with three datasets:
order_leads (contains all order leads and conversion information)
sales_team (contains the companies and the responsible sales reps)
invoices (provides invoice and participant information)
Loading packages and plotting settings:
import json
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(
    font_scale=1.5,
    style="whitegrid",
    rc={'figure.figsize':(20,7)}
)
If the sns.set call above fails, you may need to install or upgrade seaborn first:

!pip install seaborn
Loading the data:
The data is available at: https://github.com/FBosler/Medium-Data-Exploration
sales_team = pd.read_csv('sales_team.csv')
order_leads = pd.read_csv('order_leads.csv')
invoices = pd.read_csv('invoices.csv')
The first two rows of the order_leads dataset
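To reproduce that preview yourself, a quick peek at the loaded frames is enough; the exact column names come from the CSV files in the repo:

# Peek at the first two rows of each dataset
print(order_leads.head(2))
print(sales_team.head(2))
print(invoices.head(2))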
Exploring the data:
Trend of the overall conversion rate:
_ = order_leads.set_index(pd.DatetimeIndex(order_leads.Date)).groupby(
    pd.Grouper(freq='D')
)['Converted'].mean()

ax = _.rolling(60).mean().plot(figsize=(20,7), title='Conversion Rate Over Time')

vals = ax.get_yticks()
ax.set_yticklabels(['{:,.0f}%'.format(x*100) for x in vals])
sns.despine()
1. We use the underscore "_" as a throwaway variable name. I typically do this for one-off variables that I won't use again later.
2. We apply pd.DatetimeIndex to order_leads.Date and set it as the index.
3. We group the data by day using pd.Grouper(freq='D'). Alternatively, you could change the frequency to W, M, Q or Y (weekly, monthly, quarterly or yearly); see the sketch after this list.
4. We take the mean of "Converted" per day, i.e. the conversion rate of that day's orders.
5. We use .rolling(60) together with .mean() to get a 60-day rolling average.
6. Finally, we format the y-tick labels so that they display percent signs.
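As an example of point 3, here is a minimal sketch of the same aggregation at weekly granularity; only the frequency string changes, everything else mirrors the snippet above:

# Same conversion-rate calculation, grouped by week instead of by day
weekly = order_leads.set_index(pd.DatetimeIndex(order_leads.Date)).groupby(
    pd.Grouper(freq='W')
)['Converted'].mean()
weekly.plot(figsize=(20,7), title='Weekly Conversion Rate')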
Conversion rates of different sales reps:
orders_with_sales_team = pd.merge(order_leads, sales_team, on=['Company Id','Company Name'])
ax = sns.distplot(orders_with_sales_team.groupby('Sales Rep Id')['Converted'].mean(), kde=False)
vals = ax.get_xticks()
ax.set_xticklabels(['{:,.0f}%'.format(x*100) for x in vals])
ax.set_title('Number of sales reps by conversion rate')
sns.despine()
To see whether the number of accounts a rep manages affects their conversion rate, we facet the distribution of conversion rates by number of accounts:

def vertical_mean_line(x, **kwargs):
    ls = {"0": "-", "1": "--"}
    plt.axvline(x.mean(), linestyle=ls[kwargs.get("label", "0")],
                color=kwargs.get("color", "r"))
    txkw = dict(size=15, color=kwargs.get("color", "r"))
    tx = "mean: {:.1f}%\n(std: {:.1f}%)".format(x.mean()*100, x.std()*100)
    label_x_pos_adjustment = 0.015
    label_y_pos_adjustment = 20
    plt.text(x.mean() + label_x_pos_adjustment, label_y_pos_adjustment, tx, **txkw)

sns.set(
    font_scale=1.5,
    style="whitegrid"
)

_ = orders_with_sales_team.groupby('Sales Rep Id').agg({
    'Converted': np.mean,
    'Company Id': pd.Series.nunique
})
_.columns = ['conversion rate', 'number of accounts']

g = sns.FacetGrid(_, col="number of accounts", height=4, aspect=0.9, col_wrap=5)
g.map(sns.kdeplot, "conversion rate", shade=True)
g.set(xlim=(0, 0.35))
g.map(vertical_mean_line, "conversion rate")
Next, let's look at the meal data in the invoices:

invoices['Date of Meal'] = pd.to_datetime(invoices['Date of Meal'])
invoices['Date of Meal'].dt.time.value_counts().sort_index()
It looks like we can group the meal times as follows:
invoices['Type of Meal'] = pd.cut(
    invoices['Date of Meal'].dt.hour,
    bins=[0,10,15,24],
    labels=['breakfast','lunch','dinner']
)
def replace(x):
    return x.replace("\n ", ",").replace("' '", "','").replace("'", '"')

invoices['Participants'] = invoices['Participants'].apply(lambda x: replace(x))
invoices['Number Participants'] = invoices['Participants'].apply(lambda x: len(json.loads(x)))
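To illustrate what the cleanup does, here is a hypothetical raw value (the names are made up; actual entries in invoices.csv may be formatted slightly differently):

raw = "['John Doe' 'Jane Roe']"           # hypothetical example value
cleaned = replace(raw)                    # -> '["John Doe","Jane Roe"]'
print(json.loads(cleaned))                # -> ['John Doe', 'Jane Roe']
print(len(json.loads(cleaned)))           # -> 2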
Some orders still matched more than one meal. This can happen when, for example, there are two orders and also two meals at around the same time; both order leads then match both meals. To remove those duplicates, we keep, for each order, only the meal closest in time to the order.
# combine order_leads with invoice data
orders_with_invoices = pd.merge(order_leads, invoices, how='left', on='Company Id')

# calculate days between order leads and invoices
orders_with_invoices['Days of meal before order'] = (
    pd.to_datetime(orders_with_invoices['Date']) - orders_with_invoices['Date of Meal']
).dt.days

# limit to only meals that are within 5 days of the order
orders_with_invoices = orders_with_invoices[abs(orders_with_invoices['Days of meal before order']) < 5]

# To make sure that we don't cross-assign meals to multiple orders and thereby create duplicates,
# we first sort our data by absolute distance to the orders
orders_with_invoices = orders_with_invoices.loc[
    abs(orders_with_invoices['Days of meal before order']).sort_values().index
]

# keep the first (i.e. closest to the order) meal per order
orders_with_invoices = orders_with_invoices.drop_duplicates(subset=['Order Id'])

orders_without_invoices = order_leads[~order_leads['Order Id'].isin(orders_with_invoices['Order Id'].unique())]

orders_with_meals = pd.concat([orders_with_invoices, orders_without_invoices], sort=True)
A sample of the merged dataset
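A quick sanity check (a sketch, not part of the original code) to confirm that the de-duplication worked and that every order lead appears exactly once:

# Each order lead should occur exactly once in the combined frame
assert orders_with_meals['Order Id'].is_unique
assert len(orders_with_meals) == len(order_leads)
# Share of order leads without a matching meal (NaN in the invoice columns)
print(orders_with_meals['Date of Meal'].isna().mean())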
def plot_bars(data, x_col, y_col):
    data = data.reset_index()
    sns.set(
        font_scale=1.5,
        style="whitegrid",
        rc={'figure.figsize':(20,7)}
    )
    g = sns.barplot(x=x_col, y=y_col, data=data, color='royalblue')

    for p in g.patches:
        g.annotate(
            format(p.get_height(), '.2%'),
            (p.get_x() + p.get_width() / 2., p.get_height()),
            ha='center',
            va='center',
            xytext=(0, 10),
            textcoords='offset points'
        )

    vals = g.get_yticks()
    g.set_yticklabels(['{:,.0f}%'.format(x*100) for x in vals])

    sns.despine()
The effect of the type of meal:
orders_with_meals['Type of Meal'].fillna('no meal', inplace=True)
_ = orders_with_meals.groupby('Type of Meal').agg({'Converted': np.mean})
plot_bars(_, x_col='Type of Meal', y_col='Converted')
The effect of timing (i.e., whether the meal took place before or after the order):
_ = orders_with_meals.groupby(['Days of meal before order']).agg(
    {'Converted': np.mean}
)
plot_bars(data=_, x_col='Days of meal before order', y_col='Converted')
Combining all the results:
def draw_heatmap(data, inner_row, inner_col, outer_row, outer_col, values):
    sns.set(font_scale=1)
    fg = sns.FacetGrid(
        data,
        row=outer_row,
        col=outer_col,
        margin_titles=True
    )

    position = left, bottom, width, height = 1.4, .2, .1, .6
    cbar_ax = fg.fig.add_axes(position)

    fg.map_dataframe(
        draw_heatmap_facet,
        x_col=inner_col,
        y_col=inner_row,
        values=values,
        cbar_ax=cbar_ax,
        vmin=0,
        vmax=.4
    )

    fg.fig.subplots_adjust(right=1.3)
    plt.show()

def draw_heatmap_facet(*args, **kwargs):
    data = kwargs.pop('data')
    x_col = kwargs.pop('x_col')
    y_col = kwargs.pop('y_col')
    values = kwargs.pop('values')
    d = data.pivot(index=y_col, columns=x_col, values=values)
    annot = round(d, 4).values
    cmap = sns.color_palette("RdYlGn", 30)
    # cmap = sns.color_palette("PuBu", 30)  # alternative color coding
    sns.heatmap(d, **kwargs, annot=annot, center=0, fmt=".1%", cmap=cmap, linewidth=.5)
# Aggregate the data a bit
orders_with_meals['Meal Price / Order Value'] = orders_with_meals['Meal Price']/orders_with_meals['Order Value']
orders_with_meals['Meal Price / Order Value'] = pd.qcut(
    orders_with_meals['Meal Price / Order Value']*-1,
    5,
    labels=['Least Expensive','Less Expensive','Proportional','More Expensive','Most Expensive'][::-1]
)

orders_with_meals['Timing of Meal'] = pd.qcut(
    orders_with_meals['Days of meal before order'],
    3,
    labels=['After Order','Around Order','Before Order']
)

data = orders_with_meals[orders_with_meals['Type of Meal'] != 'no meal'].groupby(
    ['Timing of Meal','Number Participants','Type of Meal','Meal Price / Order Value']
).agg({'Converted': np.mean}).unstack().fillna(0).stack().reset_index()
Running the code snippet below generates a multi-dimensional heatmap.
draw_heatmap(
    data=data,
    outer_row='Timing of Meal',
    outer_col='Type of Meal',
    inner_row='Meal Price / Order Value',
    inner_col='Number Participants',
    values='Converted'
)
The heatmap is pretty, but it is a bit hard to read at first. Let's walk through it.
The chart summarizes the effect of 4 different dimensions:
Timing of Meal: After Order, Around Order, Before Order (outer rows)
Type of Meal: breakfast, dinner, lunch (outer columns)
Meal Price / Order Value: Least Expensive, Less Expensive, Proportional, More Expensive, Most Expensive (inner rows)
Number of Participants: 1, 2, 3, 4, 5 (inner columns)
It certainly looks like the cells get darker/higher toward the bottom of the chart, suggesting that:
meals before the order lead to higher conversion rates;
dinner seems to convert better when there is only one participant;
meals that are more expensive relative to the order value seem to have a positive effect on the conversion rate;
…
Findings:
1. Don't give your sales reps more than 9 accounts (conversion rates drop off quickly beyond that);
2. Make sure every order lead gets a meeting/meal (this roughly doubles the conversion rate; see the quick check after this list);
3. Dinner works best when only one person is assigned to the customer;
4. Your sales reps should spend roughly 8% to 10% of the order value on the meal;
5. Timing is key: ideally, let your sales reps know as early as possible when a deal is about to come in.
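For point 2, a quick way to check the meal vs. no-meal split on the combined data from above (a sketch, not part of the original notebook):

# Conversion rate of leads with any meal vs. leads without a meal
has_meal = orders_with_meals['Type of Meal'] != 'no meal'
print(orders_with_meals.groupby(has_meal)['Converted'].mean())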
Code and data: https://github.com/FBosler/Medium-Data-Exploration
A note on the heatmap:
To work around a potential formatting issue, you can downgrade matplotlib to version 3.1.0: first uninstall it (this has to be done from the terminal), then run the statement below to reinstall the older version.
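For reference, the terminal uninstall step would typically be the standard pip command (this exact line is not part of the original article):

pip uninstall matplotlib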
!pip install matplotlib==3.1.0
Feel free to reach out if you have any questions.