In last week's article, "Extracting data from different sheets with Python," we learned how to retrieve and unify data from different sources (Google Sheets, CSV and Excel). This tutorial is independent of that one, so don't worry if you missed last week's post.
In this article you will learn:
how to preprocess and merge data,
how to explore and analyze the data,
how to build nice charts to visualize the results.
This article is for:
people who work with data on a regular basis,
people with a basic understanding of Python and Pandas.
Scenario overview:
Your task is to find ways to improve your sales team's performance. In our hypothetical scenario, potential customers have fairly spontaneous demand. When a customer expresses demand, your sales team logs an order lead in the system. A sales rep then arranges a meeting, which takes place around the time the order lead was discovered. The reps have an expense budget that covers the meeting and the cost of a meal. They pay these expenses themselves and hand the invoices to the accounting team. After the potential customer decides whether to accept your offer, your diligent sales reps track whether the order lead converted into a sale.
We will work with three datasets:
order_leads (contains all order leads and conversion information)
sales_team (contains the companies and the responsible sales reps)
invoices (provides invoice and participant information)
Loading packages and plotting settings:
import json
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(
    font_scale=1.5,
    style="whitegrid",
    rc={'figure.figsize':(20,7)}
)
If the sns.set call above fails, you may need to install or upgrade seaborn first:

!pip install seaborn
Loading the data:
The data is available at: https://github.com/FBosler/Medium-Data-Exploration
sales_team = pd.read_csv('sales_team.csv')
order_leads = pd.read_csv('order_leads.csv')
invoices = pd.read_csv('invoices.csv')
The first two rows of the order_leads dataset
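To reproduce that preview yourself, a quick peek at the loaded frames is enough; the exact column names come from the CSV files in the repo:

# Peek at the first two rows of each dataset
print(order_leads.head(2))
print(sales_team.head(2))
print(invoices.head(2))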
Exploring the data:
Trend of the overall conversion rate:
_ = order_leads.set_index(pd.DatetimeIndex(order_leads.Date)).groupby(
    pd.Grouper(freq='D')
)['Converted'].mean()

ax = _.rolling(60).mean().plot(figsize=(20,7), title='Conversion Rate Over Time')

vals = ax.get_yticks()
ax.set_yticklabels(['{:,.0f}%'.format(x*100) for x in vals])
sns.despine()
1. We use the underscore "_" as a throwaway variable name. I typically do this for one-off variables that I won't use again later.
2. We apply pd.DatetimeIndex to order_leads.Date and set it as the index.
3. We group the data by day using pd.Grouper(freq='D'). Alternatively, you could change the frequency to W, M, Q or Y (weekly, monthly, quarterly or yearly); see the sketch after this list.
4. We take the mean of "Converted" per day, i.e. the conversion rate of that day's orders.
5. We use .rolling(60) together with .mean() to get a 60-day rolling average.
6. Finally, we format the y-tick labels so that they display percent signs.
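As an example of point 3, here is a minimal sketch of the same aggregation at weekly granularity; only the frequency string changes, everything else mirrors the snippet above:

# Same conversion-rate calculation, grouped by week instead of by day
weekly = order_leads.set_index(pd.DatetimeIndex(order_leads.Date)).groupby(
    pd.Grouper(freq='W')
)['Converted'].mean()
weekly.plot(figsize=(20,7), title='Weekly Conversion Rate')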
Conversion rates of different sales reps:
orders_with_sales_team = pd.merge(order_leads, sales_team, on=['Company Id','Company Name'])
ax = sns.distplot(orders_with_sales_team.groupby('Sales Rep Id')['Converted'].mean(), kde=False)
vals = ax.get_xticks()
ax.set_xticklabels(['{:,.0f}%'.format(x*100) for x in vals])
ax.set_title('Number of sales reps by conversion rate')
sns.despine()
To see whether the number of accounts a rep manages affects their conversion rate, we facet the distribution of conversion rates by number of accounts:

def vertical_mean_line(x, **kwargs):
    ls = {"0": "-", "1": "--"}
    plt.axvline(x.mean(), linestyle=ls[kwargs.get("label", "0")],
                color=kwargs.get("color", "r"))
    txkw = dict(size=15, color=kwargs.get("color", "r"))
    tx = "mean: {:.1f}%\n(std: {:.1f}%)".format(x.mean()*100, x.std()*100)
    label_x_pos_adjustment = 0.015
    label_y_pos_adjustment = 20
    plt.text(x.mean() + label_x_pos_adjustment, label_y_pos_adjustment, tx, **txkw)

sns.set(
    font_scale=1.5,
    style="whitegrid"
)

_ = orders_with_sales_team.groupby('Sales Rep Id').agg({
    'Converted': np.mean,
    'Company Id': pd.Series.nunique
})
_.columns = ['conversion rate', 'number of accounts']

g = sns.FacetGrid(_, col="number of accounts", height=4, aspect=0.9, col_wrap=5)
g.map(sns.kdeplot, "conversion rate", shade=True)
g.set(xlim=(0, 0.35))
g.map(vertical_mean_line, "conversion rate")
Next, let's look at the meal data in the invoices:

invoices['Date of Meal'] = pd.to_datetime(invoices['Date of Meal'])
invoices['Date of Meal'].dt.time.value_counts().sort_index()
It looks like we can group the meal times as follows:
invoices['Type of Meal'] = pd.cut(
    invoices['Date of Meal'].dt.hour,
    bins=[0,10,15,24],
    labels=['breakfast','lunch','dinner']
)
def replace(x):
    return x.replace("\n ", ",").replace("' '", "','").replace("'", '"')

invoices['Participants'] = invoices['Participants'].apply(lambda x: replace(x))
invoices['Number Participants'] = invoices['Participants'].apply(lambda x: len(json.loads(x)))
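To illustrate what the cleanup does, here is a hypothetical raw value (the names are made up; actual entries in invoices.csv may be formatted slightly differently):

raw = "['John Doe' 'Jane Roe']"           # hypothetical example value
cleaned = replace(raw)                    # -> '["John Doe","Jane Roe"]'
print(json.loads(cleaned))                # -> ['John Doe', 'Jane Roe']
print(len(json.loads(cleaned)))           # -> 2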
Some orders still matched more than one meal. This can happen when, for example, there are two orders and also two meals at around the same time; both order leads then match both meals. To remove those duplicates, we keep, for each order, only the meal closest in time to the order.
# combine order_leads with invoice data
orders_with_invoices = pd.merge(order_leads, invoices, how='left', on='Company Id')

# calculate days between order leads and invoices
orders_with_invoices['Days of meal before order'] = (
    pd.to_datetime(orders_with_invoices['Date']) - orders_with_invoices['Date of Meal']
).dt.days

# limit to only meals that are within 5 days of the order
orders_with_invoices = orders_with_invoices[abs(orders_with_invoices['Days of meal before order']) < 5]

# To make sure that we don't cross-assign meals to multiple orders and thereby create duplicates,
# we first sort our data by absolute distance to the orders
orders_with_invoices = orders_with_invoices.loc[
    abs(orders_with_invoices['Days of meal before order']).sort_values().index
]

# keep the first (i.e. closest to the order) meal per order
orders_with_invoices = orders_with_invoices.drop_duplicates(subset=['Order Id'])

orders_without_invoices = order_leads[~order_leads['Order Id'].isin(orders_with_invoices['Order Id'].unique())]

orders_with_meals = pd.concat([orders_with_invoices, orders_without_invoices], sort=True)
A sample of the merged dataset
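A quick sanity check (a sketch, not part of the original code) to confirm that the de-duplication worked and that every order lead appears exactly once:

# Each order lead should occur exactly once in the combined frame
assert orders_with_meals['Order Id'].is_unique
assert len(orders_with_meals) == len(order_leads)
# Share of order leads without a matching meal (NaN in the invoice columns)
print(orders_with_meals['Date of Meal'].isna().mean())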
def plot_bars(data, x_col, y_col):
    data = data.reset_index()
    sns.set(
        font_scale=1.5,
        style="whitegrid",
        rc={'figure.figsize':(20,7)}
    )
    g = sns.barplot(x=x_col, y=y_col, data=data, color='royalblue')

    for p in g.patches:
        g.annotate(
            format(p.get_height(), '.2%'),
            (p.get_x() + p.get_width() / 2., p.get_height()),
            ha='center',
            va='center',
            xytext=(0, 10),
            textcoords='offset points'
        )

    vals = g.get_yticks()
    g.set_yticklabels(['{:,.0f}%'.format(x*100) for x in vals])

    sns.despine()
The effect of the type of meal:
orders_with_meals['Type of Meal'].fillna('no meal', inplace=True)
_ = orders_with_meals.groupby('Type of Meal').agg({'Converted': np.mean})
plot_bars(_, x_col='Type of Meal', y_col='Converted')
The effect of timing (i.e., whether the meal took place before or after the order):
_ = orders_with_meals.groupby(['Days of meal before order']).agg(
    {'Converted': np.mean}
)
plot_bars(data=_, x_col='Days of meal before order', y_col='Converted')
Combining all the results:
def draw_heatmap(data, inner_row, inner_col, outer_row, outer_col, values):
    sns.set(font_scale=1)
    fg = sns.FacetGrid(
        data,
        row=outer_row,
        col=outer_col,
        margin_titles=True
    )

    position = left, bottom, width, height = 1.4, .2, .1, .6
    cbar_ax = fg.fig.add_axes(position)

    fg.map_dataframe(
        draw_heatmap_facet,
        x_col=inner_col,
        y_col=inner_row,
        values=values,
        cbar_ax=cbar_ax,
        vmin=0,
        vmax=.4
    )

    fg.fig.subplots_adjust(right=1.3)
    plt.show()

def draw_heatmap_facet(*args, **kwargs):
    data = kwargs.pop('data')
    x_col = kwargs.pop('x_col')
    y_col = kwargs.pop('y_col')
    values = kwargs.pop('values')
    d = data.pivot(index=y_col, columns=x_col, values=values)
    annot = round(d, 4).values
    cmap = sns.color_palette("RdYlGn", 30)
    # cmap = sns.color_palette("PuBu", 30)  # alternative color coding
    sns.heatmap(d, **kwargs, annot=annot, center=0, fmt=".1%", cmap=cmap, linewidth=.5)
# Aggregate the data a bit
orders_with_meals['Meal Price / Order Value'] = orders_with_meals['Meal Price']/orders_with_meals['Order Value']
orders_with_meals['Meal Price / Order Value'] = pd.qcut(
    orders_with_meals['Meal Price / Order Value']*-1,
    5,
    labels=['Least Expensive','Less Expensive','Proportional','More Expensive','Most Expensive'][::-1]
)

orders_with_meals['Timing of Meal'] = pd.qcut(
    orders_with_meals['Days of meal before order'],
    3,
    labels=['After Order','Around Order','Before Order']
)

data = orders_with_meals[orders_with_meals['Type of Meal'] != 'no meal'].groupby(
    ['Timing of Meal','Number Participants','Type of Meal','Meal Price / Order Value']
).agg({'Converted': np.mean}).unstack().fillna(0).stack().reset_index()
Running the code snippet below generates a multi-dimensional heatmap.
draw_heatmap(
    data=data,
    outer_row='Timing of Meal',
    outer_col='Type of Meal',
    inner_row='Meal Price / Order Value',
    inner_col='Number Participants',
    values='Converted'
)
The heatmap is pretty, but it is a bit hard to read at first. Let's walk through it.
The chart summarizes the effect of 4 different dimensions:
Timing of Meal: After Order, Around Order, Before Order (outer rows)
Type of Meal: breakfast, dinner, lunch (outer columns)
Meal Price / Order Value: Least Expensive, Less Expensive, Proportional, More Expensive, Most Expensive (inner rows)
Number of Participants: 1, 2, 3, 4, 5 (inner columns)
It certainly looks like the cells get darker/higher toward the bottom of the chart, suggesting that:
meals before the order lead to higher conversion rates;
dinner seems to convert better when there is only one participant;
meals that are more expensive relative to the order value seem to have a positive effect on the conversion rate;
…
Findings:
1. Don't give your sales reps more than 9 accounts (conversion rates drop off quickly beyond that);
2. Make sure every order lead gets a meeting/meal (this roughly doubles the conversion rate; see the quick check after this list);
3. Dinner works best when only one person is assigned to the customer;
4. Your sales reps should spend roughly 8% to 10% of the order value on the meal;
5. Timing is key: ideally, let your sales reps know as early as possible when a deal is about to come in.
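For point 2, a quick way to check the meal vs. no-meal split on the combined data from above (a sketch, not part of the original notebook):

# Conversion rate of leads with any meal vs. leads without a meal
has_meal = orders_with_meals['Type of Meal'] != 'no meal'
print(orders_with_meals.groupby(has_meal)['Converted'].mean())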
Code and data: https://github.com/FBosler/Medium-Data-Exploration
A note on the heatmap:
To work around a potential formatting issue, you can downgrade matplotlib to version 3.1.0: first uninstall it (this has to be done from the terminal), then run the statement below to reinstall the older version.
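For reference, the terminal uninstall step would typically be the standard pip command (this exact line is not part of the original article):

pip uninstall matplotlib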
!pip install matplotlib==3.1.0
Feel free to reach out if you have any questions.