0 Overview
The NFL uses its Game Pass product as a primary marketing lever (Pat Even, 2019), and viewership in Europe has been growing. However, problems remained in last season's Game Pass advertising strategy: on the market side, some countries received more investment yet returned very limited ROI, while others showed the opposite pattern. To address this, I chose the UK market for analysis and propose corresponding solutions aimed at expanding the market and improving ROI.
0.1 Structure
To address this problem, the article is organised into six parts.
Part 1 states the problem and gives an overview of the Game Pass business.
Part 2 covers importing the data into MySQL and joining the tables.
Part 3 covers the data-cleaning work in a Notebook.
Part 4 analyses Game Pass's advertising performance in Tableau and points out the problems in the UK market.
Part 5 takes the user-spending data attributable to UK advertising, applies feature engineering, and uses the engineered features for a simple machine-learning prediction.
Part 6 segments the users and, by comparing purchase amount and purchase frequency across groups, proposes corresponding simple marketing strategies.
0.2 Notes
(The data for this project come from the NFL and Two Circles. Under NFL rules the data cannot be made public; apologies for the inconvenience.)
The project can be browsed on maxaishaojun's Github (note that it may render poorly on Github mobile).
The "4 Problem Analysis" part can be viewed on the Tableau Dashboard: NFL GamePass 2018-19 Season Advertising Performance Analysis.
0.3 References
Pat, E. (2019) 'NFL Viewership Growth Throughout Europe Exposes Opportunities in the US', Front Office Sports, 7 March. Available at: frntofficesport.com/two-circles… (Accessed: 6 April 2019).
1 Project Background
(Note: the source material for this section comes from Two Circles.)
1.1 Problem Statement
Assume it is now the end of the 2018/19 NFL season and you are a member of the NFL Game Pass Europe marketing team. Your task is to analyse the effectiveness of advertising in acquiring new NFL Game Pass subscribers, mine purchaser profiles across the deployed advertising channels, and make recommendations for next season's digital advertising campaign, covering creative, audiences, geo-targeting, channel mix, and budget.
1.2 Project Overview
NFL Game Pass is the premier OTT subscription product for NFL fans outside the US and a top priority for the NFL's international business. Much like a "Netflix of the NFL", it gives fans access to every live game, game replays and highlights, NFL RedZone, the NFL Network 24/7 live TV channel, original content, shows and game archives, downloads, and more.
The NFL's popularity in Europe has grown in recent years, which is reflected in NFL Game Pass subscriber numbers across the continent. Maintaining this level of subscriber growth requires strategies that minimise churn, optimise acquisition, and build consideration among fan groups not yet ready to convert.
The largest component of NFL Game Pass Europe's subscriber-acquisition marketing is digital advertising, which accounts for roughly 46% of acquired subscribers each season. The campaign typically launches in the NFL preseason (August), peaks in intensity over the first four weeks of the NFL season (September), and ends after the Super Bowl (early February). A variety of paid channels are deployed, chiefly Google ads (search/PPC), Facebook and YouTube (social), and various programmatic display partners (display). Display ads also run on www.nfl.com and other NFL owned-and-operated digital assets (O&O). Budget, creative, and audience-targeting strategy vary by channel, and striking the right balance across the campaign is critical to its success.
NFL Game Pass Europe has just completed its 2018/19 digital advertising campaign and is entering the planning process for 2019/20. With access to a rich dataset covering ad performance and purchaser demographics, NFL Game Pass Europe is looking to mine the data for insights that can inform its future digital advertising strategy.
1.3 Appendix: 2018/19 NFL Game Pass Europe Product Definitions
2018/19 Audience Strategy and Definitions
2018/19 Market Strategy and Definitions
2018/19 Example Ad Units
2018/19 Marketing Plan and Promotions
2 Data Import
-- 1 Add numweek (a sequential week counter) and revenue_usd to MediaPerformanceData
-- numweek is built with user variables: it increments each time nflweek changes when rows are ordered by date
select
a.date, a.nflweek, a.numweek, a.platform, a.market, a.audience, round(a.`Spend (GBP)`*1.3, 2) as spend_usd, a.impressions, a.clicks, a.transactions, b.revenue_usd
from (
select *,
@number := case
when @weeks = nflweek then @number
else @number+1
end as numweek,
@weeks := nflweek as weeks
from mediaperformancedata
join (select @weeks := null, @number := 0) as variable
order by date
) a
join (
select
mt.date, mt.nflweek, mt.platform, mt.market, mt.audience, round(sum(`Revenue (USD)`),2) as revenue_usd
from
MediaTransactionsdata mt
join
Subscriptionsdata s on mt.transactionID = s.transactionid
group by
mt.date, mt.nflweek, mt.platform, mt.market, mt.audience
order by
mt.date, mt.platform, mt.market, mt.audience
) b
on
(a.date = b.date and a.nflweek = b.nflweek and a.platform = b.platform and a.market = b.market and a.audience = b.audience)
;
-- 2 Join the remaining three tables (subscriptions, media transactions, customers)
select
mt.transactionid, s.customerid, mt.date, mt.nflweek, mt.platform, mt.market, mt.audience,
c.`NFL Game Pass Segment`, c.gender, c.age, c.`NFL Tickets`, c.`NFL Shop`, c.`NFL Fantasy`, c.`New To NFL Database`, c.`Email Opt-In`, c.`Favourite Team`,
s.SKU, s.`Buy Type`, s.`Converted Free Trial`, s.`Revenue (USD)`
from
Subscriptionsdata s
left join
mediatransactionsdata mt on s.transactionid = mt.transactionid
left join
customersdata c on s.customerid = c.customerid
order by
mt.date
;
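Section 3 reads the query results from local CSV files. One way to produce those files is to run the two queries above through pandas and SQLAlchemy; the sketch below assumes a local MySQL database named nfl and assumes the queries have been saved to two .sql files first (the connection details, database name, and file paths are illustrative, not part of the original project).

import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string - adjust user, password, host and database name
engine = create_engine('mysql+pymysql://user:password@localhost:3306/nfl')

# The two queries shown above, saved as plain .sql files (hypothetical file names)
media_performance_sql = open('media_performance.sql').read()
nfl_adverts_sql = open('nfl_adverts.sql').read()

# Export the joined results to the CSVs used in Section 3
pd.read_sql(media_performance_sql, engine).to_csv('/Users/apple/Downloads/nfl/mediaperformance.csv', index=False)
pd.read_sql(nfl_adverts_sql, engine).to_csv('/Users/apple/Downloads/nfl/nfladverts.csv', index=False)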
3 Data Cleaning
import pandas as pd
import numpy as np
mp_file_path = '/Users/apple/Downloads/nfl/mediaperformance.csv'
nfl_file_path = '/Users/apple/Downloads/nfl/nfladverts.csv'
mp_data = pd.read_csv(mp_file_path)
nfl_data = pd.read_csv(nfl_file_path)
# Rename columns
nfl_data.columns = ['transaction_id', 'customer_id', 'date', 'nflweek', 'platform', 'market',
'audience', 'segment', 'gender', 'age_group', 'tickets',
'shop', 'fantasy', 'new_to_database', 'email_opt_in',
'favourite_team', 'sku', 'buy_type', 'converted_free_trial',
'revenue_usd']
nfl_data['revenue_usd'].describe()
# mp_data.columns
# Duplicate values
mp_unique = mp_data.groupby(['date', 'nflweek', 'numweek', 'platform', 'market', 'audience']).size().reset_index(name='Freq')
mp_unique = mp_unique.sort_values(by=['Freq'], ascending=False)
print(mp_unique)
nfl_data.head()
nfl_unique = nfl_data.groupby(['transaction_id']).size().reset_index(name='Freq')
nfl_unique = nfl_unique.sort_values(by=['Freq'], ascending=False)
print(nfl_unique)
# data.drop_duplicates(subset ="columns Name", keep = False, inplace = True)
# Missing values
mp_null_total = mp_data.isnull().sum(axis=0).sort_values(ascending=False)
mp_null_percent = (mp_data.isnull().sum()/len(mp_data.index)).sort_values(ascending=False).round(3)
mp_missing_data = pd.concat([mp_null_total, mp_null_percent], axis=1, keys=['Total', 'Percent'])
nfl_null_total = nfl_data.isnull().sum(axis=0).sort_values(ascending=False)
nfl_null_percent = (nfl_data.isnull().sum()/len(nfl_data.index)).sort_values(ascending=False).round(3)
nfl_missing_data = pd.concat([nfl_null_total, nfl_null_percent], axis=1, keys=['Total', 'Percent'])
# Drop rows without a transaction_id
nfl_data = nfl_data.dropna(subset=['transaction_id'])
# Encode missing favourite_team as 'No Team'
nfl_data['favourite_team'] = nfl_data['favourite_team'].fillna('No Team')
# Encode missing gender as 'U'
nfl_data['gender'] = nfl_data['gender'].fillna('U')
# gender has too many missing values and no obvious logical link to other columns; checked whether male/female customer ids differ on average - they do not
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
gender_predict = nfl_data[nfl_data.gender.notnull()]
gender_pivot = pd.pivot_table(gender_predict, values='customer_id', index=['gender'], aggfunc=np.mean)
# Fill buy_type according to the business logic
# nfl_data['buy_type'].unique()
nfl_data['buy_type'] = nfl_data.apply(
lambda row:
'Buy Now' if row['sku'] =='Free' else
('Buy Now' if (row['converted_free_trial'] == 0 and row['revenue_usd'] > 0) else
('Free Trial' if (row['converted_free_trial'] == 0 and row['revenue_usd'] == 0) or (row['converted_free_trial'] == 1 and row['revenue_usd'] > 0) else
row['buy_type'])), axis=1)
# nfl_data.isnull().sum(axis=0).sort_values(ascending=False)
# Invalid / inconsistent values
# Convert the binary columns to 0/1 for easier computation, and clean up invalid values
# nfl_data['market'].unique()
nfl_data['tickets'] = np.where(nfl_data['tickets'] == 'N', 0, 1)
nfl_data['shop'] = np.where(nfl_data['shop'] == 'N', 0, 1)
nfl_data['fantasy'] = np.where(nfl_data['fantasy'] == False, 0, 1)
nfl_data['new_to_database'] = np.where(nfl_data['new_to_database'] == 'N', 0, 1)
nfl_data['email_opt_in'] = np.where(nfl_data['email_opt_in'] == 'False', 0, np.where(nfl_data['email_opt_in'] == '0', 0, 1))
nfl_data['buy_type'] = np.where(nfl_data['buy_type'] == 'Free Trial', 0, 1)
nfl_data['favourite_team'] = np.where(nfl_data['favourite_team'] == 'NFL', 'No Team', nfl_data['favourite_team'])
nfl_data['favourite_team'] = np.where(nfl_data['favourite_team'] == 'No Team', 0, 1)
# Correct logical inconsistencies among sku, buy_type and converted_free_trial
# https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o
# https://discuss.analyticsvidhya.com/t/how-to-resolve-python-error-cannot-compare-a-dtyped-int64-array-with-a-scalar-of-type-bool/73065
nfl_data['sku'] = np.where((nfl_data['revenue_usd'] == 0) & (nfl_data['sku'] != 'Pro'), 'Free', nfl_data['sku'])
nfl_data['sku'] = np.where((nfl_data['revenue_usd'] > 0) & (nfl_data['sku'] == 'Free'), 'Pro', nfl_data['sku'])
nfl_data['buy_type'] = np.where((nfl_data['sku'] == 'Pro') & (nfl_data['revenue_usd'] == 0), 0, nfl_data['buy_type'])
nfl_data['converted_free_trial'] = np.where((nfl_data['sku'] == 'Pro') & (nfl_data['buy_type'] == 0) & (nfl_data['revenue_usd'] > 0), 1, nfl_data['converted_free_trial'])
nfl_data['converted_free_trial'] = np.where((nfl_data['sku'] == 'Pro') & (nfl_data['buy_type'] == 1) & (nfl_data['revenue_usd'] > 0), 0, nfl_data['converted_free_trial'])
mp_data.to_excel(r'/Users/apple/Downloads/nfl/mp_data.xlsx', index=True)
nfl_data.to_csv(r'/Users/apple/Downloads/nfl/nfl_data.csv', index=True)
4 Problem Analysis
The problem-analysis part can be viewed on the Tableau Dashboard: NFL GamePass 2018-19 Season Advertising Performance Analysis.
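The dashboard itself lives in Tableau, but as a rough illustration of the comparison behind it, the sketch below aggregates spend and revenue by market from the mp_data export produced in Sections 2-3 (column names follow the first SQL query; the file path is illustrative, and using revenue divided by spend as the performance ratio is my own simplification).

import pandas as pd

# Illustrative path - the Excel export written at the end of Section 3
mp_data = pd.read_excel('/Users/apple/Downloads/nfl/mp_data.xlsx')

# Total spend and attributed revenue per market, plus a simple return-on-spend ratio
market_perf = mp_data.groupby('market')[['spend_usd', 'revenue_usd']].sum()
market_perf['revenue_per_spend'] = (market_perf['revenue_usd'] / market_perf['spend_usd']).round(2)
print(market_perf.sort_values('revenue_per_spend'))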
5 Feature Engineering
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
user_file_path = '/Users/apple/Downloads/nfl/nfl_data.csv'
user_data = pd.read_csv(user_file_path)
# extract UK market
uk_data = user_data.loc[user_data['market'] == 'UK']
# uk_data['revenue_usd'].describe()
# user_data.customer_id = user_data.customer_id.astype(str)
uk_data = uk_data.drop(['Unnamed: 0', 'date', 'customer_id', 'audience', 'market', 'nflweek', 'platform'], axis=1)
# uk_data.head()
# Look at the distribution of revenue
uk_data.revenue_usd.hist(bins=20, alpha=0.5)
plt.title("Game Pass Europe Revenue Distribution")
plt.xlabel("Revenue($)")
plt.ylabel("Frequency")
5.1 Data Preprocessing
# Scaling
# Apart from revenue_usd every column here is yes/no or categorical, so revenue_usd is the only continuous variable
# Min-max normalisation is sensitive to outliers: after scaling, the outliers are squeezed into the fixed range, so it is a poor choice when the data contain outliers.
# Standardisation is not bounded by the data's range, so standardisation is used here.
def normalize(data, column):
for col in column:
data['normalize_'+col] = (data[col] - np.min(data[col])) / (np.max(data[col]) - np.min(data[col]))
return data
def standardize(data, column):
for col in column:
data['standardize_'+col] = (data[col] - np.mean(data[col])) / (np.std(data[col]))
return data
columns = ['revenue_usd']
uk_data = standardize(uk_data, columns)
# A few more activation-style transforms for reference; each squashes values into some interval, and they differ in which ranges they are sensitive to
def tanh(data, column):
for col in column:
data['tanh_'+col] = np.tanh(data[col])
return data
def sigmoid(data, column):
for col in column:
data['sigmoid_'+col] = 1.0 / (1.0 + np.exp(-data[col]))  # logistic sigmoid: 1 / (1 + e^(-x))
return data
def leakyrelu(data, column, a=0.01):  # a is the small slope applied to negative values
for col in column:
data['leakyrelu_'+col] = np.array([x if x > 0 else a * x for x in data[col]])
return data
def softplus(data, column):
for col in column:
data['softplus_'+col] = np.log(np.exp(data[col]) + 1)
return data
uk_data.standardize_revenue_usd.hist(bins=20, alpha=0.5)
plt.title("Game Pass Europe Revenue Distribution")
plt.xlabel("Revenue($)")
plt.ylabel("Frequency")
# Feature type issues
# Categorical encoding - dummy coding
# Dummy variables easily inflate the dimensionality of the feature space
dummy_data = pd.get_dummies(
uk_data,columns=['segment', 'gender', 'age_group', 'sku'],
prefix=['segment', 'gender', 'age_group', 'sku'],prefix_sep="_"
)
# uk_data = uk_data.drop(['gender_U'], axis=1)
# Convert the continuous value into categories
conditions = [
(uk_data['revenue_usd'] == 0),
(uk_data['revenue_usd'] <= 13),
(uk_data['revenue_usd'] <= 98)
]
choices = [0,1,2]
dummy_data['revenue_category'] = np.select(conditions, choices, default= 3)
uk_data['revenue_category'] = np.select(conditions, choices, default= 3)
fig, ax = plt.subplots()
dummy_data['revenue_category'].value_counts().plot(ax=ax, kind='bar')
5.2 Feature Selection
# dummy_data.columns
select_feature = ['tickets', 'shop', 'fantasy', 'new_to_database', 'email_opt_in',
'favourite_team', 'buy_type', 'converted_free_trial', 'segment_Acq',
'segment_Ret', 'segment_iOS', 'gender_F', 'gender_M', 'gender_U',
'age_group_18-21', 'age_group_22-25', 'age_group_26-30',
'age_group_31-35', 'age_group_36-40', 'age_group_41-50',
'age_group_51-60', 'age_group_60+', 'age_group_Under 18',
'age_group_Unknown', 'sku_Essential', 'sku_Free', 'sku_Playoffs',
'sku_Pro', 'sku_Super Bowl', 'sku_Weekly']
# Variance threshold method
from sklearn.feature_selection import VarianceThreshold
varianceThreshold = VarianceThreshold(threshold = 0.2)
varianceThreshold.fit_transform(dummy_data[select_feature])
var_result = varianceThreshold.get_support()
# Correlation-based method
# Pick the basic features first, then match the rest
# Mind the logical relationships between features
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
selectKBest = SelectKBest(f_regression, k=10)
feature = dummy_data[select_feature]
bestFeature = selectKBest.fit_transform(feature, dummy_data[['revenue_usd']])
feature_result = selectKBest.get_support()
def feature_results(list_feature, list_result):
dic = {}
for i in range(len(list_feature)):
feature = list_feature[i]
result = list_result[i]
dic[feature] = result
result_tuple = sorted(dic.items(), key=lambda kv: kv[1])
return result_tuple
var_results = feature_results(select_feature, var_result)
weights_results = feature_results(select_feature, feature_result)
print(var_results[-5:-1])
print(weights_results[-10:-1])
5.3 Dimensionality Reduction
# Logical dimensionality reduction
# Process the original features based on the variance and correlation results above
# Merge tickets, shop, fantasy, new_to_database and email_opt_in to strengthen the signal
# Ideally the merge would be weighted, as sketched in the commented example below
# uk_data.sku.unique()
uk_data['user_behaviour'] = uk_data['tickets'] + uk_data['shop'] + uk_data['fantasy'] + uk_data['new_to_database'] + uk_data['email_opt_in']
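# (Hypothetical) a weighted version could look like this, with weights chosen by
# domain judgement rather than taken from the data:
# behaviour_weights = {'tickets': 2, 'shop': 2, 'fantasy': 1, 'new_to_database': 1, 'email_opt_in': 1}
# uk_data['user_behaviour'] = sum(w * uk_data[col] for col, w in behaviour_weights.items())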
# Split age_group into young / adult / old: 0-21, 22-40, 40+
uk_data['age'] = np.where((uk_data['age_group']=='Under 18') | (uk_data['age_group']=='18-21'), 'young',
np.where((uk_data['age_group']=='22-25') | (uk_data['age_group']=='26-30') | (uk_data['age_group']=='31-35') | (uk_data['age_group']=='36-40'), 'adult',
'old'))
# sku, buy_type & converted_free_trial describe purchase behaviour, so they belong with the target variables
uk_data['sku_category'] = np.where((uk_data['sku']=='Pro') & (uk_data['buy_type']== 1), 'Pro-BuyNow',
np.where((uk_data['sku']=='Pro') & (uk_data['buy_type']== 0) & (uk_data['converted_free_trial']== 0), 'Pro-FreeTrial-NoConvert',
np.where((uk_data['sku']=='Pro') & (uk_data['buy_type']== 0) & (uk_data['converted_free_trial']== 1), 'Pro-FreeTrial-Convert',
uk_data['sku'])))
uk_data = uk_data.drop(['tickets', 'shop', 'fantasy', 'new_to_database', 'email_opt_in',
'sku', 'buy_type', 'converted_free_trial', 'age_group', 'standardize_revenue_usd'], axis=1)
uk_data = uk_data[['transaction_id', 'age','gender', 'favourite_team', 'user_behaviour', 'segment', 'revenue_usd', 'sku_category', 'revenue_category']]
uk_data.head(10)
# Dummy encoding, for models that do not accept categorical data
dummy_data = pd.get_dummies(
uk_data,columns=['age', 'gender', 'segment'],
prefix=['age', 'gender', 'segment'],prefix_sep="_"
)
# Join the dummy-encoded data back to the original categorical columns
feature_data = pd.merge(dummy_data, uk_data[['transaction_id', 'age', 'gender', 'segment']], on='transaction_id', how='inner')
feature_data.head()
# Pivot table
"""
feature_data['transaction_count'] = 1
pd.pivot_table(feature_data,
columns=["age"],
index = ['favourite_team'],
values=['revenue_usd', 'transaction_count'],
aggfunc=[np.mean,np.sum])
"""
# Correlation visualisation
"""
variables = ['favourite_team', 'user_behaviour', 'revenue_usd', 'age_adult', 'age_old', 'age_young',
'gender_F', 'gender_M', 'segment_Acq', 'segment_Ret', 'segment_iOS']
sns.set()
sns.pairplot(feature_data[variables], size = 2.5)
plt.show()
"""
# Multicollinearity check
"""
x = feature_data[['favourite_team', 'user_behaviour', 'age_adult', 'age_old', 'age_young',
'gender_F', 'gender_M', 'segment_Acq', 'segment_Ret', 'segment_iOS']]
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
vif["features"] = x.columns
vif.round(1)
feature_data.to_csv(r'/Users/apple/Downloads/nfl/feature_data.csv', index=True)
"""
# Algorithm experiments
# Evaluation setup - train/validation split
from sklearn.model_selection import train_test_split
x = feature_data[['favourite_team', 'user_behaviour', 'age_adult', 'age_old', 'age_young',
'gender_F', 'gender_M', 'segment_Acq', 'segment_Ret', 'segment_iOS']]
y1 = feature_data[['revenue_usd']] #regression
y2 = feature_data[['revenue_category']] #classification
x_train1, x_val1, y_train1, y_val1 = train_test_split(x, y1, test_size=0.2, random_state=1)
x_train2, x_val2, y_train2, y_val2 = train_test_split(x, y2, test_size=0.2, random_state=1)
print("the number of data for training:")
print(y_train1.count())
print("the number of data for validation:")
print(y_val1.count())
# Metrics - accuracy and RMSE
from sklearn.metrics import mean_squared_error
def rmse_model(model, x, y):
predictions = model.predict(x)
rmse = np.sqrt(mean_squared_error(predictions, y))
return rmse
"""
from sklearn import metrics
def confusion_matrix(model, x, y):
model_confusion_test = metrics.confusion_matrix(y, model.predict(x))
matrix = pd.DataFrame(data = model_confusion_test, columns = ['Predicted 0', 'Predicted 1', 'Predicted 2', 'Predicted 3'],
index = ['Actual 0', 'Actual 1', 'Actual 2', 'Actual 3'])
return matrix
"""
# regression
# Due to compute and running-time constraints, the code is included here but not executed
"""
from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression()
linear_regression.fit(x_train1, y_train1)
print(rmse_model(linear_regression, x_val1, y_val1))
"""
# bias-variance trade-off
"""
from sklearn.preprocessing import PolynomialFeatures
train_rmses = []
val_rmses = []
degrees = range(1,8)
for i in degrees:
poly = PolynomialFeatures(degree=i, include_bias=False)
x_train_poly = poly.fit_transform(x_train1)
poly_reg = LinearRegression()
poly_reg.fit(x_train_poly, y_train1)
# training RMSE
y_train_pred = poly_reg.predict(x_train_poly)
train_poly_rmse = np.sqrt(mean_squared_error(y_train1, y_train_pred))
train_rmses.append(train_poly_rmse)
# validation RMSE
x_val_poly = poly.fit_transform(x_val1)
y_val_pred = poly_reg.predict(x_val_poly)
val_poly_rmse = np.sqrt(mean_squared_error(y_val1, y_val_pred))
val_rmses.append(val_poly_rmse)
print('degree = %s, training RMSE = %.2f, validation RMSE = %.2f' % (i, train_poly_rmse, val_poly_rmse))
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(degrees, train_rmses,label= 'training set')
ax.plot(degrees, val_rmses,label= 'validation set')
ax.set_yscale('log')
ax.set_xlabel('Degree')
ax.set_ylabel('RMSE')
ax.set_title('Bias/Variance Trade-off')
plt.legend()
plt.show()
"""
# regularization in order to reduce the effect of overfitting
# ridge (the lasso and elasticnet code is similar; lasso and elasticnet shrink coefficients more aggressively and can zero features out entirely, whereas ridge keeps collinear features and only shrinks them)
"""
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
rmse=[]
alpha=[1, 2, 5, 10, 20, 30, 40, 50, 75, 100]
for a in alpha:
ridge = make_pipeline(PolynomialFeatures(4), Ridge(alpha=a))
ridge.fit(x_train1, y_train1)
predict=ridge.predict(x_val1)
rmse.append(np.sqrt(mean_squared_error(predict, y_val1)))
print(rmse)
plt.scatter(alpha, rmse)
alpha=np.arange(20, 60, 2)
rmse=[]
for a in alpha:
#ridge=Ridge(alpha=a, copy_X=True, fit_intercept=True)
#ridge.fit(x_train1, y_train1)
ridge = make_pipeline(PolynomialFeatures(4), Ridge(alpha=a))
ridge.fit(x_train1, y_train1)
predict=ridge.predict(x_val1)
rmse.append(np.sqrt(mean_squared_error(predict, y_val1)))
print(rmse)
plt.scatter(alpha, rmse)
ridge = make_pipeline(PolynomialFeatures(4), Ridge(alpha=24.6))
ridge_model = ridge.fit(x_train1, y_train1)
predictions = ridge_model.predict(x_val1)
print("Ridge RMSE is: " + str(rmse_model(ridge_model, x_val1, y_val1)))
"""
# classification
list(y_train2['revenue_category'].unique())
# decision tree
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
decision_tree_model = DecisionTreeClassifier(criterion='entropy')
decision_tree_model.fit(x_train2, y_train2)
print(decision_tree_model.score(x_train2,y_train2))
print(decision_tree_model.score(x_val2,y_val2))
# tuning
train_score = []
val_score = []
for depth in np.arange(1,20):
decision_tree = tree.DecisionTreeClassifier(max_depth = depth,min_samples_leaf = 5)
decision_tree.fit(x_train2, y_train2)
train_score.append(decision_tree.score(x_train2, y_train2))
val_score.append(decision_tree.score(x_val2, y_val2))
plt.plot(np.arange(1,20),train_score)
plt.plot(np.arange(1,20),val_score)
plt.legend(['Training Accuracy','Validation Accuracy'])
plt.title('Decision Tree Tuning')
plt.xlabel('Depth')
plt.ylabel('Accuracy')
train_score = []
val_score = []
for leaf in np.arange(20,100):
decision_tree = tree.DecisionTreeClassifier(max_depth = 10, min_samples_leaf = leaf)
decision_tree.fit(x_train2, y_train2)
train_score.append(decision_tree.score(x_train2, y_train2))
val_score.append(decision_tree.score(x_val2, y_val2))
plt.plot(np.arange(20,100),train_score)
plt.plot(np.arange(20,100),val_score)
plt.legend(['Training Accuracy','Validation Accuracy'])
plt.title('Decision Tree Tuning')
plt.xlabel('Minimum Samples Leaf')
plt.ylabel('Accuracy')
my_decision_tree_model = DecisionTreeClassifier(max_depth = 10, min_samples_leaf = 20)
my_decision_tree_model.fit(x_train2, y_train2)
print(my_decision_tree_model.score(x_train2,y_train2))
print(my_decision_tree_model.score(x_val2,y_val2))
# confusion matrix
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support
y_predict = my_decision_tree_model.predict(x_val2)
cm = confusion_matrix(y_val2, y_predict)
# Transform to df for easier plotting
cm_df = pd.DataFrame(cm,
index = ['free', 'low', 'medium', 'high'],   # revenue_category 0-3 in order
columns = ['free', 'low', 'medium', 'high'])
plt.figure(figsize=(5.5,4))
sns.heatmap(cm_df, annot=True)
plt.title('Decision Tree \nAccuracy:{0:.3f}'.format(accuracy_score(y_val2, y_predict)))
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
# learning curve
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(DecisionTreeClassifier(max_depth = 10, min_samples_leaf = 20),
x,
y2,
# Number of folds in cross-validation
cv=5,
# Evaluation metric
scoring='accuracy',
# Use all computer cores
n_jobs=-1,
# 50 different sizes of the training set
train_sizes=np.linspace(0.1, 1.0, 5))
# Create means and standard deviations of training set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
# Create means and standard deviations of validation set scores
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)
# Draw lines
plt.plot(train_sizes, train_mean, '--', color="#ff8040", label="Training score")
plt.plot(train_sizes, val_mean, color="#40bfff", label="Cross-validation score")
# Draw bands
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="#DDDDDD")
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, color="#DDDDDD")
# Create plot
plt.title("Learning Curve \n k-fold=5, number of neighbours=5")
plt.xlabel("Training Set Size"), plt.ylabel("Accuracy Score"), plt.legend(loc="best")
plt.tight_layout()
plt.show()
# Curse of Dimensionality
d_train = []
d_val = []
for i in range(1,9):
X_train_index = x_train2.iloc[: , 0:i]
X_val_index = x_val2.iloc[: , 0:i]
classifier = DecisionTreeClassifier(max_depth = 10, min_samples_leaf = 20)
dt_model = classifier.fit(X_train_index, y_train2.values.ravel())
d_train.append(dt_model.score(X_train_index, y_train2))
d_val.append(dt_model.score(X_val_index, y_val2))
plt.title('Decision Tree Curse of Dimensionality')
plt.plot(range(1,9),d_val,label="Validation")
plt.plot(range(1,9),d_train,label="Train")
plt.xlabel('Number of Features')
plt.ylabel('Score (Accuracy)')
plt.legend()
plt.xticks(range(1,9))
plt.show()
# The predictions are poor; this may be down to the feature selection, or the algorithm may simply not be a good fit.
6 User Segmentation and Recommendations
6.1 User Segmentation
With feature engineering done, the next step is to segment the users and propose different strategies for the different groups. Using an Excel PivotTable and visualisation, I built a matrix with favourite team, NFL consumption behaviour, user segment, age, and gender as the comparison dimensions, and (average) purchase amount and purchase count as the target metrics; a pandas equivalent of that PivotTable is sketched below.
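The sketch below rebuilds the same cross-tab in pandas, assuming feature_data has been exported as in the commented to_csv line at the end of Section 5.3 (the file path is illustrative).

import pandas as pd
import numpy as np

# Illustrative path - the feature_data export from Section 5.3
feature_data = pd.read_csv('/Users/apple/Downloads/nfl/feature_data.csv')

# One row per transaction, so a constant column summed gives the purchase count
feature_data['transaction_count'] = 1

# Average purchase amount and purchase count across the segmentation dimensions
segment_matrix = pd.pivot_table(
    feature_data,
    index=['favourite_team', 'user_behaviour', 'segment'],
    columns=['age', 'gender'],
    values=['revenue_usd', 'transaction_count'],
    aggfunc={'revenue_usd': np.mean, 'transaction_count': np.sum}
)
print(segment_matrix.round(2))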
6.2 Marketing Recommendations
Given the UK's position in the Game Pass Europe market strategy, the goal should be to get more and more people engaged with the NFL while also improving ROI. Budget aside, the focus should therefore be on improving spend and user experience across the four groups below.
¶ For star users (middle-aged and older men with a favourite team), maintain and improve the product experience, and organise fan events to deepen the relationship and increase willingness to spend.
¶ For cheap-package users (new male users without a favourite team), they are still on the fence about Game Pass, or a weak sense of belonging is holding back their spending (they like watching games but mostly look for free packages). Online and offline fan-engagement events, or in-season price promotions, can raise purchase intent. Their purchasing power may also cap what they spend, but advertising can still generate indirect traffic revenue for Game Pass from this group.
¶ For "one big purchase only" users, spending power is clearly not the issue. Research should ask why they were willing to pay a premium yet bought only once, and how the product experience can be improved so that they purchase more often.
¶ For dog-quadrant users (middle-aged and older women, or younger users, without a favourite team), the first task is grassroots NFL marketing to this group (if they do not like the NFL, why would they use Game Pass?): NFL campaigns aimed at women and young people (to raise awareness), repeated promotional events (to build familiarity and encourage participation), and PR activity, so as to lift their awareness of what the NFL offers and of its brand values.