Data Mining in Practice - Tianchi Newcomer Competition: O2O Coupon Usage Prediction

Posted by 很隨便的wei on 2021-12-14

Original article: https://www.cnblogs.com/boostwei/p/15689568.html

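I. Preface

Hi everyone. Today is 2021/12/14, and my last update was on 2021/08/29. In the previous post I said I would start two series, and sure enough I flaked, for more than three months. Today I'm not flaking (well, I still am). The two series on ResNet and GoogLeNet will have to wait a little longer (they will definitely be out in January). This post is an introductory, hands-on data mining example. The problem and dataset come from the Tianchi newcomer competition "O2O Coupon Usage Prediction"; the competition page is: https://tianchi.aliyun.com/competition/entrance/231593/introduction?spm=5176.12281973.1005.2.3dd52448rilGd8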

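II. About the Competition

The main task is to analyze and model the provided data in order to predict, for coupons that users received in July 2016, whether each coupon will be used within 15 days of receipt. The official dataset consists of:

ccf_offline_stage1_test_revised.csv: samples for offline coupon usage prediction
ccf_offline_stage1_train.zip: users' offline purchases and coupon receipts
ccf_online_stage1_train.zip: users' online clicks/purchases and coupon receipts
sample_submission.csv: submission format

For detailed field descriptions, see https://tianchi.aliyun.com/competition/entrance/231593/information. The evaluation metric and other details are on the official Tianchi competition page, so I won't repeat them here.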

III. Code Walkthrough

1. Import third-party libraries and read the data

import os, sys, pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import date
from sklearn.linear_model import SGDClassifier, LogisticRegression
import seaborn as sns
# Display Chinese characters in matplotlib (requires the SimHei font)
plt.rcParams['font.sans-serif'] = [u'SimHei']
plt.rcParams['axes.unicode_minus'] = False

dfoff = pd.read_csv('./ccf_offline_stage1_train.csv')
dftest = pd.read_csv('./ccf_offline_stage1_test_revised.csv')
dfon = pd.read_csv('./ccf_online_stage1_train.csv')
print('data read end.')

2. A quick look at the data

# Take a quick look at the data
print("Shape of dfoff:", dfoff.shape)
print("Shape of dftest:", dftest.shape)
print("Shape of dfon:", dfon.shape)
print(dfoff.describe())
print(dftest.describe())
print(dfon.describe())

dfoff.head()

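3. Fields of the offline consumption and coupon-receipt data

User_id: user ID
Merchant_id: merchant ID
Coupon_id: null means a purchase without a coupon, in which case Discount_rate and Date_received are meaningless. "fixed" means the transaction is a limited-time low-price promotion
Discount_rate: discount rate. \(x\in [0,1]\) is a direct discount rate; x:y means "spend x, get y off"; fixed means a limited-time low-price promotion
Distance: the distance from the places the user frequents to the merchant's nearest store is x*500 meters (for a chain, the nearest store is used), \(x\in[0,10]\); null means no information, 0 means less than 500 meters, 10 means more than 5 km
Date_received: date the coupon was received
Date: purchase date. If Date=null & Coupon_id != null, the coupon was received but not used; if Date!=null & Coupon_id = null, it is the date of an ordinary purchase; if Date!=null & Coupon_id != null, it is the date the coupon was used for a purchase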
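
The three cases in the `Date` description can be sketched as a small helper. This is only an illustration: `classify_record` is a hypothetical name, and `None` stands in for the dataset's null values.

```python
def classify_record(coupon_id, date):
    """Classify one offline record by whether a coupon and a purchase date are present."""
    has_coupon = coupon_id is not None
    has_purchase = date is not None
    if has_coupon and not has_purchase:
        return "coupon received, not used"
    if not has_coupon and has_purchase:
        return "ordinary purchase"
    if has_coupon and has_purchase:
        return "purchase with coupon"
    return "unknown"  # neither field set; this case is not described in the field list


# Hypothetical records: (Coupon_id, Date)
for record in [("1078", None), (None, "20160528"), ("1078", "20160528")]:
    print(record, "->", classify_record(*record))
```

Only the third case can produce a positive training label, and only if the purchase falls within 15 days of `Date_received`.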

4. Simple feature engineering and data processing

Convert "spend xx, get yy off" coupons (xx:yy) into a discount rate, \(1 - \frac{yy}{xx}\), and derive four coupon-related features: discount_rate, discount_man, discount_jian, discount_type
Convert Distance from str to int
Fill in null values

def convertRate(row):
    """Convert a discount string to a rate; 'xx:yy' (spend xx, get yy off) becomes 1 - yy/xx."""
    if pd.isnull(row):
        return 1.0  # no coupon: treated as no discount
    elif ':' in str(row):
        rows = row.split(':')
        return 1.0 - float(rows[1])/float(rows[0])
    else:
        return float(row)

# Derive three more features from Discount_rate: the xx and the yy of "spend xx, get yy off",
# plus a flag for whether the coupon is of the xx:yy type.
def getDiscountMan(row):
    if ':' in str(row):
        rows = row.split(':')
        return int(rows[0])
    else:
        return 0
def getDiscountJian(row):
    if ':' in str(row):
        rows = row.split(':')
        return int(rows[1])
    else:
        return 0

def getDiscountType(row):
    # Returns NaN (no coupon), 1 ('xx:yy' threshold coupon) or 0 (plain discount rate)
    if pd.isnull(row):
        return np.nan
    elif ':' in str(row):  # str() guards against non-string values
        return 1
    else:
        return 0

def processData(df):
    # convert Discount_rate
    df['discount_rate'] = df['Discount_rate'].apply(convertRate)
    df['discount_man'] = df['Discount_rate'].apply(getDiscountMan)
    df['discount_jian'] = df['Discount_rate'].apply(getDiscountJian)
    df['discount_type'] = df['Discount_rate'].apply(getDiscountType)
    print("Unique values of discount_rate after processing:", df['discount_rate'].unique())
    # convert Distance: fill nulls with -1 and cast to int
    df['distance'] = df['Distance'].fillna(-1).astype(int)
    return df
dfoff = processData(dfoff)
dftest = processData(dftest)

print("tool is ok.")

At this point you may want to plot a frequency histogram of the discount-rate ranges yourself.
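
One possible way to get such a histogram is sketched below. It bins a small, made-up sample of rates with `numpy.histogram` and prints a text bar chart; the sample values are hypothetical, and on the real data the input would be `dfoff['discount_rate']`.

```python
import numpy as np

# Hypothetical sample of converted discount rates
# (on the real data, use dfoff['discount_rate'].to_numpy() instead).
rates = np.array([0.95, 0.93, 0.83, 0.867, 0.55, 0.97, 0.75, 0.85, 0.45, 0.88])

# Ten equal-width bins over [0, 1]
counts, edges = np.histogram(rates, bins=10, range=(0.0, 1.0))
freqs = counts / counts.sum()  # relative frequency per bin

# Text rendering of the histogram: one row per bin
for left, right, freq in zip(edges[:-1], edges[1:], freqs):
    bar = "#" * int(round(freq * 40))
    print(f"[{left:.1f}, {right:.1f}) {bar} {freq:.2f}")
```

With matplotlib available, `dfoff['discount_rate'].hist(bins=10)` should draw the graphical equivalent directly from the DataFrame.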

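5. Examine the Date_received and Date features

# Examine Date_received and Date, then process them as follows:

Extract the unique values of date_received and date and sort them
Build two new aggregates, couponbydate and buybydate:
couponbydate: for records with a coupon, the number of records per coupon-receipt date
buybydate: for records with both a coupon and a purchase, the number of records per coupon-receipt date
Convert the dates into year-month-day values
From the converted dates, extract weekday (1 = Monday to 7 = Sunday) and weekday_type (1 for weekends, 0 otherwise)
One-hot encode weekday
Extract the label y: -1 means no coupon was received, 1 means the coupon was used within 15 days, and 0 means a coupon was received but not used within 15 days (users who purchased without a coupon are not handled well)

# Process the coupon-receipt date feature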
date_received = dfoff['Date_received'].unique()
date_received = sorted(date_received[pd.notnull(date_received)]) # 提取出非空值的時間，並排序

# Process the purchase date feature
date_buy = dfoff['Date'].unique()
date_buy = sorted(date_buy[pd.notnull(date_buy)])

# For records with a coupon, count the records per receipt date.
# (count() on the Date column would only count non-null purchase dates, so count rows instead.)
couponbydate = dfoff['Date_received'].value_counts().sort_index().reset_index()
couponbydate.columns = ['Date_received','count']

# For records with both a coupon and a purchase, group by receipt date and count.
buybydate = dfoff[(dfoff['Date'].notnull()) & (dfoff['Date_received'].notnull())][['Date_received', 'Date']].groupby(['Date_received'], as_index=False).count()
buybydate.columns = ['Date_received','count']

def getWeekday(row):
    # Parse a 'YYYYMMDD...' string and return the weekday (1 = Monday ... 7 = Sunday)
    if row == 'nan':
        return np.nan
    else:
        return date(int(row[0:4]), int(row[4:6]), int(row[6:8])).weekday() + 1
dfoff['weekday'] = dfoff['Date_received'].astype(str).apply(getWeekday)
dftest['weekday'] = dftest['Date_received'].astype(str).apply(getWeekday)
# weekday_type: 1 for Saturday and Sunday, 0 otherwise
dfoff['weekday_type'] = dfoff['weekday'].apply(lambda x : 1 if x in [6,7] else 0 )
dftest['weekday_type'] = dftest['weekday'].apply(lambda x : 1 if x in [6,7] else 0 )

# One-hot encode weekday
weekdaycols = ['weekday_' + str(i) for i in range(1,8)]
tmpdf = pd.get_dummies(dfoff['weekday'].replace('nan', np.nan))
tmpdf.columns = weekdaycols
dfoff[weekdaycols] = tmpdf

tmpdf = pd.get_dummies(dftest['weekday'].replace('nan', np.nan))
tmpdf.columns = weekdaycols
dftest[weekdaycols] = tmpdf


def label(row):
    if pd.isnull(row['Date_received']):
        return -1  # no coupon received
    if pd.notnull(row['Date']):
        # Go through int -> str so that float-typed dates such as 20160528.0 parse correctly
        received = pd.to_datetime(str(int(row['Date_received'])), format='%Y%m%d')
        bought = pd.to_datetime(str(int(row['Date'])), format='%Y%m%d')
        if bought - received <= pd.Timedelta(15, 'D'):
            return 1  # coupon used within 15 days
    return 0  # coupon received but not used within 15 days
dfoff['label'] = dfoff.apply(label, axis = 1)


print("end")
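
The row-wise `label` function above is easy to read but can be slow on a large table, since `apply(..., axis=1)` calls Python once per row. A vectorized variant might look like the sketch below. The helper names `to_datetime_yyyymmdd` and `vectorized_label` are hypothetical, and the sketch assumes the DataFrame has a unique index and that dates are stored as numbers such as 20160528.0 with NaN for missing values.

```python
import numpy as np
import pandas as pd


def to_datetime_yyyymmdd(col):
    """Parse values such as 20160528.0 into timestamps; missing values become NaT."""
    as_text = col.dropna().astype(int).astype(str)
    return pd.to_datetime(as_text, format="%Y%m%d").reindex(col.index)


def vectorized_label(df, window_days=15):
    """Return -1 (no coupon), 1 (used within the window) or 0 (otherwise) for every row."""
    received = to_datetime_yyyymmdd(df["Date_received"])
    bought = to_datetime_yyyymmdd(df["Date"])
    # Comparisons involving NaT evaluate to False, so unused coupons fall through to 0
    used_in_window = (bought - received) <= pd.Timedelta(days=window_days)
    return np.where(received.isna(), -1, np.where(used_in_window, 1, 0))


# Hypothetical toy rows: no coupon, unused coupon, used on day 15, used on day 19
toy = pd.DataFrame({
    "Date_received": [np.nan, 20160101.0, 20160101.0, 20160101.0],
    "Date": [20160105.0, np.nan, 20160116.0, 20160120.0],
})
print(vectorized_label(toy).tolist())
```

On the real data this would be used as `dfoff['label'] = vectorized_label(dfoff)`, which should give the same labels as the row-wise version.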

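6. Visualize the correlation matrix of the processed offline data

corr = dfoff.select_dtypes(include='number').corr()  # numeric columns only; Discount_rate is a string column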
print(corr)
plt.subplots(figsize=(16, 16))
sns.heatmap(corr, vmax=.8, square=True, annot=True)

7. Split into training and validation sets

# Split into training and validation sets by coupon-receipt date
print("-----data split------")
df = dfoff[dfoff['label'] != -1].copy()
train = df[(df['Date_received'] < 20160516)].copy()
valid = df[(df['Date_received'] >= 20160516) & (df['Date_received'] <= 20160615)].copy()
print("end")

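8. Stochastic gradient descent (SGD)

# Features used by the models below (must be defined before the first fit)
original_feature = ['discount_rate','discount_type','discount_man', 'discount_jian','distance', 'weekday', 'weekday_type'] + weekdaycols

# Linear model trained with SGD
model = SGDClassifier(
    loss='log_loss',  # logistic loss; this option is named 'log' in scikit-learn < 1.1
    penalty='elasticnet',
    fit_intercept=True,
    max_iter=100,
    shuffle=True,
    alpha = 0.01,
    l1_ratio = 0.01,
    n_jobs=-1,
    class_weight=None
)
model.fit(train[original_feature], train['label'])
# Prediction and evaluation (score() reports accuracy)
print(model.score(valid[original_feature], valid['label']))
print("---save model---")
with open('1_model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('1_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Save the CSV file for submission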
y_test_pred = model.predict_proba(dftest[original_feature])
dftest1 = dftest[['User_id','Coupon_id','Date_received']].copy()
dftest1['label'] = y_test_pred[:,1]
dftest1.to_csv('submit1.csv', index=False, header=False)
dftest1.head()
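
Note that `model.score` above reports plain accuracy, which can look good on imbalanced labels even when the ranking of probabilities is poor. The grid search later in this post uses `scoring='roc_auc'`, so it may be worth checking AUC on the validation set as well. Below is a minimal pure-Python sketch of AUC. The per-group average is an assumption about a useful validation view (for example grouping by `Coupon_id`); check the competition page for the official metric definition. The function names are hypothetical.

```python
from collections import defaultdict


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc_by_group(groups, labels, scores):
    """Average AUC over groups, skipping groups that contain a single class."""
    buckets = defaultdict(lambda: ([], []))
    for g, y, s in zip(groups, labels, scores):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    values = [auc(ys, ss) for ys, ss in buckets.values()]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None


# Hypothetical labels and predicted probabilities
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic, so it only suits small samples; on the full validation set `sklearn.metrics.roc_auc_score` is the practical choice.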

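9. Bagging: an ensemble of 500 decision trees, each trained on 100 randomly sampled training instances

# Ensemble of 500 decision trees, each trained on 100 randomly sampled training instances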
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

original_feature = ['discount_rate','discount_type','discount_man', 'discount_jian','distance', 'weekday', 'weekday_type'] + weekdaycols
print("----train-----")
model = BaggingClassifier(
    DecisionTreeClassifier(),n_estimators=500,max_samples=100,bootstrap=True,n_jobs=-1
)
model.fit(train[original_feature], train['label'])

# Prediction and evaluation
print(model.score(valid[original_feature], valid['label']))

print("---save model---")
with open('1_model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('1_model.pkl', 'rb') as f:
    model = pickle.load(f)

# test prediction for submission
y_test_pred = model.predict_proba(dftest[original_feature])
dftest1 = dftest[['User_id','Coupon_id','Date_received']].copy()
dftest1['label'] = y_test_pred[:,1]
dftest1.to_csv('submit2.csv', index=False, header=False)
dftest1.head()

Compared with the previous SGD model, this approach improved the Tianchi submission score by about two thousandths.
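
To make the `n_estimators=500, max_samples=100, bootstrap=True` settings more concrete, here is a minimal pure-Python sketch of the bagging idea: every base model sees a small bootstrap sample, and the ensemble averages their outputs. The "base model" here is just the positive rate of its sample, a deliberately trivial stand-in for a decision tree, and all names and data are hypothetical.

```python
import random


def bootstrap_sample(rows, max_samples, rng):
    """Draw max_samples rows with replacement (the bootstrap=True behaviour)."""
    return [rows[rng.randrange(len(rows))] for _ in range(max_samples)]


def fit_base_model(sample):
    """Trivial stand-in for a decision tree: predict the sample's positive rate."""
    return sum(label for _, label in sample) / len(sample)


def bagging_predict(rows, n_estimators=500, max_samples=100, seed=0):
    """Average the estimates of n_estimators base models, each fit on its own bootstrap sample."""
    rng = random.Random(seed)
    estimates = [
        fit_base_model(bootstrap_sample(rows, max_samples, rng))
        for _ in range(n_estimators)
    ]
    return sum(estimates) / len(estimates)


# Hypothetical data: (feature, label) pairs with a 10% positive rate
rows = [(i, 1 if i % 10 == 0 else 0) for i in range(1000)]
print(round(bagging_predict(rows), 3))
```

Each individual estimate is noisy because it only sees 100 rows, but the average over 500 of them lands close to the true 10% rate, which is the variance-reduction effect bagging relies on.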

10. Boosting + grid search

# Boosting
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(
    max_depth=2,
    n_estimators=100, # too few tends to underfit, too many tends to overfit
    learning_rate=0.1)
model.fit(train[original_feature], train['label'])

# Tune with grid search. The online score did not improve much, but overfitting was greatly reduced.
from sklearn.model_selection import GridSearchCV
param_test1 = {'n_estimators':range(20,81,10)}
gsearch1 = GridSearchCV(
    estimator = GradientBoostingClassifier(
        learning_rate=0.1, min_samples_split=300,
        min_samples_leaf=20,
        max_depth=8,
        max_features='sqrt',
        subsample=0.8,
        random_state=10),
    param_grid = param_test1, scoring='roc_auc',cv=5,n_jobs=-1)
gsearch1.fit(train[original_feature], train['label'])

# gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_
print(gsearch1.score(valid[original_feature], valid['label']))
print("---save model---")
with open('1_model.pkl', 'wb') as f:
    pickle.dump(gsearch1, f)
with open('1_model.pkl', 'rb') as f:
    model = pickle.load(f)

# test prediction for submission
y_test_pred = gsearch1.predict_proba(dftest[original_feature])
dftest1 = dftest[['User_id','Coupon_id','Date_received']].copy()
dftest1['label'] = y_test_pred[:,1]
dftest1.to_csv('submit6.csv', index=False, header=False)
dftest1.head()

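IV. Summary

This post works through the Tianchi newcomer competition on O2O coupon usage prediction. After exploring and analyzing the data, we processed the Discount_rate, Distance, Date_received and Date fields into numeric data that a model can train on, and derived the new features discount_rate, discount_man, discount_jian, weekday and weekday_type along with the couponbydate and buybydate aggregates. We one-hot encoded weekday, and finally built the label y from whether a coupon was received and whether it was used within 15 days: -1 for no coupon, 1 for used within 15 days, and 0 otherwise.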

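Judging from the experiments, after feature extraction the SGD and Bagging models overfit, while Boosting plus grid search eased the overfitting considerably. This suggests Boosting is the better fit here: each new learner is trained on the residuals of the current ensemble, which leads to a better model. (I do think SGD could work too with some parameter tuning; its use in this experiment was too crude.)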
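
The "train the next learner on the residuals" idea can be sketched for squared loss with one-split regression stumps. This is a toy illustration in pure Python under simplifying assumptions (a single feature, squared error), not a description of how `GradientBoostingClassifier` works internally; all names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold on x that minimises the squared error of a two-leaf predictor."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # all x identical: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return float("inf"), mean, mean
    return best[1:]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit stumps sequentially, each one to the residuals left by the ensemble so far."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still unexplained
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return base, stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical one-feature data set
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
base, stumps, preds = boost(xs, ys)
print(mse(ys, [base] * len(ys)), mse(ys, preds))
```

The second printed error is lower than the first: every round removes part of the remaining residual, and the learning rate controls how much of each stump's correction is applied.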

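Because the work was done in a hurry, the model still has plenty of room for improvement. First, on the feature side, we could use clustering or principal component analysis to examine how the features relate to each other and reduce the dimensionality of highly correlated ones. We could also dig deeper into feature relationships and build richer features, for example by combining users' online and offline features with user-merchant interaction features. In addition, the choice of label on the training set needs work: we only marked records without a coupon as -1 and purchases within 15 days as 1, without considering users who purchased within 15 days without a coupon, or other more complex cases.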

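Furthermore, for feature selection, following an idea from a top Tianchi competitor, we could deliberately overfit: train on 100% of the data, test on the same 100%, and watch the AUC. If the AUC is far from 1, there are not enough features, so keep exploring new ones until it approaches 1. Once the overfit training is done, output the feature importances, drop the low-importance features, and repeat the overfit training while keeping the AUC roughly unchanged. The result is the smallest feature set that still captures the characteristics of the data.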
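
A rough sketch of that overfit-then-prune loop, using scikit-learn's `GradientBoostingClassifier` and `roc_auc_score` on synthetic data, is shown below. The helper names, the importance threshold and the AUC tolerance are all assumptions chosen for illustration, not values from the original approach.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score


def fit_and_score(X, y, cols):
    """Fit on all rows using the given columns and return the model with its training AUC."""
    model = GradientBoostingClassifier(random_state=0).fit(X[:, cols], y)
    return model, roc_auc_score(y, model.predict_proba(X[:, cols])[:, 1])


def prune_features(X, y, names, min_importance=0.01, auc_tolerance=0.005):
    """Drop low-importance features while the training AUC stays roughly unchanged."""
    keep = list(range(X.shape[1]))
    model, auc = fit_and_score(X, y, keep)
    while len(keep) > 1:
        strong = [c for c, imp in zip(keep, model.feature_importances_) if imp >= min_importance]
        if not strong or len(strong) == len(keep):
            break  # nothing left to drop
        new_model, new_auc = fit_and_score(X, y, strong)
        if new_auc < auc - auc_tolerance:
            break  # dropping these features costs too much AUC
        keep, model, auc = strong, new_model, new_auc
    return [names[c] for c in keep], auc


# Synthetic data: only the first column carries signal, the rest is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)
kept, train_auc = prune_features(X, y, ["x0", "noise1", "noise2", "noise3"])
print(kept, round(train_auc, 3))
```

Because the AUC here is measured on the training data itself, it only indicates whether the features can describe the data at all; a held-out set is still needed to judge generalization.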

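The dataset split in this experiment was also very simple: the data was just divided into two sets in chronological order, which has serious drawbacks (but it was the simplest option given the time). If external conditions fluctuate a lot during this period, the balance of positive and negative samples can be strongly affected. Later I will try cross-validation at the initial data-splitting stage, not only during model training, which may make fuller use of the information in the dataset and improve the model's generalization.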
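
One way to combine cross-validation with the time ordering is forward chaining: each fold trains on an initial stretch of dates and validates on the block that follows. The sketch below is an assumption about how this could be done, with a hypothetical function name and toy dates; scikit-learn's `TimeSeriesSplit` offers a similar scheme on row positions.

```python
def forward_chaining_splits(dates, n_folds=3):
    """Yield (train_dates, valid_dates) pairs where validation dates always follow training dates."""
    unique_dates = sorted(set(dates))
    block = len(unique_dates) // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough distinct dates for the requested number of folds")
    for k in range(1, n_folds + 1):
        cut = k * block
        # The last fold absorbs any leftover dates
        end = cut + block if k < n_folds else len(unique_dates)
        yield unique_dates[:cut], unique_dates[cut:end]


# Hypothetical receipt dates: 12 consecutive days
dates = [20160101 + i for i in range(12)]
for train_dates, valid_dates in forward_chaining_splits(dates):
    print(train_dates[0], "-", train_dates[-1], "| validate on", valid_dates[0], "-", valid_dates[-1])
```

Rows can then be selected per fold with something like `df[df['Date_received'].isin(valid_dates)]`, and the scores averaged across folds.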

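On the modeling side, I should also try the XGBoost ensemble model. The idea behind ensembles, "three cobblers with their wits combined equal Zhuge Liang", has proven very successful, and many high scorers used XGBoost. I will also go further with grid search, narrowing the range each time based on the previous search's result until the best parameters are found.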
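
The narrowing grid search can be sketched for a single numeric parameter as follows. `refine_search` and the toy scoring function are hypothetical; in practice `score_fn` would wrap a cross-validated model score, and integer parameters such as `n_estimators` would need rounding.

```python
def refine_search(score_fn, low, high, n_points=5, n_rounds=3):
    """Coarse-to-fine 1-D search: each round re-grids around the best point found so far."""
    best_x, best_score = None, float("-inf")
    for _ in range(n_rounds):
        step = (high - low) / (n_points - 1)
        for i in range(n_points):
            x = low + i * step
            score = score_fn(x)
            if score > best_score:
                best_x, best_score = x, score
        # Shrink the range to one grid step either side of the current best
        low, high = max(low, best_x - step), min(high, best_x + step)
    return best_x, best_score


def toy_score(n):
    """Hypothetical validation score that peaks at n = 63."""
    return -(n - 63) ** 2


best_x, best_score = refine_search(toy_score, 20, 80)
print(best_x, best_score)
```

Three rounds of five points evaluate 15 candidates in total, yet end up closer to the peak than a single 15-point grid over the same range would guarantee, at the risk of missing a narrow optimum that the coarse grid skipped.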

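Anyway, this post is pitched at beginners: a rough tour of data processing and cleaning, feature engineering, model training, and the very impressive ensemble learning. If you want a higher score in this competition, look for posts shared by other experts on the Tianchi forum.

I have put the dataset and code on a network drive for anyone who needs them: link: https://pan.baidu.com/s/1CZB8fErDygtdFc5TWvIk9w extraction code: er1b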
