Click-Through Rate Prediction with Python

Published by TechSynapse on 2024-06-18

Original URL: https://www.cnblogs.com/TS86/p/18254757

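Click-through rate (CTR) prediction is a key component of recommender systems, advertising systems, and search engines. The task is to estimate the probability that a user clicks a particular item (such as an ad or a recommended product), based on the user's historical behavior, the item's features, and contextual information.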

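1. Click-through rate prediction

Below is a simplified CTR prediction example that uses the Python machine learning library scikit-learn. Note that production CTR models are usually more complex and may rely on deep learning frameworks such as TensorFlow or PyTorch.

1.1 Data preparation

First, we need a dataset that contains user features, item features, and click outcomes. To keep things simple, we assume a dataset with only a user ID, an item ID, and a clicked flag (0 or 1).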

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Made-up sample data
data = {
    'user_id': ['A', 'B', 'C', 'A', 'B', 'C'],
    'item_id': [1, 2, 3, 2, 3, 1],
    'clicked': [1, 0, 1, 1, 0, 1]
}
df = pd.DataFrame(data)

# Separate the features from the label
X = df[['user_id', 'item_id']]
y = df['clicked']

# Split into training and test sets
# (with only 6 rows, the test set holds just 2 samples; real data would be far larger)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

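1.2 Feature engineering

User IDs and item IDs are categorical variables, so they must be converted to numeric form before the model can use them. LabelEncoder is intended for encoding target labels rather than input features, so we use OneHotEncoder on its own here. This assumes the number of distinct users and items is small enough for one-hot encoding to be practical.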

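# Feature engineering: one-hot encode the categorical variables
categorical_features = ['user_id', 'item_id']
categorical_transformer = Pipeline(steps=[
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Define the preprocessing step
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ])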

1.3 Model training

We use logistic regression as the prediction model.

# Define the model: preprocessing followed by a logistic regression classifier
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', LogisticRegression(solver='liblinear', max_iter=1000))])

# Train the model
model.fit(X_train, y_train)

1.4 Model evaluation

We use AUC-ROC as the evaluation metric.

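# Predict the click probability (column 1 is the probability of class 1)
y_pred_prob = model.predict_proba(X_test)[:, 1]

# Compute AUC-ROC
# (only meaningful when y_test contains both clicked and non-clicked samples)
auc = roc_auc_score(y_test, y_pred_prob)
print(f'AUC-ROC: {auc}')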
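AUC-ROC compares the scores of positive and negative samples, so the test set needs to contain both classes. With very small or imbalanced data a random split may not guarantee that. One possible way to handle it, sketched below on hypothetical toy data, is to pass `stratify=y` to `train_test_split` so that class proportions are preserved in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 impressions, half of them clicked
df = pd.DataFrame({
    'user_id': list('ABCDE') * 4,
    'item_id': [1, 2, 3, 4] * 5,
    'clicked': [1, 0] * 10,
})
X = df[['user_id', 'item_id']]
y = df['clicked']

# stratify=y keeps the clicked / not-clicked ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```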

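1.5 Notes and extensions

(1) Feature engineering: In real applications feature engineering is a critical step. It determines how effectively information useful for prediction is extracted from the raw data.

(2) Model selection: Logistic regression is simple and effective, but more complex scenarios may call for more complex models, such as deep learning models.

(3) Hyperparameter optimization: The choice of hyperparameters has a large effect on model performance. Methods such as grid search and random search can be used to tune them.

(4) Real-time updates: CTR models usually need to be updated frequently, ideally in near real time, to reflect the latest user behavior and item features.

(5) Evaluation metrics: Besides AUC-ROC, other metrics such as accuracy, recall, and F1 score can be used, depending on business requirements.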
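As an illustration of note (3), the following is a minimal sketch of a grid search over the regularization strength `C` of a logistic regression pipeline, scored by cross-validated AUC-ROC. The dataset is synthetic and the click rule (item 1 is clicked more often) is an assumption made purely for the example; the candidate values of `C` are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data (hypothetical): 200 impressions over 4 users and 3 items
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'user_id': rng.choice(list('ABCD'), size=n),
    'item_id': rng.integers(1, 4, size=n),
})
# Assumed click rule for illustration: item 1 is clicked with probability 0.7, others 0.2
click_prob = np.where(df['item_id'] == 1, 0.7, 0.2)
df['clicked'] = (rng.random(n) < click_prob).astype(int)

X = df[['user_id', 'item_id']]
y = df['clicked']

model = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['user_id', 'item_id']),
    ])),
    ('classifier', LogisticRegression(solver='liblinear', max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {'classifier__C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(model, param_grid, scoring='roc_auc', cv=3)
search.fit(X, y)

print(search.best_params_)
print(f'Best cross-validated AUC-ROC: {search.best_score_:.3f}')
```

Random search follows the same pattern with `RandomizedSearchCV`, which samples a fixed number of candidates instead of trying every combination.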

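2. Detailed steps for training a CTR model and making predictions

For a more detailed code example, we consider a slightly more complex scenario with additional feature processing steps and a more concrete training and prediction workflow. The more complete example below shows how to handle categorical features and numeric features (if any), and uses logistic regression for CTR prediction.

2.1 Data preparation

First, we simulate a dataset that contains both categorical and numeric features.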

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Made-up sample data
data = {
    'user_id': ['A', 'B', 'C', 'A', 'B', 'C'],
    'item_id': [1, 2, 3, 2, 3, 1],
    'user_age': [25, 35, 22, 28, 32, 27],  # made-up numeric feature
    'clicked': [1, 0, 1, 1, 0, 1]
}
df = pd.DataFrame(data)

# Separate the features from the label
X = df.drop('clicked', axis=1)
y = df['clicked']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

2.2 Feature engineering

We use ColumnTransformer to apply different processing to different feature types.

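# Define the categorical and numeric features
categorical_features = ['user_id', 'item_id']
numeric_features = ['user_age']

# Preprocess the categorical features
# OneHotEncoder accepts string and integer categories directly. LabelEncoder is meant
# for target labels and cannot be used as a Pipeline step, so it is not used here.
categorical_preprocessor = Pipeline(steps=[
    # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
    ('onehotencoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # one-hot encoding
])

# Preprocess the numeric features
numeric_preprocessor = Pipeline(steps=[
    ('scaler', StandardScaler())  # standardization
])

# Combine the preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_preprocessor, categorical_features),
        ('num', numeric_preprocessor, numeric_features)
    ]
)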

2.3 Model training and evaluation

# Define the model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', max_iter=1000))
])

# Train the model
model.fit(X_train, y_train)

# Predict the click probability
y_pred_prob = model.predict_proba(X_test)[:, 1]

# Evaluate the model
auc = roc_auc_score(y_test, y_pred_prob)
print(f'AUC-ROC: {auc}')

# Predict the class (for binary classification the threshold is usually 0.5)
y_pred = (y_pred_prob >= 0.5).astype(int)

# Evaluate accuracy (note: accuracy may not be the best metric, especially on imbalanced data)
accuracy = (y_pred == y_test).mean()
print(f'Accuracy: {accuracy}')
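Because accuracy can be misleading on imbalanced data, it may help to look at a few additional metrics. The sketch below uses hand-written, hypothetical labels and predicted probabilities (not the output of the model above) to show how precision, recall, F1, and log loss can be computed with scikit-learn.

```python
import numpy as np
from sklearn.metrics import f1_score, log_loss, precision_score, recall_score

# Hypothetical true labels and predicted click probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])

# Threshold the probabilities at 0.5 to get class predictions
y_hat = (y_prob >= 0.5).astype(int)

precision = precision_score(y_true, y_hat)  # share of predicted clicks that were real clicks
recall = recall_score(y_true, y_hat)        # share of real clicks that were predicted
f1 = f1_score(y_true, y_hat)                # harmonic mean of precision and recall
loss = log_loss(y_true, y_prob)             # penalizes confident but wrong probabilities

print(f'Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}, Log loss: {loss:.3f}')
```

Precision, recall, and F1 depend on the chosen threshold, whereas log loss is computed on the probabilities themselves, which makes it a useful complement when the predicted probability is used directly, for example in ad ranking.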

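2.4 Predicting on new data

Once the model is trained and its performance meets the requirements, we can use it to predict the click probability for new data.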

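# Suppose we have new data
# (users 'D' and 'E' were not seen during training; handle_unknown='ignore'
#  encodes them as all zeros instead of raising an error)
new_data = pd.DataFrame({
    'user_id': ['D', 'E'],
    'item_id': [2, 3],
    'user_age': [30, 20]
})

# Predict the click probability for the new data
new_data_pred_prob = model.predict_proba(new_data)[:, 1]
print(f'Predicted click probabilities for new data: {new_data_pred_prob}')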

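Note that this example is simplified for teaching purposes. In real applications feature engineering is likely to be more complex and to involve more factors, such as time, contextual information, and user behavior sequences. Model selection and tuning are also important steps for ensuring prediction quality.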

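3. Concrete model training and prediction steps

For the concrete training and prediction steps, the following is a more detailed workflow based on Python and scikit-learn. It assumes we already have a prepared dataset containing the features (categorical, numeric, or a mix of both) and the target variable (whether the item was clicked).

3.1 Importing the required libraries and modules

First, we import all the necessary libraries and modules.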

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Assume you already have a prepared DataFrame 'df' that contains the features and the label

3.2 Data preparation

Assume you already have a pandas DataFrame named df that contains the features and the target variable.

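# Assume df is your dataset and already contains the features and the label
# X holds the features, y holds the label
X = df.drop('clicked', axis=1)  # assumes 'clicked' is the name of the target column
y = df['clicked']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)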

3.3 Feature engineering

Features are processed separately according to their type (categorical or numeric).

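# Define the categorical and numeric features
categorical_features = ['user_id', 'item_id']  # assumed to be categorical features
numeric_features = ['user_age', 'other_numeric_feature']  # assumed to be numeric features

# Preprocess the categorical features
# OneHotEncoder accepts string and integer categories directly. LabelEncoder is meant
# for target labels and cannot be used as a Pipeline step, so it is not used here.
categorical_preprocessor = Pipeline(steps=[
    # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
    ('onehotencoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # one-hot encoding
])

# Preprocess the numeric features
numeric_preprocessor = Pipeline(steps=[
    ('scaler', StandardScaler())  # standardization
])

# Combine the preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_preprocessor, categorical_features),
        ('num', numeric_preprocessor, numeric_features)
    ]
)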

3.4 Model training

Now we can define and train the model.

# Define the model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', max_iter=1000))
])

# Train the model
model.fit(X_train, y_train)

3.5 Model evaluation

Use the test set to evaluate the model's performance.

# Predict the click probability
y_pred_prob = model.predict_proba(X_test)[:, 1]

# Compute AUC-ROC
auc = roc_auc_score(y_test, y_pred_prob)
print(f'AUC-ROC: {auc}')

# Predict the class (for binary classification the threshold is usually 0.5)
y_pred = (y_pred_prob >= 0.5).astype(int)

# Evaluate accuracy (note: accuracy may not be the best metric, especially on imbalanced data)
accuracy = (y_pred == y_test).mean()
print(f'Accuracy: {accuracy}')

3.6 Predicting on new data