How to Address Overfitting

Posted by 數量技術宅 on 2023-05-14

Original URL: https://www.cnblogs.com/sljsz/p/17398804.html

Why Overfitting Occurs

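When we build data-analysis models or backtest quantitative strategies, a model often overfits the historical data during training (the backtest), so its predictions on new data (live trading) turn out poorly. There are several common causes.

The first is that the model is too complex and has too many parameters, so it fits the historical data with ease but predicts poorly on new data. Let us look at a typical example of overfitting.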

import numpy as np
import matplotlib.pyplot as plt

# Generate some data
x = np.linspace(0, 10, 100)
y = np.sin(x) + np.random.normal(scale=0.1, size=100)

# Fit a polynomial of degree 20
p = np.polyfit(x, y, 20)
y_pred = np.polyval(p, x)

# Plot the data and the fitted polynomial
plt.scatter(x, y)
plt.plot(x, y_pred, color='red')
plt.show()

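We generate a series from the sin function plus random noise and fit this sample with a degree-20 polynomial. A model of such high order has far more flexibility than the underlying curve needs, so its in-sample fit is inevitably excellent. But in learning the training data the model has also fitted the noise in it; the model is too complex and loses its ability to generalize.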
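The loss of generalization described above can be made visible by holding out part of the sample. The following is a minimal sketch, not the author's code: it assumes a chronological-style split (first 80 points in-sample, last 20 out-of-sample), a fixed random seed, and a degree-3 polynomial as an arbitrary low-order comparison. It uses `numpy.polynomial.Polynomial.fit` rather than `np.polyfit` because the former rescales x internally, which is better conditioned at high degree.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability

# Same kind of data as in the example above: sine plus Gaussian noise
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

# Chronological-style split (an assumption): first 80 points in-sample,
# last 20 points out-of-sample
x_train, y_train = x[:80], y[:80]
x_test, y_test = x[80:], y[80:]


def mse(model, xs, ys):
    """Mean squared error of a fitted polynomial on (xs, ys)."""
    return float(np.mean((model(xs) - ys) ** 2))


results = {}
for degree in (3, 20):
    # Fit on the in-sample points only
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree:2d}: in-sample MSE = {results[degree][0]:.4g}, "
          f"out-of-sample MSE = {results[degree][1]:.4g}")
```

The degree-20 fit should show the lower in-sample error of the two, while its error on the held-out points is far larger than its in-sample error. This is the same gap that appears between a backtest and live trading.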

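Another possible cause is sample selection bias. If the historical data do not represent how the market will behave in the future, a model trained on them will overfit and fail to predict correctly in practice. For example, consider a short-horizon or high-frequency stock index futures strategy tested on data from before August 2015. The change in stock index futures trading fees caused a drastic shift in market structure, so unless fees return to their earlier level, the bias in the data is very likely to translate into a bias in future performance.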

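How to Address the Overfitting Problem

Overfitting is the most common problem in data analysis and quantitative strategy building, and we need to take some measures to address it.

Cross-Validation

Splitting the dataset is one of the key steps in avoiding overfitting. When building a quantitative strategy, we usually split the data into a training set and a test set: the training set is used to fit the model, and the test set is used to check how well it generalizes. To guard further against overfitting, we can use cross-validation: divide the dataset into 10 parts and, in each round, take one part as the test set and the other nine as the training set. Because every part serves as the test set once, this gives a more reliable check of the model's generalization ability.

Below is an example of 10-fold cross-validation with the sklearn library. It calls cross_val_score and prints the cross-validation scores, their mean, and their standard deviation.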

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Generate some data
x = np.linspace(0, 10, 100)
y = np.sin(x) + np.random.normal(scale=0.1, size=100)

# Fit a polynomial of degree 20 (kept from the earlier example; y_pred is
# not used by the cross-validation below)
p = np.polyfit(x, y, 20)
y_pred = np.polyval(p, x)

# Cross-validate a plain linear regression of y on x with 10 folds.
# For a regressor the default score is R^2, and an integer cv uses
# consecutive, unshuffled folds.
lr = LinearRegression()
scores = cross_val_score(lr, x.reshape(-1, 1), y, cv=10)

# Print the mean score and standard deviation
print("Cross-validation scores:", scores)
print("Mean score:", np.mean(scores))
print("Standard deviation:", np.std(scores))

'''
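Output from the author's run. Only five scores are listed, which suggests that run used cv=5; with cv=10 ten scores are printed.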
Cross-validation scores: [-7.06062585e+00 -4.33284120e-04 -2.57612012e+01 -2.13349644e+00
 -6.45893114e-01]
Mean score: -7.120329969404656
Standard deviation: 9.643292108330295
'''
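The example above cross-validates a straight-line model. To cross-validate the polynomial itself, the polynomial expansion has to sit inside the estimator so that it is refitted on each training fold. The sketch below is one way to do that and is not the author's method. The pipeline, the shuffled `KFold`, the mean-squared-error scoring, the fixed seed, and the candidate degrees 1, 7, and 20 are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)
X = x.reshape(-1, 1)

# Shuffled 10-fold split, so each test fold is spread over the whole x range
cv = KFold(n_splits=10, shuffle=True, random_state=0)

cv_mse = {}
for degree in (1, 7, 20):
    # Rescale x to [-1, 1] before expanding, to keep high powers well behaved
    model = make_pipeline(
        MinMaxScaler(feature_range=(-1, 1)),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    # sklearn reports negated MSE so that "higher is better" holds for all scorers
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()
    print(f"degree {degree:2d}: mean CV MSE = {cv_mse[degree]:.4g} "
          f"(std {scores.std():.4g}, {len(scores)} folds)")
```

Comparing the mean cross-validated error across degrees gives a basis for choosing model complexity that does not rely on the in-sample fit. For time-ordered market data a shuffled split can leak future information, so the shuffling here should be read as a simplification for this toy series.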

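Expanding the Training Dataset

This example shows how the size of the training set affects a model's predictive ability. In the chart, the horizontal axis is the training set size and the vertical axis is the model's prediction error on the test set. When the training set is small, the prediction error on the test set is large; when the training set is large, the error is smaller. Training set size therefore has an important effect on predictive ability.

Therefore, as far as the available data allow, we should choose as large a training (backtest) dataset as possible. For a medium- or low-frequency strategy, for example, we can train on 5 or even more than 10 years of data. A high-frequency strategy also needs as large a training set as possible, subject to the range of Tick and OrderBook data that can actually be obtained.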
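The relationship between training set size and test error described above can be reproduced numerically with sklearn's `learning_curve`. This is a minimal sketch under stated assumptions, not the code behind the original chart: the noisy sine data, the degree-9 polynomial pipeline, the five training-size fractions, and the fixed seeds are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability
x = np.linspace(0, 10, 500)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)
X = x.reshape(-1, 1)

# A moderately flexible model (degree 9 is an arbitrary illustrative choice)
model = make_pipeline(
    MinMaxScaler(feature_range=(-1, 1)),
    PolynomialFeatures(degree=9, include_bias=False),
    LinearRegression(),
)

# For each fraction, fit on a random subset of the training fold and
# score on the held-out fold
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y,
    train_sizes=[0.05, 0.1, 0.25, 0.5, 1.0],
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
    shuffle=True, random_state=0,
)

# Average over the folds and flip the sign back to a plain MSE
test_mse = -test_scores.mean(axis=1)
for n, err in zip(train_sizes, test_mse):
    print(f"training size {n:3d}: mean test MSE = {err:.4g}")
```

Plotting `test_mse` against `train_sizes` gives a chart with the axes described above. How quickly the error falls depends on the model and the noise level, so the specific numbers from this toy series should not be carried over to real strategies.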

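Feature Selection and Regularization

Feature selection is another key to avoiding overfitting. When modeling data or strategies, we usually have many candidate features available for prediction. Using all of them makes the model too complex and prone to overfitting. We therefore use regularization to pick out the most important features, which reduces the number of features and lowers the risk of overfitting.

Below is an example of using regularization to avoid overfitting (the data-loading step is omitted). We create a logistic regression model and use L1 regularization to select the important features, then fit a logistic regression model on the selected features and evaluate it on the test set. By keeping only the most important features, this approach reduces the model's complexity and improves its generalization performance.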

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

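# Load data ......
# (the omitted step is assumed to produce a pandas DataFrame named `data`
# with a 'target' column)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('target', axis=1), data['target'], test_size=0.2, random_state=42)

# Create logistic regression model with an L1 penalty
# (the default penalty is L2; 'liblinear' is a solver that supports L1)
lr = LogisticRegression(penalty='l1', solver='liblinear')

# Use L1 regularization to select important features
selector = SelectFromModel(estimator=lr, threshold='1.25*median')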
selector.fit(X_train, y_train)

# Transform training and testing sets to include only important features
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

# Fit logistic regression model on selected features
lr_selected = LogisticRegression()
lr_selected.fit(X_train_selected, y_train)

# Evaluate model performance on testing set
print('Accuracy on testing set:', lr_selected.score(X_test_selected, y_test))
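Because the data-loading step is omitted, the example above cannot be run as it stands. The sketch below is a self-contained variant on synthetic data, offered as an illustration rather than as the author's setup. The dataset from `make_classification`, the regularization strength `C=0.05`, and the use of `SelectFromModel`'s default threshold (instead of `'1.25*median'`) are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the omitted data (an assumption): 20 features,
# of which only the first 5 carry signal because shuffle=False
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# L1-penalised logistic regression; a smaller C means stronger regularization
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)

# With an L1 estimator, SelectFromModel's default threshold keeps the
# features whose coefficients are not (numerically) zero
selector = SelectFromModel(estimator=l1_model).fit(X_train, y_train)
selected = np.flatnonzero(selector.get_support())
n_selected = selected.size
print(f"kept {n_selected} of {X.shape[1]} features, column indices: {selected}")

# Refit an ordinary logistic regression on the selected columns only
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
final_model = LogisticRegression().fit(X_train_selected, y_train)
accuracy = final_model.score(X_test_selected, y_test)
print(f"accuracy on testing set: {accuracy:.3f}")
```

The L1 penalty drives the coefficients of uninformative features to exactly zero, which is what lets it act as a feature selector. How many features survive depends on `C`, so in practice that value would itself be chosen by cross-validation.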