文章說明

本文是我上個學期選修的一門人工智慧與交通運輸課程的一個小作業的實驗報告的示例程式碼部分，原始檔為 Jupyter Notebook 格式。這份實驗報告是關於對一組微觀交通流量資料應用資料分析方法進行簡單的研究的，實現了多種不同的交通預測模型並進行了對比。

報告內容詳見我的部落格文章：我的人工智慧與交通運輸課程作業：交通流資料分析報告

文章說明
交通流分析示例程式碼
- 匯入庫
- 資料處理部分
- 資料的統計分佈特徵
- 資料的欠取樣
  - 格林希爾茨（Greenshields）線性模型
  - 一元線性迴歸
  - 格林希爾茲模型精度評價
- 三角圖模型
  - 劃分出兩段資料集
  - 畫出擬合線段
- S3 模型
- Multilayer Feed-forward Neural Network 多層前饋神經網路
  - 重新劃分資料集
- 訓練神經網路
  - 800 神經元，3 層
    - 計算預測值並評估模型精度
  - 400 神經元，4 層
  - 200 神經元，6 層
  - 100 神經元，10 層
  - 50 神經元，18 層
  - 神經網路預測效果比較

交通流分析示例程式碼

由於交通流分析研究的需要，交通流模型應當是可微的，同時模型的預測要儘可能與經驗觀測的結果相符合。

一個好的交通流模型通常具有以下特點：

函式形式簡單
模型函式在整個密度範圍內均有定義
模型引數具有實際的物理意義
與經驗觀測的結果在統計學上具有較好的符合性

匯入庫

numpy、pandas、matplotlib，常用的資料分析庫，稱之為 “資料分析三巨頭”或者“御三家”。

numpy 通常用於數學計算、矩陣運算、線性數學
pandas 用於處理資料二維表，用於資料探勘、資料分析
matplotlib 用於資料視覺化繪圖

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

如下程式碼可以用於解決 MatPlotLib 中文顯示的亂碼問題。

plt.rcParams.update({
    "figure.dpi":100,           # 設定圖片清晰度
    "font.sans-serif":'SimHei', # 字型為黑體，否則會出現中文亂碼
    "axes.unicode_minus":False, # 解決負號不顯示的問題
})

在 Jupyter Notebook 使用中發現，未新增 %matplotlib inline 時，所有的圖片均不在 Jupyter 中自動顯示；而如果在程式碼框中加入 plt.show()，會開啟額外的 tk 視窗，視窗顯示均為空白。

疑似本地 matplotlib 故障。多次重灌 matplotlib 及 matplotlib-inline，均未果。

import matplotlib
matplotlib.use('TkAgg')  # 或者其他合適的GUI庫名稱
%matplotlib inline

資料處理部分

讀取資料，使用方法 pandas.read_csv()

df = pd.read_csv("./data/M_SE.csv")
df

	Time	Week	Dis	Flow	Speed
0	0	1	0.00	72	73.5
1	1	1	0.00	54	74.3
2	2	1	0.00	49	73.1
3	3	1	0.00	62	73.0
4	4	1	0.00	47	70.7
...	...	...	...	...	...
8059	283	7	2.69	119	73.2
8060	284	7	2.69	139	72.2
8061	285	7	2.69	121	74.6
8062	286	7	2.69	102	74.0
8063	287	7	2.69	93	73.9

8064 rows × 5 columns

缺少密度資料。根據流量 - 密度 - 速度之間的關係：

\[k = \dfrac{q}{v} \]

可以計算出車流密度的大小。

df["Density"] = df["Flow"] / df["Speed"]
df["Density"]

0       0.979592
1       0.726783
2       0.670315
3       0.849315
4       0.664781
          ...   
8059    1.625683
8060    1.925208
8061    1.621984
8062    1.378378
8063    1.258457
Name: Density, Length: 8064, dtype: float64

df

	Time	Week	Dis	Flow	Speed	Density
0	0	1	0.00	72	73.5	0.979592
1	1	1	0.00	54	74.3	0.726783
2	2	1	0.00	49	73.1	0.670315
3	3	1	0.00	62	73.0	0.849315
4	4	1	0.00	47	70.7	0.664781
...	...	...	...	...	...	...
8059	283	7	2.69	119	73.2	1.625683
8060	284	7	2.69	139	72.2	1.925208
8061	285	7	2.69	121	74.6	1.621984
8062	286	7	2.69	102	74.0	1.378378
8063	287	7	2.69	93	73.9	1.258457

8064 rows × 6 columns

資料的統計分佈特徵

在開始進行資料的處理和分析之前，需要對資料分佈的統計學特徵有所瞭解。

檢視資料的統計特徵。

可以看見原始資料的密度在 6.152 左右，密度整體較高。

df[["Flow", "Speed", "Density"]].describe()

	Flow	Speed	Density
count	8064.000000	8064.000000	8064.000000
mean	376.938244	66.836297	6.152005
std	236.932452	9.715159	4.613613
min	13.000000	8.000000	0.194444
25%	129.000000	65.775000	1.795225
50%	403.000000	70.400000	5.768694
75%	591.000000	72.400000	9.193952
max	815.000000	76.800000	31.458333

繪製箱線圖，對資料樣本點統計分佈情況進行視覺化。

df.boxplot()

<AxesSubplot: >

檢驗資料的分佈是否符合正態性特徵。

from scipy.stats import kstest
alpha = 0.05
for cols in df[["Flow", "Density", "Speed"]]:
    data_to_KS = df[cols].values
    u = data_to_KS.mean() # 計算均值
    std = data_to_KS.std() # 計算標準差
    result = kstest( data_to_KS, 'norm', (u, std) )
    print(str(cols), '是否正態:', result[1] > alpha, 
        "; k 檢驗值：", result[1])

Flow 是否正態: False ; k 檢驗值： 8.095684176144447e-81
Density 是否正態: False ; k 檢驗值： 2.198859154425868e-70
Speed 是否正態: False ; k 檢驗值： 0.0

發現資料樣本不具有正態性，這說明原始資料取樣不均勻。

fig, axes = plt.subplots(3, 1)
df["Flow"].hist(bins=30, alpha=0.5, ax=axes[0]) 
df["Flow"].plot(kind='kde', secondary_y=True, ax=axes[0])
df["Density"].hist(bins=30, alpha=0.5, ax=axes[1])
df["Density"].plot(kind='kde', secondary_y=True, ax=axes[1])
df["Speed"].hist(bins=30, alpha=0.5, ax=axes[2])
df["Speed"].plot(kind='kde', secondary_y=True, ax=axes[2])

<AxesSubplot: >

檢驗資料樣本中兩兩變數之間的相關性。由於資料不具有正態性，因此採用斯皮爾曼相關性方法。

corr = df.corr(method='spearman')
sns.heatmap(corr, cmap="bwr", annot=True) 
#顯示對腳線下面部分圖

<AxesSubplot: >

接下來，繪製資料樣本點在各個速度下的分佈情況，檢視取樣是否均勻：

counts = df['Speed'].value_counts()
counts = counts.sort_index()
plt.bar(counts.index, counts.values)
plt.title("Sample Counts at each Speed")

Text(0.5, 1.0, 'Sample Counts at each Speed')

透過對不同速度下的車流的計數繪製柱狀圖可以看出，車流的樣本點幾乎完全集中在 65 到 75 這個區間之內。這就意味著，如果直接利用梯度下降演算法對模型進行擬合，就會導致對資料點進行擬合的時候梯度下降法為追求整體殘差最小，會優先照顧到這些資料量較大的點而忽略其他存在的點。

當前情形屬於資料分析中，遇到資料點取樣不均、迴歸方法優先照顧樣本數量較多的資料點而忽略取樣數量較少、但是對於解決實際問題很重要的資料點，以致模型擬合結果不理想的情況。

需要根據具體情況重新取樣，或者根據資料的分佈特徵劃分資料集。

對於上述資料分佈不均的情況，通常採用以下幾種方法：

過取樣：對少數類重新取樣，使得資料點較少的區域也能擁有較多樣本
欠取樣：篩選資料點密集區域的資料點，使這些資料點減少，與資料點較少區域的資料點的數量保持一致。
使用權重的梯度下降：在計算梯度時，考慮每個資料點的頻率或者密度，對數量較少的點給予更大的權重。
透過變換方法，對資料點進行消偏操作
選擇對樣本點分佈情況不敏感的模型進行迴歸

資料的欠取樣

此處資料重取樣的方法是對資料點進行篩選，檢查並刪除具有相等的速度或密度或流量值的樣本點，只保留其中的一個。

這樣做的效果是，可以使得同速度、密度、流量下的樣本點只保留一個。

匿名函式使用 sample() 方法從每個分組中隨機選擇指定數量（即 sample_size）的樣本點。最終得到一個包含均勻篩選取樣結果的新 DataFrame。

透過改變 col_names 列表中的列名，可以選擇性刪除重複的特定列具有重複值的樣本點。

resampled_df = df
col_names = ['Density', 'Speed', 'Flow'] # 按列名取樣
sample_size = 1 # 每個值都取相同數量的樣本點

for name in col_names:
    target_count = resampled_df[name].value_counts()
    # 對每個值進行均勻篩選取樣
    resampled_df = resampled_df.groupby(name).apply(
        lambda x: x.sample(n=sample_size)
        )
    # 重置索引並刪除多餘的索引列
    resampled_df.reset_index(drop=True, inplace=True)
resampled_df

	Time	Week	Dis	Flow	Speed	Density
0	32	1	0.00	18	51.2	0.351562
1	31	1	2.69	20	51.4	0.389105
2	26	3	0.00	21	64.4	0.326087
3	24	1	2.69	22	62.8	0.350318
4	36	1	0.00	23	61.9	0.371567
...	...	...	...	...	...	...
350	82	4	2.69	780	61.5	12.682927
351	205	1	2.69	781	55.9	13.971377
352	79	2	2.69	784	66.2	11.842900
353	94	4	2.69	808	53.7	15.046555
354	86	1	2.69	812	59.2	13.716216

355 rows × 6 columns

繪製出篩選後的資料如下圖所示：

resampled_df.plot.scatter( x = "Density", y = "Speed" )

<AxesSubplot: xlabel='Density', ylabel='Speed'>

由於是隨機篩選資料點，每一次執行程式碼之後正態性檢驗的結果均不同。但是相較之前其 k 檢驗值置信度已經大大增加。

from scipy.stats import kstest
alpha = 0.05
for cols in resampled_df[["Flow", "Density", "Speed"]]:
    data_to_KS = resampled_df[cols].values
    u = data_to_KS.mean() # 計算均值
    std = data_to_KS.std() # 計算標準差
    result = kstest( data_to_KS, 'norm', (u, std) )
    print(str(cols), '是否正態: ', result[1] > alpha, 
        '; k 檢驗值：', result[1])

Flow 是否正態:  False ; k 檢驗值： 0.0006183747274011923
Density 是否正態:  False ; k 檢驗值： 0.03517473477505051
Speed 是否正態:  False ; k 檢驗值： 0.02217574066080618

fig, axes = plt.subplots(3, 1)
resampled_df["Flow"].hist( 
    bins = 30, alpha = 0.5, ax=axes[0])
resampled_df["Flow"].plot( 
    kind = 'kde', secondary_y = True, ax=axes[0])
resampled_df["Density"].hist( 
    bins = 30, alpha = 0.5, ax=axes[1])
resampled_df["Density"].plot( 
    kind = 'kde', secondary_y = True, ax=axes[1])
resampled_df["Speed"].hist( 
    bins = 30, alpha = 0.5, ax=axes[2])
resampled_df["Speed"].plot( 
    kind = 'kde', secondary_y = True, ax=axes[2])

<AxesSubplot: >

常見的定義在密度 - 速度關係上的交通流模型有：

格林希爾茲（Greenshields）線性模型
迪克（Dick. A. C）折線圖模型
格林伯格（Greenberg）對數（擴充套件）模型
安德伍德（Underwood）指數模型

透過比較多個模型的擬合效果，發現重取樣後的格林希爾茲模型的擬合效果最好。

格林希爾茨（Greenshields）線性模型

假設速度 - 密度模型為線性模型，當交通密度升高時，速度就呈現下降的趨勢。

\[v = v_f - \left[ \dfrac{v_f}{k_j} \right] \cdot k \]

\(u_f\) 是暢行速度，也就是車流密度趨於零車輛可以暢行無阻的平均速度
\(k_j\) 是阻塞密度，也就是車流密集到所有車輛都無法移動時的速度

也就相當於線性模型為：

\[v = a - bk \\ \]

其中，\(\begin{cases}a = v_f, \\ b = \left[ \dfrac{v_f}{k_j} \right] \end{cases}\)。透過對模型進行一元線性擬合，我們可以求解出 \(a\) 與 \(b\)，從而推到就可以知道 \(u_f\) 與 \(k_j\) 的值。

對篩選的資料進行相關性檢驗：

corr = resampled_df.corr(method='spearman')
sns.heatmap(corr, cmap="bwr", annot=True) 
#顯示對腳線下面部分圖

<AxesSubplot: >

可以看到，這樣對資料點的篩選方法導致了流量和速度以及密度的相關性的損失，但是提高了速度和密度之間的相關性，由原來的 -0.76 提升到了 -0.92 左右。

在已經劃分出的兩段資料集中，我們可以進一步劃分出學習集和測試集。

由於交通流密度資料沒有時序性，採用直接取資料點末段為測試集的做法不嚴謹。

sklearn.model 模組提供了函式 train_test_seperate() 用於劃分訓練集和測試集。測試集的尺寸可以用引數 test_size 控制。

sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)[source]

Split arrays or matrices into random train and test subsets.

Quick utility that wraps input validation, next(ShuffleSplit().split(X, y)), and application to input data into a single call for splitting (and optionally subsampling) data into a one-liner.

具體的用法可以參考 sklearn.model_selection.train_test_split

我們這裡取原始資料長度的 20% 為測試集資料長度，即 test_size = 0.2

from sklearn.model_selection import train_test_split
rs_train_df, rs_test_df = train_test_split(
    resampled_df, test_size=0.2)
rs_train_X = rs_train_df[["Density"]].values
rs_train_Y = rs_train_df[["Speed"]].values
rs_test_X = rs_test_df[["Density"]].values
rs_test_Y = rs_test_df[["Speed"]].values

下圖展示了學習機和測試集的劃分結果：

fig, ax = plt.subplots()
rs_train_df.plot.scatter( x = "Density", y = "Speed", 
    label = "Training Data", c = "C0", ax = ax )
rs_test_df.plot.scatter( x = "Density", y = "Speed", 
    label = "Test Data", c = "C9", ax = ax )

<AxesSubplot: xlabel='Density', ylabel='Speed'>

一元線性迴歸

Python 中常見的一元線性迴歸的實現方法是 sklearn.linear_model.LinearRegression()，這是一個專用的一元線性迴歸方法。具有使用便捷靈活的優點，但是缺點是無法輸出模型的 Summary 統計資訊等。

我們這裡選擇 statsmodels.api 中的 OLS 工具包。用法如下：

statsmodels.formula.api.ols(formula, data, subset=None, drop_cols=None, *args, **kwargs)¶
Create a Model from a formula and dataframe.

具體的用法可以參考 statsmodels.formula.api.ols

OLS 法透過一系列的預測變數來預測響應變數（也可以說是在預測變數上回歸響應變數）。

線性迴歸是指對引數 \(\beta\) 為線性的一種迴歸（即引數只以一次方的形式出現）模型：

\(Y_t = \alpha + \beta x_t + \mu_t, (t = 1...... n)\) 表示觀測數。

\(Y_t\) 被稱作因變數
\(x_t\) 被稱作自變數
\(\alpha, \beta\) 為需要最小二乘法去確定的引數，或稱迴歸係數
\(\mu_t\) 為隨機誤差項

import statsmodels.api as sm
# 加入截距 
rs_train_X_with_bias = sm.add_constant(rs_train_X)
Greenshields_model = sm.OLS( # 注意引數排列的順序
    rs_train_Y, rs_train_X_with_bias
).fit() 
# 輸出結果到螢幕
Greenshields_model.summary()

OLS Regression Results
Dep. Variable:	y	R-squared:	0.814
Model:	OLS	Adj. R-squared:	0.813
Method:	Least Squares	F-statistic:	1235.
Date:	Mon, 16 Oct 2023	Prob (F-statistic):	4.89e-105
Time:	16:36:25	Log-Likelihood:	-991.86
No. Observations:	284	AIC:	1988.
Df Residuals:	282	BIC:	1995.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	79.0400	0.975	81.049	0.000	77.120	80.960
x1	-2.5123	0.071	-35.142	0.000	-2.653	-2.372

Omnibus:	14.058	Durbin-Watson:	2.109
Prob(Omnibus):	0.001	Jarque-Bera (JB):	14.765
Skew:	-0.516	Prob(JB):	0.000622
Kurtosis:	3.427	Cond. No.	28.2

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

fig, ax = plt.subplots()
# 畫出訓練集和測試集資料點
rs_train_df.plot.scatter( x = "Density", y = "Speed", 
    label = "Training Data", c = "C0", ax = ax )
rs_test_df.plot.scatter( x = "Density", y = "Speed", 
    label = "Test Data", c = "C9", ax = ax )

# 第一段模型
plot_X = np.arange( 0, df[["Density"]].values.max(), 0.01 )
plot_X_with_bias = sm.add_constant(plot_X)
plot_Y = Greenshields_model.predict(plot_X_with_bias)
ax.plot(plot_X, plot_Y, label = "reg plot",
    color = "red")

plt.title("Greenshields' Model")
plt.legend()

<matplotlib.legend.Legend at 0x1fbf80f7f70>

檢視模型擬合的引數：

coef = Greenshields_model.params[1]
intercept = Greenshields_model.params[0]
coef, intercept

(-2.512260086391756, 79.0399551935032)

格林希爾茲模型精度評價

使用 Python 的機器學習庫 scikit-learn（sklearn）時，可以使用 sklearn.metrics 庫來進行模型評估和效能度量。sklearn.metrics 庫提供了多種函式和指標，可以直接呼叫。

對於迴歸分析，最常用的指標如：

mean_squared_error，均方誤差
mean_absolute_error，平均絕對誤差
r2_score，迴歸的決定係數

除此之外，均方根誤差可以對均方誤差取絕對值獲得。

R2（迴歸決定係數）是迴歸模型中用來衡量模型擬合優良程度的指標，它的值介於 0 和1之間。R2 越接近 1，表示模型的擬合度越好。

\[R^2= \dfrac{\text{SSR}}{\text{SST}} = \dfrac{\sum_{i=1}^{n}{ \left( \hat y_i - \bar y\right)^2}}{\sum_{i=1}^{n}{ \left(y_i - \bar y \right)^2}} \]

其中 \(\text{SSR}\) 是模型解釋的平方和，\(\text{SST}\) 是總的平方和。

MSE 是衡量模型預測誤差大小的指標，值越小表示模型的預測精度越高。

\[\text{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left( y_i - \hat y_i \right)^{2} \]

其中n是樣本數量，\(y_i\) 是實際觀測值，\(\hat y_i\) 是模型預測值。

RMSE 是 MSE 的平方根，即：\(\text{RMSE} = \sqrt{\text{MSE}}\)，它和 MSE 一樣都是用來衡量模型預測誤差大小的指標，值越小表示模型的預測精度越高。

MAE是用來衡量模型預測誤差大小的另一個指標，它計算的是模型預測值與實際觀測值之間的絕對差的平均值。它的latex表示式為：

\[\text{MAE}=\frac{1}{n}\sum_{i=1}^{n}{\left| y_i - \hat y_i\right|} \]

MAE 對異常值更敏感，它的計算方式可能導致異常值對最終結果產生較大的影響。

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# m用這個模型預測
rs_test_X_with_bias = sm.add_constant(rs_test_X)
y_pred = Greenshields_model.predict(rs_test_X_with_bias) 

# 計算誤差值
dist = {
    'R2 迴歸決定係數': r2_score(rs_test_Y, y_pred),
    'MSE 均方誤差': mean_squared_error(rs_test_Y, y_pred),
    'RMSE 均方根誤差': np.sqrt(mean_squared_error(rs_test_Y, y_pred)),
    'MAE 平均絕對誤差': mean_absolute_error(rs_test_Y, y_pred),
}
pd.DataFrame(dist, index=['結果']).transpose()

	結果
R2 迴歸決定係數	0.789087
MSE 均方誤差	81.391103
RMSE 均方根誤差	9.021702
MAE 平均絕對誤差	7.106664

檢驗線性關係的顯著性：

對於格林謝爾茲模型 \(v = v_f - \left[ \dfrac{v_f}{k_j} \right] \cdot k\)，有：\(q = k_j v - \left[ \dfrac{k_j}{v_f} \right] v^2\)

令 \(\begin{cases}a = v_f, \\ b = \left[ \dfrac{v_f}{k_j} \right] \end{cases}\)，則有：\(\begin{cases} q = ak - bk^2 \\ q = \dfrac{a v - v^2}{b} \end{cases}\)。

根據上述關係，畫出模型的流量 - 密度和速度 - 流量曲線如下圖所示：

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# 密度 - 流量關係圖
df.plot.scatter( x = 'Density', y = 'Flow', ax = axes[0] )
plot_X = np.arange(0, df[['Density']].values.max(), 0.01 )
plot_Y = intercept * plot_X + coef * ( plot_X **2 )
axes[0].plot( plot_X, plot_Y, c = 'red' )
# 流量 - 速度關係圖
df.plot.scatter( x='Flow', y='Speed', ax=axes[1])
plot_Y = np.arange(0, df[['Speed']].values.max(), 0.01 )
plot_X = (intercept * plot_Y - plot_Y **2 ) / ( -coef )
axes[1].plot( plot_X, plot_Y, c='red')

[<matplotlib.lines.Line2D at 0x1fbf83000a0>]

三角圖模型

三角圖模型是交通流分析中一種常見的 密度 - 流量模型。。

三角圖模型認為：流量 \(q\) 和密度 \(k\) 呈現兩段式的線性關係，即：

\[q = \begin{cases} v_f k, & k \le k_m \\ w k - w \cdot k_j, & k_m \le k \lt k_j \\ \end{cases} \]

其中，\(v_f\) 為自由流速度，\(k_j\) 為擁塞密度。兩直線交點處對應的密度值 \(k_m\) 為最大流量密度。

下圖展示了原始資料中密度和流量的散點圖關係。

fig = plt.figure()
ax = fig.gca()
df.plot.scatter( 
    x = "Density", y = "Flow", 
    ax = ax)

<AxesSubplot: xlabel='Density', ylabel='Flow'>

根據上圖的資訊，我們可以篩選用於學習的資料。

選擇資料集中流量最大的車流的 Density 值作為劃分資料集的依據。

透過 pandas.DataFrame 的布林索引進行查詢，則有：

max_flow_sample = df[
    df["Flow"] == df["Flow"].values.max()
]
max_flow_sample

	Time	Week	Dis	Flow	Speed	Density
4695	87	3	2.69	815	61.7	13.209076

可以認為：車流阻塞密度 \(k_j = 13.20907618\)

max_flow_sample["Density"].values

array([13.20907618])

劃分出兩段資料集

df_1 = df[
    df["Density"] <= max_flow_sample["Density"].values[0]]
df_2 = df[
    df["Density"] > max_flow_sample["Density"].values[0]]

可以在圖中畫出劃分的兩段資料：

fig, ax = plt.subplots()
ax.scatter(df_1[['Density']], df_1[['Flow']],
    facecolors='none', edgecolors='C0')
ax.scatter(df_2[['Density']], df_2[['Flow']],
    facecolors='none', edgecolors='C8')

<matplotlib.collections.PathCollection at 0x1fbf820c9a0>

檢驗兩段資料的正態性：

from scipy.stats import kstest
alpha = 0.05
print("\n第一段資料：\n")
for cols in df_1[["Flow", "Density", "Speed"]]:
    data_to_KS = df_1[cols].values
    u = data_to_KS.mean() # 計算均值
    std = data_to_KS.std() # 計算標準差
    result = kstest( data_to_KS, 'norm', (u, std) )
    print(str(cols), '是否正態: ', result[1] > alpha, 
        '; k 檢驗值：', result[1])

print("\n第二段資料：\n")
for cols in df_2[["Flow", "Density", "Speed"]]:
    data_to_KS = df_2[cols].values
    u = data_to_KS.mean() # 計算均值
    std = data_to_KS.std() # 計算標準差
    result = kstest( data_to_KS, 'norm', (u, std) )
    print(str(cols), '是否正態: ', result[1] > alpha, 
        '; k 檢驗值：', result[1])

第一段資料：

Flow 是否正態:  False ; k 檢驗值： 1.2168013674798825e-79
Density 是否正態:  False ; k 檢驗值： 5.9609132134836165e-74
Speed 是否正態:  False ; k 檢驗值： 1.5487599808266252e-187

第二段資料：

Flow 是否正態:  False ; k 檢驗值： 5.416601428442127e-05
Density 是否正態:  False ; k 檢驗值： 9.118215063885701e-29
Speed 是否正態:  False ; k 檢驗值： 2.5510983772550485e-07

import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 4.5))

axes[0].set_title("plot 1 corr")
corr = df_1.corr(method='spearman')
sns.heatmap(corr, cmap="bwr", ax = axes[0], annot=True) 

axes[1].set_title("plot 2 corr")
corr = df_2.corr(method='spearman')
sns.heatmap(corr, cmap="bwr", ax = axes[1], annot=True)

<AxesSubplot: title={'center': 'plot 2 corr'}>

劃分第一段資料

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

train_df_1, test_df_1 = train_test_split(df_1, test_size = 0.2)
train_X_1 = train_df_1[["Density"]].values
train_Y_1 = train_df_1[["Flow"]].values
test_X_1 = test_df_1[["Density"]].values
test_Y_1 = test_df_1[["Flow"]].values

劃分第二段資料

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

train_df_2, test_df_2 = train_test_split(df_2, test_size = 0.2)
train_X_2 = train_df_2[["Density"]].values
train_Y_2 = train_df_2[["Flow"]].values
test_X_2 = test_df_2[["Density"]].values
test_Y_2 = test_df_2[["Flow"]].values

這裡要注意的一個點是 sm.OLS() 中，資料的輸入順序是 Y 在前 X 在後。和一般習慣是相反的。

第一段資料：

import statsmodels.api as sm
# formular: Y = beta0 + beta1*X 
# 線性迴歸模型
# 加入截距 
# 注意引數排列的順序
model_1 = sm.OLS(
    train_Y_1, train_X_1
).fit() 
# 輸出結果到螢幕
model_1.summary()

OLS Regression Results
Dep. Variable:	y	R-squared (uncentered):	0.988
Model:	OLS	Adj. R-squared (uncentered):	0.988
Method:	Least Squares	F-statistic:	4.913e+05
Date:	Mon, 16 Oct 2023	Prob (F-statistic):	0.00
Time:	16:36:27	Log-Likelihood:	-31402.
No. Observations:	5963	AIC:	6.281e+04
Df Residuals:	5962	BIC:	6.281e+04
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
x1	64.9438	0.093	700.899	0.000	64.762	65.125

Omnibus:	4780.874	Durbin-Watson:	1.897
Prob(Omnibus):	0.000	Jarque-Bera (JB):	120432.628
Skew:	-3.734	Prob(JB):	0.00
Kurtosis:	23.711	Cond. No.	1.00

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# make the predictions by the model 用這個模型預測
y_pred = model_1.predict(test_X_1) 

# 計算誤差值
dist = {
    'R2 迴歸決定係數': r2_score(test_Y_1, y_pred),
    'MSE 均方誤差': mean_squared_error(test_Y_1, y_pred),
    'RMSE 均方根誤差': np.sqrt(mean_squared_error(test_Y_1, y_pred)),
    'MAE 平均絕對誤差': mean_absolute_error(test_Y_1, y_pred),
}
pd.DataFrame(dist, index=['結果']).transpose()

	結果
R2 迴歸決定係數	0.962100
MSE 均方誤差	2081.450969
RMSE 均方根誤差	45.622922
MAE 平均絕對誤差	30.412174

第二段資料：

import statsmodels.api as sm
# formular: Y = beta0 + beta1*X 
# 線性迴歸模型
# 加入截距 
train_X_2 = sm.add_constant(train_X_2)
# 注意引數排列的順序
model_2 = sm.OLS(
    train_Y_2, train_X_2
).fit() 
# 輸出結果到螢幕
model_2.summary()

OLS Regression Results
Dep. Variable:	y	R-squared:	0.576
Model:	OLS	Adj. R-squared:	0.575
Method:	Least Squares	F-statistic:	658.9
Date:	Mon, 16 Oct 2023	Prob (F-statistic):	1.77e-92
Time:	16:36:27	Log-Likelihood:	-2823.6
No. Observations:	488	AIC:	5651.
Df Residuals:	486	BIC:	5660.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	1081.8249	18.783	57.596	0.000	1044.919	1118.731
x1	-30.1653	1.175	-25.669	0.000	-32.474	-27.856

Omnibus:	26.007	Durbin-Watson:	2.027
Prob(Omnibus):	0.000	Jarque-Bera (JB):	29.220
Skew:	-0.540	Prob(JB):	4.52e-07
Kurtosis:	3.518	Cond. No.	84.3

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# 用這個模型預測
test_X_2_with_bias = sm.add_constant(test_X_2)
y_pred = model_2.predict(test_X_2_with_bias) 

# 計算誤差值
dist = {
    'R2 迴歸決定係數': r2_score(test_Y_2, y_pred),
    'MSE 均方誤差': mean_squared_error(test_Y_2, y_pred),
    'RMSE 均方根誤差': np.sqrt(mean_squared_error(test_Y_2, y_pred)),
    'MAE 平均絕對誤差': mean_absolute_error(test_Y_2, y_pred),
}
pd.DataFrame(dist, index=['結果']).transpose()

	結果
R2 迴歸決定係數	0.637521
MSE 均方誤差	5667.229912
RMSE 均方根誤差	75.281006
MAE 平均絕對誤差	59.346431

畫出擬合線段

fig, ax = plt.subplots()
ax.scatter(df[['Density']], df[['Flow']],
    facecolors='none', edgecolors='C0')

# 第一段模型
plot_X_1 = np.arange( 0, 15, 0.01 )
plot_Y_1 = model_1.predict(plot_X_1)
ax.plot(plot_X_1, plot_Y_1, 
    color = "red")

# 第二段模型
plot_X_2 = np.arange( 5, 35, 0.01 )
plot_X_2_with_bias = sm.add_constant(plot_X_2)
plot_Y_2 = model_2.predict(plot_X_2_with_bias)
ax.plot(plot_X_2, plot_Y_2, 
    color = "green",)

[<matplotlib.lines.Line2D at 0x1fbe4bac100>]

將線性模型引數轉化為交通流模型引數：

vf = model_1.params[0]
w = model_2.params[1]
k_j = - ( model_2.params[0] / w )
vf, w, k_j

(64.94384137558185, -30.165280423323722, 35.86324554666523)

計算最大流量密度和最大車流：

k_m = (model_2.params[0]) / (model_1.params[0] - model_2.params[1])
q_max = (model_2.params[1] * k_m + model_2.params[0] )
print("最大流量密度：", k_m)
print("最大車流：", q_max)

最大流量密度： 11.374564693100982
最大車流： 738.7079251450441

由於已經知道：\(q = kv\)，故可知道 \(v\) 與 \(k\) 之間的關係為：

\[v = \begin{cases} v_f, & k \le k_m \\ w - \dfrac{w \cdot k_j}{k}, & k_m \le k \lt k_j \\ \end{cases} \]

fig, ax = plt.subplots()
ax.scatter(df[['Density']], df[['Speed']],
    facecolors='none', edgecolors='C0')

plot_X_1 = np.arange( k_m, k_j, 0.01 )
plot_Y_1 = w - ((w*k_j) / plot_X_1)
ax.plot( plot_X_1, plot_Y_1, 
    label = 'reg plot 1', c = 'red' )

plot_X_2 = np.arange( 0, k_m, 0.01 )
plot_Y_2 = vf + 0*plot_X_2
ax.plot( plot_X_2, plot_Y_2, 
    label = 'reg plot 2', c = 'green' )

ax.legend()

<matplotlib.legend.Legend at 0x1fbe4c1af10>

S3 模型

參考論文 Logistic modeling of the equilibrium speed–density relationship

模型的函式形式為：

\[v = \dfrac{v_f}{ \left[1 + \left(\dfrac{k}{k_c} \right)^m \right]^{\frac{2}{m}} } \]

其中 \(v_f\) 是自由流速度，\(k_c\) 是可導致最大交通流量的臨界密度，\(m\) 是待確定的正引數。

引數 \(m\) 可以解釋為 最大流量慣性系數。它反映了當密度偏離 \(k_c\) 值（即其對密度變化的抵抗程度）時，交通系統保持滿負荷狀態的能力；不同的 \(m\) 值可以表示沿速度 - 密度曲線的不同平滑度。

首先篩選資料點。

S3 模型是一個對資料分佈不敏感的模型，因此不需要減少資料點來保證資料均勻分佈。可以直接在原始資料中劃分訓練集和測試集。

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size = 0.1)
train_X = train_df[["Density"]].values
train_Y = train_df[["Speed"]].values
test_X = test_df[["Density"]].values
test_Y = test_df[["Speed"]].values

透過函式形式定義模型：

def S3_Model(x, vf, kc, m):
    return vf / ( ( 1 + ( x / kc )**m )**(2/m) )

透過 scipy.optimize 中的 curve_fit() 函式工具擬合自定義的函式並得到模型的引數：

from scipy.optimize import curve_fit
popt, pcov = curve_fit(
    S3_Model, 
    train_X.flatten(), train_Y.flatten() )
popt, pcov

C:\Users\asus\AppData\Local\Temp\ipykernel_13904\206827447.py:2: RuntimeWarning: overflow encountered in power
  return vf / ( ( 1 + ( x / kc )**m )**(2/m) )
C:\Users\asus\AppData\Local\Temp\ipykernel_13904\206827447.py:2: RuntimeWarning: divide by zero encountered in divide
  return vf / ( ( 1 + ( x / kc )**m )**(2/m) )





(array([71.27162496, 11.91753285,  6.31300456]),
 array([[ 0.00328001,  0.00032221, -0.00260432],
        [ 0.00032221,  0.00151838, -0.00327198],
        [-0.00260432, -0.00327198,  0.01023309]]))

擬合出來的引數 popt 是一個包含兩個元素的陣列。每個元素對應模型函式中的一個引數。

至於返回值中的協方差矩陣 pcov，它提供了擬合引數估計值的不確定性。對角線上的元素表示對應引數估計值的方差。非對角線上的元素表示兩個引數之間的協方差。

vf, kc, m = tuple(popt)
vf, kc, m

(71.27162496102929, 11.917532852789654, 6.313004558127913)

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
y_pred = vf / ( ( 1 + ( test_X / kc )**m )**(2/m) )

# calculate the performance metrics 計算誤差值
dist = {
    'R2 迴歸決定係數': r2_score(test_Y, y_pred),
    'MSE 均方誤差': mean_squared_error(test_Y, y_pred),
    'RMSE 均方根誤差': np.sqrt(mean_squared_error(test_Y, y_pred)),
    'MAE 平均絕對誤差': mean_absolute_error(test_Y, y_pred),
}
pd.DataFrame(dist, index=['結果']).transpose()

	結果
R2 迴歸決定係數	0.862163
MSE 均方誤差	12.117290
RMSE 均方根誤差	3.480990
MAE 平均絕對誤差	2.402587

fig, ax = plt.subplots()
plot_X = np.arange(0.01, df[['Density']].values.max(), 0.01)
plot_Y = popt[0] / (
    ( 1 + ( plot_X / popt[1] )**popt[2] )**(2/popt[2]) 
)
train_df.plot.scatter( x = "Density", y = "Speed", 
    label = "Training Data", c = "C0", ax = ax )
test_df.plot.scatter( x = "Density", y = "Speed", 
    label = "Test Data", c = "C9", ax = ax )
ax.plot( plot_X, plot_Y, c = "red", label = "poly reg")
ax.legend()

<matplotlib.legend.Legend at 0x1fbe4bd6cd0>

考慮守恆定律 \(q=kv\)，並透過將等式的兩側乘以密度\(k\)，可以得到流量 - 密度的公式：

\[q = \dfrac{v_f \cdot k}{ \left[1 + \left(\dfrac{k}{k_c} \right)^m \right]^{\frac{2}{m}} } \]

可以根據公式畫出擬合圖線。

fig, ax = plt.subplots()
plot_X = np.arange(0.01, df[['Density']].values.max(), 0.01)
plot_Y = popt[0] * plot_X / (
    ( 1 + ( plot_X / popt[1] )**popt[2] )**(2/popt[2]) 
)
train_df.plot.scatter( x = "Density", y = "Flow", 
    label = "Training Data", c = "C0", ax = ax )
test_df.plot.scatter( x = "Density", y = "Flow", 
    label = "Test Data", c = "C9", ax = ax )
ax.plot( plot_X, plot_Y, c = "red", label = "poly reg")
ax.legend()

<matplotlib.legend.Legend at 0x1fbe4ee2be0>

模型具有以下優勢：

速度是一個相對於密度嚴格單調遞減的函式
在自由流動狀態下，密度從零增加到一個低值（如 10veh/ km/ ln）
速度保持相對較慢，但在嚴重擁堵的條件下（或：在高密度範圍內）不會達到零
平滑地捕捉不同交通狀態之間的轉換
描述速度 - 密度關係的引數少
在 \(q=kv\) 守恆條件下，保持流密度和速度流平面之間的重要一致性

Multilayer Feed-forward Neural Network 多層前饋神經網路

前饋神經網路，是一種最簡單的神經網路，各神經元分層排列，每個神經元只與前一層的神經元相連。接收前一層的輸出，並輸出給下一層，各層間沒有反饋。是應用最廣泛、發展最迅速的人工神經網路之一。

該方法屬於非引數驅動的機器學習方法。

重新劃分資料集

這裡的學習集和測試集選擇原始資料中的 ['Time', 'Dis', 'Week'] 作為 X 資料，選擇 Flow 作為 Y 資料。

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size = 0.1)
train_X = train_df[['Time', 'Dis', 'Week']].values
train_Y = train_df[["Flow"]].values
test_X = test_df[['Time', 'Dis', 'Week']].values
test_Y = test_df[["Flow"]].values

訓練神經網路

神經網路各個引數如下：

引數名	含義	解釋
`learning_rate_init`	初始學習率	每次權重更新時的步長
`batch_size`	批次大小	每次迭代中用於更新權重的樣本數量
`tol`	容差值	當損失函式減少的程度小於容差值時訓練停止。
`hidden_layer_sizes`	隱藏層大小	指定了隱藏層中神經元的數量和層數
`max_iter`	最大迭代次數	訓練過程中允許的最大迭代次數。

這些引數用於配置 MLPRegressor 類的例項，即前饋神經網路模型。透過調整這些引數的值，可以對模型的訓練過程和效能進行控制和最佳化。

800 神經元，3 層

%time # 統計程式碼執行時間
# 匯入包多元感知器工具包
from sklearn.neural_network import MLPRegressor 
# 建立變數
mlpr_800_3 = MLPRegressor(
    learning_rate_init=1e-3, 
    batch_size=20,
    tol=1e-4, 
    hidden_layer_sizes=(
        800, # 每層神經元的個數
        ),  # 隱藏層的層數
    max_iter=1000
)
# 訓練模型
mlpr_800_3.fit(train_X, train_Y)

CPU times: total: 0 ns
Wall time: 0 ns


D:\Python\Python39\Lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:1625: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

MLPRegressor(batch_size=20, hidden_layer_sizes=(800,), max_iter=1000)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

引數調整的注意事項：

較小的學習率可能導致收斂速度較慢，而較大的學習率可能導致無法收斂。通常需要根據具體問題進行調整。
較大的批次大小可以加快訓練速度，但也會增加記憶體需求和計算負荷。較小的批次大小可能會導致訓練不穩定。
容差值（tolerance）的選擇可以幫助避免在達到最優解之前浪費時間和計算資源。
(800, 4)表示有三個隱藏層，其中包含800個神經元。如果不填寫層數，如 (800, )，則預設是 1 個隱藏層。
隱藏層的大小和數量會影響模型的複雜性和表示能力，需要根據問題進行調整和實驗。
如果達到最大迭代次數而沒有收斂，訓練過程將停止。需要根據具體問題和計算資源進行調整。

計算預測值並評估模型精度

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
y_pred = mlpr_800_3.predict(test_X)

dist = {
    'R2 迴歸決定係數': r2_score(test_Y, y_pred),
    'MSE 均方誤差': mean_squared_error(test_Y, y_pred),
    'RMSE 均方根誤差': np.sqrt(mean_squared_error(test_Y, y_pred)),
    'MAE 平均絕對誤差': mean_absolute_error(test_Y, y_pred),
}
result_800_3 = pd.DataFrame(dist, index=['800, 3']).transpose()
result_800_3

	800, 3
R2 迴歸決定係數	0.947422
MSE 均方誤差	3133.644385
RMSE 均方根誤差	55.978964
MAE 平均絕對誤差	41.062573

400 神經元，4 層

嘗試減少神經元的個數，增加層數：

%time # 統計程式碼執行時間
# 匯入包多元感知器工具包
from sklearn.neural_network import MLPRegressor 
# 建立變數
mlpr_400_4 = MLPRegressor(
    learning_rate_init=1e-3, 
    batch_size=20,
    tol=1e-4, 
    hidden_layer_sizes=(400, 400),
    max_iter=1000
)
# 訓練模型
mlpr_400_4.fit(train_X, train_Y)

CPU times: total: 0 ns
Wall time: 0 ns


D:\Python\Python39\Lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:1625: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

MLPRegressor(batch_size=20, hidden_layer_sizes=(400, 400), max_iter=1000)