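Is data exploration a chore? Meet yellowbrick, a powerful visualization tool for feature analysis

Posted by 路遠 on 2019-09-04

Original URL: https://segmentfault.com/a/1190000020270830

Feature visualization

Author: xiaoyu

WeChat public account: Python資料科學 (Python Data Science)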

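Preface

Anyone who has built models knows that a long stretch of feature engineering comes before the model itself, and that exploratory data analysis is an essential part of it: analyzing each feature closely means drawing plots to support selection and judgment.

There are many visualization tools, but few are built specifically for exploratory feature analysis. This post introduces one that is: yellowbrick. The hope is that it cuts down exploration time and helps you get a handle on your features quickly.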

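Features

RadViz

RadViz is a multivariate data visualization algorithm that spaces the features evenly around the circumference of a circle and normalizes the values of each feature. Data scientists typically use it to check how separable the classes are: is there something to learn from this feature set, or is there too much noise?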

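# Load the classification data set
# (load_data is assumed to be the dataset helper used in the yellowbrick
# documentation examples; it is not defined in this post)
data = load_data("occupancy")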

# Specify the features of interest and the classes of the target
features = ["temperature", "relative humidity", "light", "C02", "humidity"]
classes = ["unoccupied", "occupied"]

# Extract the instances and target
X = data[features]
y = data.occupancy

# Import the visualizer
from yellowbrick.features import RadViz

# Instantiate the visualizer
visualizer = RadViz(classes=classes, features=features)

visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.poof()         # Draw/show/poof the data

The RadViz plot above suggests that, of the five dimensions, temperature has a comparatively strong influence on the target class.
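To make the RadViz description concrete, here is a minimal NumPy sketch of the usual RadViz projection: min-max normalize each feature, place one anchor per feature on the unit circle, and plot each instance at the weighted average of the anchors. The function name `radviz_project` and the demo data are made up for illustration; this is not yellowbrick's internal code.

```python
import numpy as np

def radviz_project(X):
    """Project rows of X (n_samples, n_features) onto the 2D RadViz plane."""
    X = np.asarray(X, dtype=float)
    # Min-max normalize each feature to [0, 1]; constant columns map to 0.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    W = (X - X.min(axis=0)) / span
    # One anchor per feature, evenly spaced around the unit circle.
    angles = 2 * np.pi * np.arange(X.shape[1]) / X.shape[1]
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # Each instance lands at the average of the anchors, weighted by its
    # normalized feature values (all-zero rows are left at the origin).
    totals = W.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    return (W @ anchors) / totals

# Hypothetical demo data: four instances, four features.
X_demo = np.array([
    [1.0, 0.0, 0.0, 0.0],   # dominated by feature 0
    [0.0, 0.0, 1.0, 0.0],   # dominated by feature 2
    [1.0, 1.0, 1.0, 1.0],   # every feature pulls equally
    [0.0, 0.0, 0.0, 0.0],   # nothing pulls at all
])
points = radviz_project(X_demo)
```

An instance dominated by one feature sits on that feature's anchor, while an instance pulled equally by all features sits at the center. That is why a class that clusters near one anchor hints that the corresponding feature carries signal.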

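Rank 1D

A one-dimensional ranking of features applies a ranking algorithm that considers only one feature at a time. By default it uses the Shapiro-Wilk algorithm to assess the normality of the distribution of instances with respect to each feature, then draws a bar chart showing the relative rank of each feature.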

from yellowbrick.features import Rank1D

# Instantiate the 1D visualizer with the Shapiro ranking algorithm
visualizer = Rank1D(features=features, algorithm='shapiro')

visualizer.fit(X, y)                # Fit the data to the visualizer
visualizer.transform(X)             # Transform the data
visualizer.poof()                   # Draw/show/poof the data

PCA Projection

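The PCA decomposition visualizer uses principal component analysis to reduce high-dimensional data to two or three dimensions so that each instance can be plotted in a scatter plot. Using PCA means the projected dataset can be analyzed along its axes of principal variation and interpreted to determine whether spherical distance metrics can be used.

Biplot

The PCA projection can be enhanced into a biplot, whose points are the projected instances and whose vectors represent the structure of the data in the high-dimensional space. With the proj_features=True flag, a vector for each feature in the dataset is drawn on the scatter plot in the direction of that feature's maximum variance. These structures can be used to analyze how important a feature is to the decomposition, or to find features with correlated variance for further analysis.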

# Load the regression data set (the target, concrete strength, is continuous)
data = load_data('concrete')

# Specify the features of interest and the target
target = "strength"
features = [
    'cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age'
]

# Extract the instance data and the target
X = data[features]
y = data[target]

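# Import the visualizer (module path as in the yellowbrick release current in 2019)
from yellowbrick.features.pca import PCADecomposition

visualizer = PCADecomposition(scale=True, proj_features=True)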
visualizer.fit_transform(X, y)
visualizer.poof()
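For readers who want to see what sits behind a PCA projection and its biplot vectors, here is a rough NumPy sketch using the singular value decomposition. The helper `pca_project` and the synthetic data are assumptions made for illustration, not yellowbrick's implementation.

```python
import numpy as np

def pca_project(X, n_components=2, scale=True):
    """Return (projected points, principal axes, explained variance ratios)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    if scale:
        # Scale to unit variance so no feature dominates by magnitude alone.
        std = Xc.std(axis=0)
        std[std == 0] = 1.0
        Xc = Xc / std
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    projected = Xc @ components.T
    explained = (S ** 2) / np.sum(S ** 2)
    return projected, components, explained[:n_components]

# Hypothetical data: features 0 and 1 are almost perfectly correlated,
# feature 2 is independent noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X_demo = np.column_stack([
    t,
    2 * t + 0.01 * rng.normal(size=200),
    rng.normal(size=200),
])
projected, components, explained = pca_project(X_demo)

# Row j of `loadings` is the 2D vector a biplot would draw for feature j.
loadings = components.T
```

In this sketch the two correlated features end up with nearly identical loading vectors, which is the kind of pattern a biplot makes visible at a glance.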

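Feature Importance

Feature engineering involves selecting the minimum set of features needed to produce a valid model, because the more features a model contains, the more complex it is (and the sparser the data), so the more sensitive the model is to errors due to variance. A common way to eliminate features is to describe their relative importance to the model, then drop weak features or combinations of features and re-evaluate to see whether the model fares better during cross-validation.

In scikit-learn, decision tree models and ensembles of trees (such as Random Forest, Gradient Boosting, and AdaBoost) provide a feature_importances_ attribute once fitted. Yellowbrick's FeatureImportances visualizer uses this attribute to rank and plot relative importances.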

import matplotlib.pyplot as plt

from sklearn.ensemble import GradientBoostingClassifier

from yellowbrick.features.importances import FeatureImportances

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

viz = FeatureImportances(GradientBoostingClassifier(), ax=ax)
viz.fit(X, y)
viz.poof()

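Recursive Feature Elimination

Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached. Features are ranked by the model's coef_ or feature_importances_ attribute, and by recursively eliminating a small number of features per loop, RFE attempts to remove dependencies and collinearity that may exist in the model.

RFE requires a specified number of features to keep, but it is usually not known in advance how many features are valid. To find the optimal number, cross-validation is used together with RFE to score different feature subsets and select the best-scoring collection. The RFECV visualizer plots the number of features in the model against the cross-validated test score and its variability, and marks the selected number of features.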

from sklearn.svm import SVC
from sklearn.datasets import make_classification

from yellowbrick.features import RFECV

# Create a dataset with only 3 informative features
X, y = make_classification(
    n_samples=1000, n_features=25, n_informative=3, n_redundant=2,
    n_repeated=0, n_classes=8, n_clusters_per_class=1, random_state=0
)

# Create RFECV visualizer with linear SVM classifier
viz = RFECV(SVC(kernel='linear', C=1))
viz.fit(X, y)
viz.poof()

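The plot shows an ideal RFECV curve: the curve jumps to excellent accuracy once the three informative features are captured, then accuracy gradually declines as non-informative features are added to the model. The shaded area represents the variability of cross-validation, one standard deviation above and below the mean accuracy score drawn by the curve.

Below is a real dataset, where we can see the effect of RFECV on a credit default binary classifier.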

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

data = load_data('credit')

target = 'default'
features = [col for col in data.columns if col != target]

X = data[features]
y = data[target]

cv = StratifiedKFold(5)
oz = RFECV(RandomForestClassifier(), cv=cv, scoring='f1_weighted')

oz.fit(X, y)
oz.poof()

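In this example, 19 features were selected, although the model's f1 score does not seem to improve much after about 5 features. Which features are eliminated plays a large role in the outcome of each recursion; changing the step parameter to eliminate more than one feature per step may help remove the worst features early and strengthen the remaining ones (and can also speed up feature elimination for datasets with a large number of features).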
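The elimination loop itself is short. The sketch below is a simplified stand-in that ranks features by the absolute value of least-squares coefficients and drops `step` of them per round; it skips the cross-validation that RFECV adds, and the function name and synthetic data are invented for illustration.

```python
import numpy as np

def recursive_feature_elimination(X, y, n_features_to_select, step=1):
    """Return the indices of the features that survive elimination."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize columns so coefficient magnitudes are comparable.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = y - y.mean()
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Refit on the surviving features and rank them by |coefficient|.
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Never drop below the requested number of features.
        n_drop = min(step, len(remaining) - n_features_to_select)
        weakest = set(np.argsort(np.abs(coef))[:n_drop].tolist())
        remaining = [f for i, f in enumerate(remaining) if i not in weakest]
    return remaining

# Hypothetical data: only features 1, 4 and 6 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
y_demo = (3.0 * X_demo[:, 1] - 2.0 * X_demo[:, 4] + 1.5 * X_demo[:, 6]
          + rng.normal(scale=0.1, size=300))
selected = recursive_feature_elimination(
    X_demo, y_demo, n_features_to_select=3, step=2
)
```

With step=2 the loop needs three rounds (8 to 6, 6 to 4, 4 to 3) instead of five, which is the speed-up the step parameter is meant to buy.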

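Residuals Plot

In the context of regression models, a residual is the difference between the observed value of the target variable (y) and the predicted value (ŷ), that is, the error of the prediction. The residuals plot shows the residuals on the vertical axis against the dependent variable on the horizontal axis, which lets you spot regions of the target that may be prone to more or less error.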

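from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from yellowbrick.regressor import ResidualsPlot

# Split the data into train and test sets. The post never shows this step,
# so the 80/20 split is an assumption; X and y should come from a
# regression dataset such as the concrete data loaded earlier.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the linear model and visualizer
ridge = Ridge()
visualizer = ResidualsPlot(ridge)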

visualizer.fit(X_train, y_train)  # Fit the training data to the model
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.poof()                 # Draw/show/poof the data
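As a quick illustration of what a residual is, the sketch below fits a straight line with ordinary least squares in plain NumPy and computes y minus ŷ by hand. It swaps in ordinary least squares for the Ridge model used above, and the data is synthetic.

```python
import numpy as np

# Hypothetical data: a noisy straight line.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

# Fit y ~ slope * x + intercept with ordinary least squares.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# Residual = observed value minus predicted value.
y_pred = slope * x + intercept
residuals = y - y_pred
```

For a well-specified linear model the residuals scatter around zero with no visible pattern; a curve or a funnel shape in the residuals plot would suggest the model is missing structure.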

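Regularization: Alpha Selection

Regularization is designed to penalize model complexity, so the higher the alpha, the less complex the model, which reduces error due to variance (overfitting). An alpha that is too high, on the other hand, increases error due to bias (underfitting). It is therefore important to choose an optimal alpha that minimizes error in both directions. The AlphaSelection visualizer demonstrates how different values of alpha influence model selection during the regularization of linear models. Generally speaking, alpha increases the effect of regularization: if alpha is zero there is no regularization, and the higher the alpha, the more the regularization parameter influences the final model.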

import numpy as np

from sklearn.linear_model import LassoCV
from yellowbrick.regressor import AlphaSelection

# Create a list of alphas to cross-validate against
alphas = np.logspace(-10, 1, 400)

# Instantiate the linear model and visualizer
model = LassoCV(alphas=alphas)
visualizer = AlphaSelection(model)

visualizer.fit(X, y)
g = visualizer.poof()
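To see how alpha shrinks a model, here is a small sketch that uses the closed-form ridge solution rather than the Lasso from the example above (Lasso has no closed form, so ridge is swapped in purely for illustration). The function name and data are hypothetical.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution (X^T X + alpha * I)^-1 X^T y, no intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Hypothetical data: five features, two of which are irrelevant.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
true_coef = np.array([4.0, -3.0, 0.0, 2.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=50)

# The coefficient norm shrinks as alpha grows.
alphas_demo = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = [
    float(np.linalg.norm(ridge_coefficients(X_demo, y_demo, a)))
    for a in alphas_demo
]
```

With alpha at zero the result matches plain least squares; as alpha grows the coefficients are pulled toward zero, trading variance for bias.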

Class Prediction Error

The class prediction error plot provides a quick way to see how good a classifier is at predicting the right classes.

from sklearn.ensemble import RandomForestClassifier

from yellowbrick.classifier import ClassPredictionError

# Instantiate the classification model and visualizer
visualizer = ClassPredictionError(
    RandomForestClassifier(), classes=classes
)

# Fit the training data to the visualizer
visualizer.fit(X_train, y_train)

# Evaluate the model on the test data
visualizer.score(X_test, y_test)

# Draw visualization
g = visualizer.poof()
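The numbers behind such a chart are simple counts. As a rough sketch (assuming one bar per actual class, split by predicted class, which carries the same information as a confusion matrix), they can be tallied with the standard library. The labels below are made up.

```python
from collections import Counter, defaultdict

def class_prediction_counts(y_true, y_pred):
    """For each actual class, count how often each class was predicted."""
    counts = defaultdict(Counter)
    for actual, predicted in zip(y_true, y_pred):
        counts[actual][predicted] += 1
    return {actual: dict(c) for actual, c in counts.items()}

# Hypothetical labels for six instances.
y_true_demo = ["occupied", "occupied", "unoccupied",
               "unoccupied", "unoccupied", "occupied"]
y_pred_demo = ["occupied", "unoccupied", "unoccupied",
               "unoccupied", "occupied", "occupied"]
bars = class_prediction_counts(y_true_demo, y_pred_demo)
```

Each inner dictionary corresponds to one stacked bar: the segment whose key matches the bar's own class is the correct predictions, and every other segment is an error.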

Visualizations of classification metrics are also available, of course, including the confusion matrix, ROC/AUC, precision/recall, and more.

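Discrimination Threshold (binary classification)

A visualization of precision, recall, f1 score, and queue rate with respect to the discrimination threshold of a binary classifier. The discrimination threshold is the probability or score at which the positive class is chosen over the negative class. It is usually set to 50%, but the threshold can be adjusted to increase or decrease sensitivity to false positives or to other application factors.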

from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import DiscriminationThreshold

# Instantiate the classification model and visualizer
logistic = LogisticRegression()
visualizer = DiscriminationThreshold(logistic)

visualizer.fit(X, y)  # Fit the training data to the visualizer
visualizer.poof()     # Draw/show/poof the data
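The four curves can be computed by hand for any single threshold. Below is a minimal pure-Python sketch; it takes queue rate to mean the fraction of instances scored at or above the threshold (that is, flagged as positive), and the labels and scores are invented for illustration.

```python
def threshold_metrics(y_true, scores, threshold):
    """Return (precision, recall, f1, queue_rate) at one threshold."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and not p)
    # Fall back to 0.0 when a ratio is undefined (a choice made for this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Fraction of all instances that would be flagged as positive.
    queue_rate = sum(predicted) / len(predicted)
    return precision, recall, f1, queue_rate

# Hypothetical true labels and classifier scores.
y_true_demo = [0, 0, 1, 0, 1, 1, 0, 1]
scores_demo = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

at_half = threshold_metrics(y_true_demo, scores_demo, 0.5)
at_low = threshold_metrics(y_true_demo, scores_demo, 0.2)
```

Lowering the threshold from 0.5 to 0.2 raises recall to 1.0 in this toy example, but precision drops and the queue rate climbs, which is exactly the trade-off the plot is meant to expose.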

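Elbow Method (clustering)

KElbowVisualizer implements the "elbow" method, which helps data scientists select the optimal number of clusters by fitting the model with a range of values for K. If the line chart resembles an arm, then the "elbow" (the point of inflection on the curve) is a good indication that the underlying model fits best at that point.

In the example below, KElbowVisualizer fits a KMeans model on a synthetic dataset with 8 random clusters of points (12 features in the code below) for a range of K values from 4 to 11. When the model is fit with 8 clusters we can see an "elbow" in the graph, which in this case we know to be the optimal number.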

from sklearn.datasets import make_blobs

# Create synthetic dataset with 8 random clusters
X, y = make_blobs(centers=8, n_features=12, shuffle=True, random_state=42)

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.poof()    # Draw/show/poof the data
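To show where an elbow comes from, here is a bare-bones NumPy k-means that records the within-cluster sum of squares (inertia) for several values of K. It is a sketch under simplifying assumptions: a deterministic farthest-point initialization, a fixed number of iterations, and four well-separated synthetic blobs. It is not how KElbowVisualizer or scikit-learn's KMeans is implemented.

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=20):
    """Run a basic Lloyd's k-means and return the within-cluster sum of squares."""
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Hypothetical data: four tight blobs on the corners of a square.
rng = np.random.default_rng(0)
true_centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
X_demo = np.vstack([
    c + rng.normal(scale=0.5, size=(50, 2)) for c in true_centers
])

inertias = {k: kmeans_inertia(X_demo, k) for k in range(1, 8)}
```

Inertia falls steeply until K reaches the true number of blobs (4 here) and barely moves afterwards; that bend is the elbow.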

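Intercluster Distance Maps

Intercluster distance maps display an embedding of the cluster centers in two dimensions with the distances to other centers preserved; that is, the closer two centers are in the visualization, the closer they are in the original feature space. The clusters are sized according to a scoring metric. By default they are sized by membership, i.e. the number of instances that belong to each center, which gives a sense of the relative importance of the clusters. Note, however, that two clusters overlapping in the 2D space does not imply that they overlap in the original feature space.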

from sklearn.datasets import make_blobs

# Make 12 blobs dataset
X, y = make_blobs(centers=12, n_samples=1000, n_features=16, shuffle=True)

from sklearn.cluster import KMeans
from yellowbrick.cluster import InterclusterDistance

# Instantiate the clustering model and visualizer
visualizer = InterclusterDistance(KMeans(9))

visualizer.fit(X) # Fit the training data to the visualizer
visualizer.poof() # Draw/show/poof the data

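Model Selection: Learning Curve

A learning curve shows the relationship between the training score and the cross-validated test score for a model trained on varying numbers of training samples. This visualization is typically used to show two things:

How much the model benefits from more data
Whether the model is more sensitive to error due to bias or error due to variance

Yellowbrick can generate this learning curve visualization, and it works for classification, regression, and clustering alike.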
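The idea behind a learning curve can be sketched without any plotting: train on growing slices of the data and record the training and validation error at each size. The NumPy example below uses least-squares regression and a single hold-out split instead of cross-validation, both simplifications chosen for brevity; the data and sizes are invented.

```python
import numpy as np

# Hypothetical regression data: 10 features, unit-variance noise.
rng = np.random.default_rng(0)
n_samples, n_features = 400, 10
X_demo = rng.normal(size=(n_samples, n_features))
y_demo = X_demo @ rng.normal(size=n_features) + rng.normal(size=n_samples)

# Hold out the last 100 rows for validation.
X_train, y_train = X_demo[:300], y_demo[:300]
X_val, y_val = X_demo[300:], y_demo[300:]

def mse(X, y, coef):
    """Mean squared error of a linear model with coefficients `coef`."""
    return float(np.mean((X @ coef - y) ** 2))

# Each entry is (training size, training error, validation error).
curve = []
for size in [15, 30, 60, 120, 300]:
    coef, *_ = np.linalg.lstsq(X_train[:size], y_train[:size], rcond=None)
    curve.append((
        size,
        mse(X_train[:size], y_train[:size], coef),
        mse(X_val, y_val, coef),
    ))
```

With very few samples the model fits its training slice far better than the validation set (a variance problem); as the training size grows the two errors converge, and the level they converge to reflects bias and noise.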

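Model Selection: Validation Curve

Model validation is used to determine how effective a model is on the data it was trained on and how well it generalizes to new input. To measure a model's performance, we first split the dataset into training and test sets, fit the model on the training data, and score it on the held-out test data.

To maximize the score, the model's hyperparameters must be chosen so that the model operates as well as possible in the specified feature space. Most models have multiple hyperparameters, and the best way to choose a combination of them is a grid search. However, it is sometimes useful to plot the influence of a single hyperparameter on the training and test data to determine whether the model is underfitting or overfitting for certain hyperparameter values.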

import numpy as np

from sklearn.tree import DecisionTreeRegressor
from yellowbrick.model_selection import ValidationCurve

# Load a regression dataset
data = load_data('energy')

# Specify features of interest and the target
targets = ["heating load", "cooling load"]
features = [col for col in data.columns if col not in targets]

# Extract the instances and target
X = data[features]
y = data[targets[0]]

viz = ValidationCurve(
    DecisionTreeRegressor(), param_name="max_depth",
    param_range=np.arange(1, 11), cv=10, scoring="r2"
)

# Fit and poof the visualizer
viz.fit(X, y)
viz.poof()
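The same idea can be tried on a toy problem in plain NumPy. In the sketch below the hyperparameter is the degree of a polynomial fit (standing in for max_depth), and a single hold-out split replaces cross-validation; both choices, along with the synthetic data, are assumptions made to keep the example small.

```python
import numpy as np

# Hypothetical data: a noisy sine wave.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=120)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=120)
x_train, y_train = x[:80], y[:80]
x_val, y_val = x[80:], y[80:]

# Sweep the hyperparameter and record the error on both splits.
degrees = list(range(1, 9))
train_mse, val_mse = [], []
for degree in degrees:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse.append(float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)))
    val_mse.append(float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)))

# The degree with the lowest validation error is the one to pick.
best_degree = degrees[int(np.argmin(val_mse))]
```

Training error can only go down as the degree increases, so it is the validation error that reveals where the model stops underfitting and where extra flexibility stops paying off.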

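Summary

I think yellowbrick is a very good tool: first, it addresses visualization during feature engineering and modeling and greatly simplifies the work; second, the various visualizations can fill in blind spots in one's own understanding of modeling.

This post covers only some of the visualization features for modeling. For the complete feature set, see: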

https://www.scikit-yb.org/en/...
