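Scikit Learn: Machine Learning in Python

Warning

Where I could not fully understand a sentence, I translated it loosely according to my own understanding.

Translated from: Scikit Learn: Machine Learning in Python

Authors: Fabian Pedregosa, Gael Varoquaux

Prerequisites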

Numpy, Scipy
IPython
matplotlib
scikit-learn

Warning: as of version 0.9 (released in September 2011), the import path for scikit-learn changed from scikits.learn to sklearn.

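Loading example data

First we load some data to play with. The data we will use is a very simple and famous flower dataset: Anderson's Iris dataset.

We have 150 observations of irises, each with several measurements: sepal length, sepal width, petal length, and petal width. We also have each flower's subtype: Iris setosa, Iris versicolor, or Iris virginica.

To load the dataset into a Python object: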

In [1]: from sklearn import datasets
In [2]: iris = datasets.load_iris()

The data is stored in the .data member, which is an (n_samples, n_features) array.

In [3]: iris.data.shape
Out[3]: (150, 4)

The class of each observation is stored in the .target attribute of the dataset. This is a one-dimensional integer array of length n_samples:

In [5]: iris.target.shape
Out[5]: (150,)

In [6]: import numpy as np

In [7]: np.unique(iris.target)
Out[7]: array([0, 1, 2])

An example of reshaping data: the digits dataset

The digits dataset¹ consists of 1797 images, each an 8x8 pixel image representing a handwritten digit.

In [8]: digits = datasets.load_digits()

In [9]: digits.images.shape
Out[9]: (1797, 8, 8)

In [10]: import pylab as pl

In [11]: pl.imshow(digits.images[0], cmap=pl.cm.gray_r) 
Out[11]: <matplotlib.image.AxesImage at 0x3285b90>

In [13]: pl.show()

To use this dataset with the scikit, we transform each 8x8 image into a vector of length 64. (Translator's note: alternatively, use digits.data directly.)

In [12]: data = digits.images.reshape((digits.images.shape[0], -1))
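The reshape above simply lays the rows of each image end to end. A minimal pure-Python sketch of the same idea follows, using two made-up 2x3 "images" in place of the 8x8 digits. The flatten_image helper is a hypothetical name for this illustration and is not part of NumPy or scikit-learn.

```python
def flatten_image(image):
    """Concatenate the rows of a 2-D image (a list of lists) into one flat list."""
    return [pixel for row in image for pixel in row]


# Two tiny 2x3 "images" standing in for the 8x8 digit images (made-up values).
images = [
    [[0, 1, 2], [3, 4, 5]],
    [[5, 4, 3], [2, 1, 0]],
]

# One row per sample, one column per pixel: an (n_samples, n_features) layout.
data = [flatten_image(img) for img in images]
print(len(data), len(data[0]))
```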

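Learning and predicting

Now that we have some data, we would like to learn from it and predict on new data. In scikit-learn, we learn from existing data by creating an estimator and calling its fit(X, Y) method.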

In [14]: from sklearn import svm

In [15]: clf = svm.LinearSVC()

In [16]: clf.fit(iris.data, iris.target) # learn from the data 
Out[16]: 
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     tol=0.0001, verbose=0)

Once we have learned from the data, we can use our model to predict the most likely outcome on unseen data.

In [17]: clf.predict([[ 5.0,  3.6,  1.3,  0.25]])
Out[17]: array([0], dtype=int32)

Note: we can access the parameters of the model through its attributes ending with an underscore:

In [18]: clf.coef_  
Out[18]: 
array([[ 0.18424352,  0.45122644, -0.8079467 , -0.45071302],
       [ 0.05190619, -0.89423619,  0.40519245, -0.93781587],
       [-0.85087844, -0.98667529,  1.38088883,  1.86538111]])
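The fit/predict convention and the trailing-underscore attributes can be illustrated with a toy estimator. This is only a sketch of the convention: MajorityClassifier is a hypothetical class written for this example, not a scikit-learn object.

```python
from collections import Counter


class MajorityClassifier:
    """Toy estimator that always predicts the most frequent training label."""

    def fit(self, X, y):
        # State learned from the data is stored in an attribute ending with "_".
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # One prediction per row of X, ignoring the feature values entirely.
        return [self.majority_ for _ in X]


clf_toy = MajorityClassifier().fit([[0.0], [1.0], [2.0]], [1, 1, 0])
print(clf_toy.majority_)
print(clf_toy.predict([[5.0], [6.0]]))
```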

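Classification

k-nearest neighbors (KNN) classifier

The simplest possible classifier is the nearest neighbor: given a new observation, assign it the label of the training sample closest to it in n-dimensional space, where n is the number of features in each sample.

The k-nearest neighbors² classifier internally uses an algorithm based on ball trees³ to represent the samples it is trained on.

KNN classification example: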

In [19]: # Create and fit a nearest-neighbor classifier

In [20]: from sklearn import neighbors

In [21]: knn = neighbors.KNeighborsClassifier()

In [22]: knn.fit(iris.data, iris.target) 
Out[22]: 
KNeighborsClassifier(algorithm='auto', leaf_size=30, n_neighbors=5, p=2,
           warn_on_equidistant=True, weights='uniform')

In [23]: knn.predict([[0.1, 0.2, 0.3, 0.4]])
Out[23]: array([0])
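The nearest-neighbor rule itself fits in a few lines. The sketch below uses a brute-force search over all training points rather than a ball tree, so it illustrates the decision rule only and is not how scikit-learn implements it. The knn_predict helper and the toy data are assumptions made for this example.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(X_train, y_train, x, k=1):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Brute force: rank every training sample by its distance to x.
    order = sorted(range(len(X_train)), key=lambda i: squared_distance(X_train[i], x))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Two well-separated toy groups (made-up values).
X_train = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y_train = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_train, y_train, [0.2, 0.2], k=3))
print(knn_predict(X_train, y_train, [0.8, 0.9], k=3))
```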

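Training set and testing set

When validating a learning algorithm, it is important not to test the predictions of an estimator on the data used to fit it. Indeed, with a 1-nearest-neighbor estimator, every training point is its own nearest neighbor, so we would always get perfect prediction on the training set.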

In [24]: perm = np.random.permutation(iris.target.size)

In [25]: iris.data = iris.data[perm]

In [26]: iris.target = iris.target[perm]

In [27]: knn.fit(iris.data[:100], iris.target[:100]) 
Out[27]: 
KNeighborsClassifier(algorithm='auto', leaf_size=30, n_neighbors=5, p=2,
           warn_on_equidistant=True, weights='uniform')

In [28]: knn.score(iris.data[100:], iris.target[100:]) 
/usr/lib/python2.7/site-packages/sklearn/neighbors/classification.py:129: NeighborsWarning: kneighbors: neighbor k+1 and neighbor k have the same distance: results will be dependent on data order.
  neigh_dist, neigh_ind = self.kneighbors(X)
Out[28]: 0.95999999999999996
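The hold-out procedure above (shuffle, fit on the first part, score on the rest) can be sketched without NumPy. This is a minimal illustration: shuffled_split and its seed argument are hypothetical names chosen for this example, and the data is made up.

```python
import random


def shuffled_split(X, y, n_train, seed=0):
    """Shuffle sample indices, then hold out everything after the first n_train."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # same permutation applied to X and y
    train, test = indices[:n_train], indices[n_train:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


# Toy data: the single feature value i belongs to class i // 3.
X = [[float(i)] for i in range(9)]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

X_tr, y_tr, X_te, y_te = shuffled_split(X, y, n_train=6)
print(sorted(y_tr), sorted(y_te))
```

Indexing X and y with the same permutation keeps each observation paired with its own label, which is what the iris.data[perm] and iris.target[perm] lines do.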

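Bonus question: why did we use a random permutation?

Support vector machines (SVMs) for classification

Linear SVMs

SVMs⁴ try to build a hyperplane that maximizes the margin between the two classes. An SVM selects a subset of the input, called the support vectors, which are the observations closest to the separating hyperplane.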

In [60]: from sklearn import svm

In [61]: svc = svm.SVC(kernel='linear')

In [62]: svc.fit(iris.data, iris.target)
Out[62]: 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', probability=False, shrinking=True, tol=0.001,
  verbose=False)

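There are several support vector machine implementations in scikit-learn. The most commonly used ones are svm.SVC, svm.NuSVC and svm.LinearSVC; "SVC" stands for Support Vector Classifier (there are also SVMs for regression, which are called "SVR" in scikit-learn).

Exercise

Train an svm.SVC on the digits dataset. Leave out the last 10% and test prediction performance on these observations.

Using kernels

Classes are not always separable by a hyperplane, so it is desirable to have a decision function that is not linear but may be, for instance, polynomial or exponential:

Linear kernel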

svc = svm.SVC(kernel='linear')

Polynomial kernel

svc = svm.SVC(kernel='poly', degree=3)  # degree: polynomial degree

RBF kernel (radial basis function)⁵

svc = svm.SVC(kernel='rbf')  # gamma: inverse of size of radial kernel
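To make the three kernels concrete, here are their textbook forms as plain Python functions. This is a sketch for intuition only: the function names and the default values of gamma and coef0 are assumptions made for this example and are not claimed to match scikit-learn's defaults.

```python
import math


def linear_kernel(x, z):
    """Plain dot product <x, z>."""
    return sum(a * b for a, b in zip(x, z))


def poly_kernel(x, z, degree=3, gamma=1.0, coef0=0.0):
    """Polynomial kernel (gamma * <x, z> + coef0) ** degree."""
    return (gamma * linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


x, z = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, z))
print(poly_kernel(x, z, degree=2))
print(rbf_kernel(x, z, gamma=0.5))
```

A larger gamma makes the RBF kernel decay faster with distance, so each training point influences a smaller neighborhood.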

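Exercise

Which of the kernels noted above has better prediction performance on the digits dataset? (Translator's answer: the first two.)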

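Clustering: grouping observations together

Given the iris dataset, if we knew that there were three types of iris but did not have access to their labels, we could try unsupervised learning: we could cluster the observations into several groups by some criterion.

K-means clustering

The simplest clustering algorithm is k-means. It divides a set into k clusters, assigning each observation to the cluster whose mean is closest to it (in n-dimensional space); the means are then recomputed. This operation is run iteratively until the clusters converge, for a maximum of max_iter rounds.⁶

(An alternative implementation of k-means is available in SciPy's cluster package. The scikit-learn implementation differs from it by offering an object API and several additional features, including smart initialization.)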

In [82]: from sklearn import cluster, datasets

In [83]: iris = datasets.load_iris()

In [84]: k_means = cluster.KMeans(k=3)

In [85]: k_means.fit(iris.data) 
Out[85]: 
KMeans(copy_x=True, init='k-means++', k=3, max_iter=300, n_init=10, n_jobs=1,
    precompute_distances=True,
    random_state=<mtrand.RandomState object at 0x7f4d860642d0>, tol=0.0001,
    verbose=0)

In [86]: print k_means.labels_[::10]
[1 1 1 1 1 2 2 2 2 2 0 0 0 0 0]

In [87]: print iris.target[::10]
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
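The assign-then-recompute loop described above can be written out directly. The sketch below is a plain version of these iterations that starts from explicitly given centers instead of the k-means++ initialization shown in the estimator output. The kmeans function and the toy points are assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, max_iter=100):
    """Alternate assignment and mean updates, starting from the given centers."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest center.
        labels = [min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return labels, centers


points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```

As in the session above, the cluster numbers are arbitrary: a clustering can match the true classes perfectly while using different integer labels.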

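Application to image compression

Translator's note: Lena is a classic example image in image processing, with 8-bit grayscale depth and a size of 512 x 512.

Clustering can be seen as a way of choosing a small number of observations to represent the information. For instance, it can be used to posterize an image (converting a continuous gradation of tone into several regions of fewer tones):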

In [95]: from scipy import misc

In [96]: lena = misc.lena().astype(np.float32)

In [97]: X = lena.reshape((-1, 1)) # We need an (n_sample, n_feature) array

In [98]: k_means = cluster.KMeans(5)

In [99]: k_means.fit(X)
Out[99]: 
KMeans(copy_x=True, init='k-means++', k=5, max_iter=300, n_init=10, n_jobs=1,
    precompute_distances=True,
    random_state=<mtrand.RandomState object at 0x7f4d860642d0>, tol=0.0001,
    verbose=0)

In [100]: values = k_means.cluster_centers_.squeeze()

In [101]: labels = k_means.labels_

In [102]: lena_compressed = np.choose(labels, values)

In [103]: lena_compressed.shape = lena.shape
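The effect of the last steps is that every pixel is replaced by the center of its cluster. A small sketch of that replacement on a handful of made-up gray levels follows. The posterize helper and the numbers are assumptions for illustration, and the centers are given by hand instead of being fitted by k-means.

```python
def posterize(pixels, values):
    """Replace each gray level by the nearest of the given representative values."""
    return [min(values, key=lambda v: abs(v - p)) for p in pixels]


pixels = [12, 40, 90, 130, 200, 250]  # made-up gray levels
values = [30, 120, 230]               # made-up cluster centers

compressed = posterize(pixels, values)
print(compressed)
```

After this step the image contains only as many distinct tones as there are clusters, which is what makes it cheaper to store.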

Translator's note: want to see the result?

In [31]: import matplotlib.pyplot as plt

In [32]: plt.gray()

In [33]: plt.imshow(lena_compressed)
Out[33]: <matplotlib.image.AxesImage at 0x4b2c510>

In [34]: plt.show()

The original image can be displayed in the same way.


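Dimension reduction with principal component analysis

The cloud of points spanned by the observations above is very flat in one direction, so one feature can almost be computed exactly from the two others. PCA finds the directions in which the data is not flat, and it can reduce the dimensionality of the data by projecting onto a subspace.

Warning: depending on your version of scikit-learn, PCA will be in the module decomposition or pca.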

In [75]: from sklearn import decomposition

In [76]: pca = decomposition.PCA(n_components=2)

In [77]: pca.fit(iris.data)
Out[77]: PCA(copy=True, n_components=2, whiten=False)

In [78]: X = pca.transform(iris.data)

Now we can visualize the (transformed) iris dataset:

In [79]: import pylab as pl

In [80]: pl.scatter(X[:, 0], X[:, 1], c=iris.target)
Out[80]: <matplotlib.collections.PathCollection at 0x4104310>
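To see what "the direction in which the data is not flat" means, the leading principal direction can be computed by hand on a tiny dataset. The sketch below uses power iteration on the covariance matrix, a technique chosen here for brevity and not claimed to be what scikit-learn does internally. The function name and the data are assumptions made for this example.

```python
import math


def first_principal_component(X, n_iter=200):
    """Return (mean, direction) of the leading principal component via power iteration."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in X]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(n_iter):
        # Multiply by the covariance matrix, then renormalize to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


# Points lying almost on the line y = 2x: one direction carries nearly all the variance.
X = [[0.0, 0.1], [1.0, 1.9], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
mean, direction = first_principal_component(X)

# Projecting onto that single direction reduces each 2-D point to one number.
projected = [sum((row[j] - mean[j]) * direction[j] for j in range(2)) for row in X]
print(direction)
print(projected)
```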

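PCA is not just useful for visualizing high-dimensional datasets. It can also be used as a preprocessing step to help speed up supervised methods⁷ that are not efficient with high dimensions.

Putting it all together: face recognition

An example showcasing face recognition, using principal component analysis for dimension reduction and support vector machines for classification.

Translator's note: let the program download the data automatically (make sure you are online; the file is large, so expect a long wait), or download the data manually and place it under ./scikit_learn_data/lfw_home/.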

"""
Stripped-down version of the face recognition example by Olivier Grisel

http://scikit-learn.org/dev/auto_examples/applications/face_recognition.html

## original shape of images: 50, 37
"""
import numpy as np
import pylab as pl
from sklearn import cross_val, datasets, decomposition, svm

# ..
# .. load data ..
lfw_people = datasets.fetch_lfw_people(min_faces_per_person=70, resize=0.4)
perm = np.random.permutation(lfw_people.target.size)
lfw_people.data = lfw_people.data[perm]
lfw_people.target = lfw_people.target[perm]
faces = np.reshape(lfw_people.data, (lfw_people.target.shape[0], -1))
train, test = iter(cross_val.StratifiedKFold(lfw_people.target, k=4)).next()
X_train, X_test = faces[train], faces[test]
y_train, y_test = lfw_people.target[train], lfw_people.target[test]

# ..
# .. dimension reduction ..
pca = decomposition.RandomizedPCA(n_components=150, whiten=True)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# ..
# .. classification ..
clf = svm.SVC(C=5., gamma=0.001)
clf.fit(X_train_pca, y_train)

# ..
# .. predict on new images ..
for i in range(10):
    print lfw_people.target_names[clf.predict(X_test_pca[i])[0]]
    _ = pl.imshow(X_test[i].reshape(50, 37), cmap=pl.cm.gray)
    _ = raw_input()

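Full code: face.py

Linear model: from regression to sparsity

Diabetes dataset

The diabetes dataset consists of 10 physiological variables (age, sex, weight, blood pressure) measured on 442 patients, and an indication of disease progression after one year: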

In [104]: diabetes = datasets.load_diabetes()

In [105]: diabetes_X_train = diabetes.data[:-20]

In [106]: diabetes_X_test  = diabetes.data[-20:]

In [107]: diabetes_y_train = diabetes.target[:-20]

In [108]: diabetes_y_test  = diabetes.target[-20:]

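The task at hand is to predict disease progression from the physiological variables.

Sparse models

To improve the conditioning of the problem (uninformative variables, mitigating the curse of dimensionality, use as a feature selection preprocessing step, and so on), we would like to keep only the informative features and set the non-informative ones to 0. This penalization approach⁸, called Lasso⁹, can set some coefficients to zero. Such methods are called sparse methods, and sparsity can be seen as an application of Occam's razor: prefer simpler models to complex ones.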

In [109]: from sklearn import linear_model

In [110]: regr = linear_model.Lasso(alpha=.3)

In [111]: regr.fit(diabetes_X_train, diabetes_y_train)
Out[111]: 
Lasso(alpha=0.3, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute='auto', tol=0.0001,
   warm_start=False)

In [112]: regr.coef_ # very sparse coefficients
Out[112]: 
array([   0.        ,   -0.        ,  497.34075682,  199.17441034,
         -0.        ,   -0.        , -118.89291545,    0.        ,
        430.9379595 ,    0.        ])

In [113]: regr.score(diabetes_X_test, diabetes_y_test) 
Out[113]: 0.55108354530029791

This score is very similar to that of linear regression (ordinary least squares):

In [114]: lin = linear_model.LinearRegression()

In [115]: lin.fit(diabetes_X_train, diabetes_y_train) 
Out[115]: LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [116]: lin.score(diabetes_X_test, diabetes_y_test) 
Out[116]: 0.58507530226905713
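For regressors, the number returned by score is the coefficient of determination R^2, which compares the squared prediction errors with the spread of the targets around their mean. A minimal sketch of that computation follows. The r_squared helper and the numbers are made up for this example.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, [1.1, 1.9, 3.2, 3.8]))  # close predictions
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
```

A perfect model scores 1.0, and a model that always predicts the mean of the targets scores 0.0, which gives a scale for reading the two scores above.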

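Different algorithms for the same problem

Different algorithms can be used to solve the same mathematical problem. For instance, the Lasso object in sklearn solves the lasso regression problem using a coordinate descent method¹⁰, which is efficient on large datasets. However, sklearn also provides the LassoLARS object, using the LARS algorithm, which is very efficient for problems in which the estimated weight vector is very sparse, that is, problems with very few observations.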
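Coordinate descent for the lasso updates one coefficient at a time, and each update applies a soft-thresholding step that can set the coefficient exactly to zero. The sketch below is a bare-bones version without an intercept, minimizing (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1. It is an illustration under those assumptions, not scikit-learn's solver, and the function names and data are made up.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1, with no intercept."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            # Correlation of feature j with the residual computed without feature j.
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j))
                      for i in range(n)) / n
            # Mean squared value of feature j (assumed non-zero here).
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w


# Made-up data: y is exactly twice the first feature; the second feature is noise.
X = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5]]
y = [2.0, 4.0, 6.0, 8.0]
w = lasso_coordinate_descent(X, y, alpha=0.5)
print(w)
```

The coefficient of the uninformative feature ends up exactly zero, while the informative one is shrunk slightly below its least-squares value of 2. This is the sparsity effect described in the previous section.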

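Model selection: choosing estimators and their parameters

Grid search and cross-validated estimators

Grid search

scikit-learn provides an object that, given data, computes the score while fitting an estimator on a grid of parameters, and chooses the parameters that maximize the cross-validation score. This object takes an estimator at construction time and exposes an estimator API: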

In [117]: from sklearn import svm, grid_search

In [118]: gammas = np.logspace(-6, -1, 10)

In [119]: svc = svm.SVC()

In [120]: clf = grid_search.GridSearchCV(estimator=svc, param_grid=dict(gamma=gammas),n_jobs=-1)

In [121]: clf.fit(digits.data[:1000], digits.target[:1000]) 
Out[121]: 
GridSearchCV(cv=None,
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', probability=False, shrinking=True, tol=0.001,
  verbose=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=-1,
       param_grid={'gamma': array([  1.00000e-06,   3.59381e-06,   1.29155e-05,   4.64159e-05,
         1.66810e-04,   5.99484e-04,   2.15443e-03,   7.74264e-03,
         2.78256e-02,   1.00000e-01])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, verbose=0)

In [122]: clf.best_score_
Out[122]: 0.98600097103091122

In [123]: clf.best_estimator_.gamma
Out[123]: 0.0021544346900318843
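The logic of a grid search is a double loop: for every candidate parameter value, fit on each training fold, score on the held-out fold, and keep the value with the best mean score. The sketch below shows that loop with a tiny 1-D nearest-neighbor model whose parameter is k. All names here (k_fold_indices, make_knn, grid_search) and the data are assumptions made for this example, not scikit-learn APIs.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists for contiguous folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def make_knn(k, X_train, y_train):
    """Return a predict function for a 1-D k-nearest-neighbor classifier."""
    def predict(x):
        nearest = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - x))[:k]
        votes = [y_train[i] for i in nearest]
        return max(set(votes), key=votes.count)
    return predict


def grid_search(make_model, param_values, X, y, n_folds=3):
    """Return (best_param, best_score) using mean accuracy over the folds."""
    best_param, best_score = None, float("-inf")
    for param in param_values:
        scores = []
        for train, test in k_fold_indices(len(X), n_folds):
            predict = make_model(param, [X[i] for i in train], [y[i] for i in train])
            hits = sum(predict(X[i]) == y[i] for i in test)
            scores.append(hits / len(test))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:  # ties keep the earlier parameter value
            best_param, best_score = param, mean_score
    return best_param, best_score


# Two made-up groups, around 0-1 and around 10-11, interleaved so every fold sees both.
X = [0.0, 10.0, 0.2, 10.2, 0.4, 10.4, 0.6, 10.6, 0.8, 10.8, 1.0, 11.0]
y = [0, 1] * 6
best_param, best_score = grid_search(make_knn, [1, 3, 9], X, y)
print(best_param, best_score)
```

With k=9 every vote includes all eight training points of a fold, so the classes tie and the score drops, which is why the search prefers a smaller k on this data.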

By default, GridSearchCV uses 3-fold cross-validation. However, if it detects that a classifier is passed rather than a regressor, it uses a stratified 3-fold.
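Stratification means that each fold keeps roughly the same proportion of every class as the full dataset. One simple way to obtain this, sketched below, is to deal the samples of each class out to the folds in turn. This illustrates the idea only and is not scikit-learn's StratifiedKFold implementation; stratified_folds is a hypothetical helper.

```python
from collections import defaultdict


def stratified_folds(y, n_folds=3):
    """Assign each sample index to a fold so that classes are spread evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the samples of one class out to the folds in turn.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Imbalanced toy labels: six samples of class 0 and three of class 1.
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
folds = stratified_folds(y)
for test_indices in folds:
    print(test_indices, [y[i] for i in test_indices])
```

Each fold here holds two samples of class 0 and one of class 1, matching the 2:1 ratio of the whole set, whereas three contiguous folds would leave one fold with only class 1.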

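Cross-validated estimators

Cross-validation to set a parameter can be done more efficiently on an algorithm-by-algorithm basis. This is why, for certain estimators, scikit-learn exposes "CV" estimators that set their parameter automatically by cross-validation.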

In [125]: from sklearn import linear_model, datasets

In [126]: lasso = linear_model.LassoCV()

In [127]: diabetes = datasets.load_diabetes()

In [128]: X_diabetes = diabetes.data

In [129]: y_diabetes = diabetes.target

In [130]: lasso.fit(X_diabetes, y_diabetes)
Out[130]: 
LassoCV(alphas=array([ 2.14804,  2.00327, ...,  0.0023 ,  0.00215]),
    copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000,
    n_alphas=100, normalize=False, precompute='auto', tol=0.0001,
    verbose=False)

In [131]: # The estimator chose automatically its lambda:

In [132]: lasso.alpha 
Out[132]: 0.013180196198701137

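These estimators are named like their counterparts, with 'CV' appended to their name.

Exercise

On the diabetes dataset, find the optimal regularization parameter alpha. (0.016249161908773888)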