Study Notes on "Advice for Applying Machine Learning"

Posted by mathshelly on 2015-04-12

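This post is based on a tutorial from a machine learning course at the University of Bremen. It summarizes some advice on how to approach a new problem with machine learning, including:

Ways to visualize your data
Choosing a machine learning method suited to the problem at hand
Identifying and dealing with overfitting and underfitting
Dealing with large (read: not very small) datasets
Pros and cons of different loss functions

The post builds on Andrew Ng's "Advice for applying Machine Learning". The purpose of this notebook is to illustrate those ideas interactively. Some of the recommendations are debatable; take them as suggestions, not as strict rules.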

In [1]:

import time
import numpy as np
np.random.seed(0)

In [2]:

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:

# Modified from http://scikit-learn.org/stable/auto_examples/plot_learning_curve.html
# Since scikit-learn 0.18, learning_curve lives in sklearn.model_selection
# (older releases: from sklearn.learning_curve import learning_curve).
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curves.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features)
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.

    cv : integer, cross-validation generator, optional
        Number of folds or a cross-validation object. Note that the call
        below hard-codes cv=5, so this argument is currently ignored.
    """
    plt.figure()
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=5, n_jobs=1, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.legend(loc="best")
    plt.grid(True)
    if ylim:
        plt.ylim(ylim)
    plt.title(title)

The dataset

We use sklearn's make_classification function to generate some simple toy data:

In [4] :

from sklearn.datasets import make_classification
X, y = make_classification(1000, n_features=20, n_informative=2, 
                           n_redundant=2, n_classes=2, random_state=0)

from pandas import DataFrame
df = DataFrame(np.hstack((X, y[:, None])), 
               columns=list(range(20)) + ["class"])  # list(...) so this also runs on Python 3

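Note that we generated a dataset for binary classification consisting of 1000 data points with 20 features each. We used pandas' DataFrame class to wrap the data and the class labels in one common data structure. Let's take a look at the first 5 data points: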

In [5]:

df[:5]

Out[5]:

0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 class
0 -1.063780 0.676409 1.069356 -0.217580 0.460215 -0.399167 -0.079188 1.209385 -0.785315 -0.172186 ... -0.993119 0.306935 0.064058 -1.054233 -0.527496 -0.074183 -0.355628 1.057214 -0.902592 0
1 0.070848 -1.695281 2.449449 -0.530494 -0.932962 2.865204 2.435729 -1.618500 1.300717 0.348402 ... 0.225324 0.605563 -0.192101 -0.068027 0.971681 -1.792048 0.017083 -0.375669 -0.623236 1
2 0.940284 -0.492146 0.677956 -0.227754 1.401753 1.231653 -0.777464 0.015616 1.331713 1.084773 ... -0.050120 0.948386 -0.173428 -0.477672 0.760896 1.001158 -0.069464 1.359046 -1.189590 1
3 -0.299517 0.759890 0.182803 -1.550233 0.338218 0.363241 -2.100525 -0.438068 -0.166393 -0.340835 ... 1.178724 2.831480 0.142414 -0.202819 2.405715 0.313305 0.404356 -0.287546 -2.847803 1
4 -2.630627 0.231034 0.042463 0.478851 1.546742 1.637956 -1.532072 -0.734445 0.465855 0.473836 ... -1.061194 -0.888880 1.238409 -0.572829 -1.275339 1.003007 -0.477128 0.098536 0.527804 0

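It is hard to get any clue about the problem by looking at the raw feature values directly, even in this low-dimensional example. Hence there are many ways to present data in a more accessible form; a few of them are discussed in the following sections.

Visualization

When you are handed a new problem, the first step is almost always visualization, that is, looking at your data.

Seaborn is a nice package for statistical data visualization. We use some of its functions to explore the data.

The first step is to generate scatter plots and histograms with pairplot. The two colors correspond to the two classes. To keep things simple, we use only a subset of the features and only the first 50 data points.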

In [6] :

# `height` is called `size` in seaborn releases before 0.9
_ = sns.pairplot(df[:50], vars=[8, 11, 12, 14, 19], hue="class", height=1.5)

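Based on the histograms, we can see that some features are more helpful than others for telling the classes apart. In particular, features 11 and 14 look informative. The scatter plot of these two features shows that the classes are almost linearly separable in this 2D space. Also worth noting: features 12 and 19 are strongly anti-correlated. We can explore correlations more systematically using corrplot: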

In [7] :

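plt.figure(figsize=(12, 10))
# corrplot was removed from later seaborn releases; there,
# sns.heatmap(df.corr(), annot=False) draws a comparable correlation matrix.
_ = sns.corrplot(df, annot=False)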

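This confirms our earlier observations: features 11 and 14 are strongly correlated with the class (they are informative). Moreover, features 12 and 19 are strongly anti-correlated, and feature 19 is strongly correlated with feature 14. So some of the features are redundant. This can be a problem for some classifiers, e.g. naive Bayes, which assumes that all features are independent given the class. The remaining features are mostly noise; they are correlated neither with each other nor with the class.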
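
The kind of check a correlation plot performs can also be done numerically. The following is a minimal sketch using only NumPy; the three synthetic columns, the helper name strongly_correlated_pairs, and the 0.8 threshold are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 500
informative = rng.randn(n)                            # carries the signal
redundant = -2.0 * informative + 0.1 * rng.randn(n)   # nearly a linear copy
noise = rng.randn(n)                                  # unrelated to the others
data = np.column_stack([informative, redundant, noise])

# rowvar=False: columns are variables, rows are observations
corr = np.corrcoef(data, rowvar=False)

def strongly_correlated_pairs(corr, threshold=0.8):
    """Return (i, j, r) for every feature pair with |r| >= threshold."""
    pairs = []
    n_features = corr.shape[0]
    for i in range(n_features):
        for j in range(i + 1, n_features):
            if abs(corr[i, j]) >= threshold:
                pairs.append((i, j, corr[i, j]))
    return pairs

pairs = strongly_correlated_pairs(corr)
print(pairs)
```

Only the pair (0, 1) is reported, with a strongly negative coefficient, which is the same pattern as the anti-correlated features 12 and 19 above.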

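Note that data visualization becomes more challenging when the number of feature dimensions is large and the number of data points is small. We give an example of visualizing high-dimensional data later.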

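Choosing a method

Once we have explored the data visually, we can start applying machine learning. With so many machine learning methods available, it is often hard to decide which one to try first. This simple cheat sheet (credit goes to Andreas Müller and the sklearn team) can help you pick a suitable method for your problem (for an alternative cheat sheet, see http://dlib.net/ml_guide.svg).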

In [8] :

from IPython.display import Image
Image(filename='ml_map.png', width=800, height=600)

Out[8] :

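Since we have 1000 samples, want to predict a category, and have labels, the cheat sheet recommends trying LinearSVC first (LinearSVC stands for support vector classification with a linear kernel and uses an efficient algorithm for this special case). So let's give it a try. LinearSVC requires choosing a regularization; we use the standard L2-norm penalty and C=10. We plot learning curves for both the training score and the validation score (the score is accuracy in this case):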

In [9] :

from sklearn.svm import LinearSVC
plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0)",
X, y, ylim=(0.8, 1.01),
train_sizes=np.linspace(.05, 0.2, 5))

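Note the large gap between the error on the training data and the error on the cross-validation data. What does that mean? We are probably overfitting the training data!

Addressing overfitting

There are several ways to reduce overfitting:

Increase the number of training examples

(getting more data is a common wish of machine learning practitioners)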

In [10] :

plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0)",
                    X, y, ylim=(0.8, 1.1),
                    train_sizes=np.linspace(.1, 1.0, 5))

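We can see that as the amount of training data grows, the validation score rises and the gap shrinks; so we are no longer overfitting. There are different ways of obtaining more data: (a) invest the effort to collect more, (b) create artificial data from the existing data (for images, e.g., by rotation, translation, or distortion), or (c) add artificial noise.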
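
Option (c) is the easiest one to sketch. The helper below is a hypothetical illustration (the name augment_with_noise, the number of copies, and the noise level are assumptions): it appends jittered copies of every sample and keeps the labels unchanged. Whether this helps depends on the noise level being small relative to the distance between the classes.

```python
import numpy as np

def augment_with_noise(X, y, n_copies=2, noise_std=0.05, seed=0):
    """Return X, y extended by n_copies noisy duplicates of every sample."""
    rng = np.random.RandomState(seed)
    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        # Gaussian jitter on the features; the label of each copy is unchanged
        X_parts.append(X + noise_std * rng.randn(*X.shape))
        y_parts.append(y)
    return np.vstack(X_parts), np.concatenate(y_parts)

X_small = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_small = np.array([0, 1, 1])
X_aug, y_aug = augment_with_noise(X_small, y_small)
print(X_aug.shape, y_aug)
```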

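If none of these approaches is feasible and more data cannot be obtained, we can instead

Reduce the dimensionality of the features

(we know from our visualizations that features 11 and 14 are the most informative)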

In [11] :

plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0) Features: 11&14",
                    X[:, [11, 14]], y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))

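Note that this is a bit of cheating, since we selected the features by hand and, moreover, based on more data than we gave to the classifier. We can use automatic feature selection instead: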

In [12] :

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
# SelectKBest(f_classif, k=2) will select the k=2 best features according to their Anova F-value

plot_learning_curve(Pipeline([("fs", SelectKBest(f_classif, k=2)), # select two features
                               ("svc", LinearSVC(C=10.0))]),
                    "SelectKBest(f_classif, k=2) + LinearSVC(C=10.0)",
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))

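This works remarkably well. On this toy dataset, feature selection is easy. Note that feature selection is just one particular way of reducing model complexity. Others are: (a) reducing the degree of a polynomial model in linear regression, (b) reducing the number of nodes/layers of an artificial neural network, (c) increasing the bandwidth of an RBF kernel, and so on.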
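
To make the ANOVA F-value used for ranking features less of a black box, here is a NumPy sketch of the one-way ANOVA statistic per feature: between-class variance divided by within-class variance. The function name anova_f_scores and the synthetic data are assumptions for illustration; the result is expected to match what f_classif computes, but that was not verified here.

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F statistic for each column of X, grouped by label y."""
    classes = np.unique(y)
    n_samples, n_classes = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        ss_between += len(Xc) * (class_mean - grand_mean) ** 2
        ss_within += ((Xc - class_mean) ** 2).sum(axis=0)
    # mean squares: divide by the respective degrees of freedom
    return (ss_between / (n_classes - 1)) / (ss_within / (n_samples - n_classes))

rng = np.random.RandomState(0)
y_demo = np.repeat([0, 1], 100)
X_demo = rng.randn(200, 5)
X_demo[:, 3] += 3.0 * y_demo        # only feature 3 depends on the class

scores = anova_f_scores(X_demo, y_demo)
best_feature = int(np.argmax(scores))
print(best_feature)
```

Selecting the k best features then amounts to keeping the k columns with the largest scores.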

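One question remains: why can't the classifier identify the useful features automatically? Let us first turn to another option for reducing overfitting:

Increase the regularization of the classifier

(decrease the coefficient C of the linear SVC)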

In [13] :

plot_learning_curve(LinearSVC(C=0.1), "LinearSVC(C=0.1)", 
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))

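This already helps a bit. We can also choose the regularization of the classifier automatically using a grid search based on cross-validation: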

In [14] :

# GridSearchCV lives in sklearn.model_selection since scikit-learn 0.18
# (older releases: from sklearn.grid_search import GridSearchCV).
from sklearn.model_selection import GridSearchCV
est = GridSearchCV(LinearSVC(), 
                   param_grid={"C": [0.001, 0.01, 0.1, 1.0, 10.0]})
plot_learning_curve(est, "LinearSVC(C=AUTO)", 
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))
print("Chosen parameter on 100 datapoints: %s" % est.fit(X[:100], y[:100]).best_params_)

Chosen parameter on 100 datapoints: {'C': 0.01}
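
What a cross-validated grid search does internally can be sketched in a few lines. To stay self-contained, the example below swaps LinearSVC for a ridge-regularized least-squares classifier (so the tuned parameter is a penalty strength alpha, roughly the inverse of C), and all function names and data are assumptions made for illustration.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Least-squares classifier with an L2 penalty; targets are -1/+1."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T.dot(X) + alpha * np.eye(n_features), X.T.dot(y))

def accuracy(w, X, y):
    return np.mean(np.sign(X.dot(w)) == y)

def grid_search_cv(X, y, alphas, n_folds=5):
    """Return (best_alpha, mean validation accuracy for every alpha)."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    mean_scores = []
    for alpha in alphas:
        fold_scores = []
        for k in range(n_folds):
            val_idx = folds[k]
            train_idx = np.concatenate(
                [folds[j] for j in range(n_folds) if j != k])
            w = fit_ridge(X[train_idx], y[train_idx], alpha)
            fold_scores.append(accuracy(w, X[val_idx], y[val_idx]))
        mean_scores.append(np.mean(fold_scores))
    best_alpha = alphas[int(np.argmax(mean_scores))]
    return best_alpha, mean_scores

rng = np.random.RandomState(0)
X_demo = rng.randn(100, 20)                       # 1 useful feature, 19 noise
y_demo = np.sign(X_demo[:, 0] + 0.5 * rng.randn(100))
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_alpha, cv_scores = grid_search_cv(X_demo, y_demo, alphas)
print(best_alpha, np.round(cv_scores, 3))
```

The key design point is that every candidate value is scored only on folds it was not trained on, so the chosen value is not simply the one that fits the training data best.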

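In general, feature selection seemed to work better. Can the classifier identify the useful features automatically? Recall that LinearSVC also supports the L1-norm penalty, which yields a sparse solution. A sparse solution corresponds to an implicit feature selection. Let's try this: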

In [15] :

plot_learning_curve(LinearSVC(C=0.1, penalty='l1', dual=False), 
                    "LinearSVC(C=0.1, penalty='l1')", 
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))

This also looks quite good. Let's inspect the learned coefficients:

In [16] :

est = LinearSVC(C=0.1, penalty='l1', dual=False)
est.fit(X[:150], y[:150])  # fit on 150 datapoints
print("Coefficients learned: %s" % est.coef_)
print("Non-zero coefficients: %s" % np.nonzero(est.coef_)[1])

Coefficients learned: [[ 0. 0. 0. 0. 0. 0.01857999 0. 0. 0. 0.004135 0. 1.05241369 0.01971419 0. 0. 0. 0. -0.05665314 0.14106505 0. ]]
Non-zero coefficients: [ 5 9 11 12 17 18]

Most coefficients are zero (the corresponding features are ignored), and by far the largest weight is on feature 11.
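
One way to see why an L1 penalty produces exact zeros while an L2 penalty does not is to compare the shrinkage step each penalty induces on a weight vector. This is only an intuition aid with made-up numbers; it is not how LinearSVC's solver works internally.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal step for the penalty t * ||w||_1: shrink toward 0, clip at 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def l2_shrink(w, t):
    """Proximal step for the penalty (t / 2) * ||w||_2^2: rescale only."""
    return w / (1.0 + t)

w = np.array([-2.0, -0.05, 0.03, 1.5])
w_l1 = soft_threshold(w, 0.1)
w_l2 = l2_shrink(w, 0.1)
print(w_l1)
print(w_l2)
```

The two small weights become exactly zero under the L1 step, whereas the L2 step only scales every weight down a little and leaves all four non-zero.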

A different dataset

We generate another dataset for binary classification and apply LinearSVC again.

In [17]:

from sklearn.datasets import make_circles
X, y = make_circles(n_samples=1000, random_state=2)

In [18]:

plot_learning_curve(LinearSVC(C=0.25), "LinearSVC(C=0.25)", 
                    X, y, ylim=(0.5, 1.0),
                    train_sizes=np.linspace(.1, 1.0, 5))

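Ouch, this is really bad: even the training score is no better than chance. What could be the reason? Would any of the approaches above (more data, feature selection, stronger regularization) help?

It turns out: no. We are in a completely different situation. Before, the training score was always close to perfect and we had to deal with overfitting. This time, the training score is very low as well: we are underfitting. Let's take a look at the data: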

In [19] :

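df = DataFrame(np.hstack((X, y[:, None])), 
               columns=list(range(2)) + ["class"])  # list(...) so this also runs on Python 3
# `height` is called `size` in seaborn releases before 0.9
_ = sns.pairplot(df, vars=[0, 1], hue="class", height=3.5)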

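This data is clearly not linearly separable; more data or fewer features will not help. Our model is wrong for this problem; hence the underfitting.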
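
The two situations seen so far can be summarized as a rule of thumb for reading a single point on a learning curve. The function below is a rough sketch; the name diagnose and both thresholds are illustrative assumptions that would need adjusting for any real problem.

```python
def diagnose(train_score, cv_score, gap_tol=0.05, low_score=0.7):
    """Rough reading of one (training score, validation score) pair.

    gap_tol and low_score are illustrative thresholds, not values from the text.
    """
    if train_score < low_score:
        return "underfitting"   # the model cannot even fit its training data
    if train_score - cv_score > gap_tol:
        return "overfitting"    # fits the training data, generalizes worse
    return "ok"

print(diagnose(0.99, 0.85))   # high training score, large gap
print(diagnose(0.50, 0.48))   # both scores near chance
print(diagnose(0.93, 0.92))   # both high, small gap
```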

Addressing underfitting

Ways to reduce underfitting:

Use more or better features (the distance from the origin should be useful!)
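
To see why the distance from the origin is such a useful feature for two concentric circles, consider this NumPy-only sketch. The radii (1.0 and 0.5), the label assignment, and the threshold are assumptions chosen for illustration and do not reproduce make_circles exactly.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 200
angles = rng.uniform(0.0, 2.0 * np.pi, size=2 * n)
radii = np.concatenate([np.full(n, 1.0), np.full(n, 0.5)])   # outer, inner
labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
points = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

# In (x0, x1) no straight line separates the classes, but the squared
# distance from the origin takes only two values: 1.0 and 0.25.
sq_dist = (points ** 2).sum(axis=1)

threshold = 0.5                       # any value between 0.25 and 1.0 works
predicted = (sq_dist < threshold).astype(int)
accuracy = np.mean(predicted == labels)
print(accuracy)
```

A single threshold on the new feature is a linear decision rule, which is why a linear classifier can succeed once the feature is added.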

In [20] :

# add squared distance from origin as third feature
X_extra = np.hstack((X, X[:, [0]]**2 + X[:, [1]]**2))

plot_learning_curve(LinearSVC(C=0.25), "LinearSVC(C=0.25) + distance feature", 
                    X_extra, y, ylim=(0.5, 1.0),
                    train_sizes=np.linspace(.1, 1.0, 5))

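Excellent! But we had to put some thought into coming up with this feature. Maybe the classifier can do this automatically? That requires

Using a more complex model

(less regularization or a nonlinear kernel)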

In [21] :

from sklearn.svm import SVC
# note: we use the original X without the extra feature
plot_learning_curve(SVC(C=2.5, kernel="rbf", gamma=1.0),
                    "SVC(C=2.5, kernel='rbf', gamma=1.0)",
                    X, y, ylim=(0.5, 1.0), 
                    train_sizes=np.linspace(.1, 1.0, 5))

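Yes, that also gives satisfying results!

Larger datasets and higher-dimensional feature spaces

Back to the original kind of dataset, but this time with many more features and samples, and 10 classes. LinearSVC would be somewhat slow on a dataset of this size; the cheat sheet suggests using SGDClassifier. This classifier learns a linear model (like LinearSVC or logistic regression) but trains it with stochastic gradient descent (as artificial neural networks trained with backpropagation do).

SGDClassifier can sweep over the data in mini-batches, which helps when the data is too large to fit into memory. Cross-validation is not compatible with this technique; use progressive validation instead: here, the estimator is always tested on the next chunk of training data (before it is used for training). After training, it is tested on that chunk again to check how well it has adapted to the data.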
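
The test-then-train loop behind progressive validation does not depend on any particular estimator. The sketch below uses a toy incremental classifier (OnlineCentroids, a hypothetical stand-in that only mimics the partial_fit/score interface) on a synthetic one-dimensional stream, so that the loop itself is easy to follow.

```python
import numpy as np

class OnlineCentroids:
    """Toy incremental classifier: running mean per class, nearest mean wins."""

    def __init__(self, n_classes, n_features):
        self.sums = np.zeros((n_classes, n_features))
        self.counts = np.zeros(n_classes)

    def partial_fit(self, X, y):
        for c in range(len(self.counts)):
            self.sums[c] += X[y == c].sum(axis=0)
            self.counts[c] += np.sum(y == c)
        return self

    def predict(self, X):
        centroids = self.sums / np.maximum(self.counts, 1)[:, None]
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    def score(self, X, y):
        return np.mean(self.predict(X) == y)

def progressive_validation(model, X, y, batch_size):
    """Score each batch before training on it; return the list of scores."""
    scores = []
    for start in range(0, len(y), batch_size):
        X_batch = X[start:start + batch_size]
        y_batch = y[start:start + batch_size]
        if start > 0:                 # nothing to test before the first fit
            scores.append(model.score(X_batch, y_batch))
        model.partial_fit(X_batch, y_batch)
    return scores

rng = np.random.RandomState(0)
y_stream = np.tile([0, 1], 250)                       # 500 alternating labels
X_stream = (2.0 * y_stream - 1.0)[:, None] + 0.1 * rng.randn(500, 1)
scores = progressive_validation(OnlineCentroids(2, 1), X_stream, y_stream, 100)
print(scores)
```

Because every batch is scored before the model has seen it, each score is an honest estimate of generalization at that point in the stream, without a separate hold-out set.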

In [22] :

X, y = make_classification(200000, n_features=200, n_informative=25, 
                           n_redundant=0, n_classes=10, class_sep=2,
                           random_state=0)

In [23]:

from sklearn.linear_model import SGDClassifier
est = SGDClassifier(penalty="l2", alpha=0.001)
progressive_validation_score = []
train_score = []
for datapoint in range(0, 199000, 1000):
    X_batch = X[datapoint:datapoint+1000]
    y_batch = y[datapoint:datapoint+1000]
    if datapoint > 0:
        progressive_validation_score.append(est.score(X_batch, y_batch))
    est.partial_fit(X_batch, y_batch, classes=range(10))
    if datapoint > 0:
        train_score.append(est.score(X_batch, y_batch))

plt.plot(train_score, label="train score")
plt.plot(progressive_validation_score, label="progressive validation score")
plt.xlabel("Mini-batch")
plt.ylabel("Score")
plt.legend(loc='best')

Out[23]:

<matplotlib.legend.Legend at 0x7f6a24e2dfd0>

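The plot tells us that after about 50 mini-batches the validation score no longer improves, so we could stop training at that point. Since the training score is not very high, we are probably underfitting rather than overfitting. It would be nice to try an RBF kernel, but unfortunately SGDClassifier is not compatible with the kernel trick. Alternatives are a multi-layer perceptron, which can also be trained with stochastic gradient descent but is a nonlinear model, or, as the cheat sheet suggests, kernel approximation.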
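
One common form of kernel approximation is random Fourier features: the data is mapped through random cosine features so that plain dot products approximate the RBF kernel, after which any linear model can be trained on the mapped data. The text does not say which approximation is meant, so treat this NumPy sketch (function names, gamma, and the number of components are all assumptions) as one possible instance.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Exact RBF kernel exp(-gamma * ||a - b||^2) for all row pairs."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A.dot(B.T)
    return np.exp(-gamma * sq)

def random_fourier_features(X, n_components, gamma, seed=0):
    """Map X so that Z.dot(Z.T) approximates the RBF kernel matrix."""
    rng = np.random.RandomState(seed)
    # frequencies drawn from the Fourier transform of the kernel: N(0, 2*gamma)
    W = np.sqrt(2.0 * gamma) * rng.randn(X.shape[1], n_components)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
    return np.sqrt(2.0 / n_components) * np.cos(X.dot(W) + b)

rng = np.random.RandomState(1)
X_demo = rng.randn(20, 3)
Z = random_fourier_features(X_demo, n_components=5000, gamma=0.5)
exact = rbf_kernel(X_demo, X_demo, gamma=0.5)
max_error = np.abs(Z.dot(Z.T) - exact).max()
print(max_error)
```

The approximation error shrinks roughly with the square root of the number of components, so the feature count trades accuracy against memory and training time.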

Now for a classic dataset used in machine learning for optical character recognition:

In [24]:

from sklearn.datasets import load_digits
digits = load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
print("Dataset consist of %d samples with %d features each" % (n_samples, n_features))

# Plot images of the digits
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))

plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
_ = plt.title('A selection from the 8*8=64-dimensional digits dataset')

Dataset consist of 1083 samples with 64 features each

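So we have 1083 samples of handwritten digits (0, 1, 2, 3, 4, 5), each of which is an 8*8 grayscale image with 4-bit pixel values (0 to 16). The dimensionality of the features is therefore moderate (64); nevertheless, visualizing this 64-dimensional space is nontrivial. We illustrate different methods for reducing the dimensionality (to two dimensions), based on http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html#example-manifold-plot-lle-digits-py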
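
Both linear methods used in this section boil down to multiplying the data by a 64x2 matrix; they differ only in how that matrix is chosen. The NumPy sketch below contrasts a dense Gaussian random projection with PCA computed through an SVD of the centered data. Note the assumptions: the notebook itself uses SparseRandomProjection and TruncatedSVD (which does not center the data), and the synthetic input here merely stands in for the digits.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project onto the top principal directions via SVD of the centered data."""
    X_centered = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered.dot(Vt[:n_components].T)

def random_project(X, n_components=2, seed=0):
    """Project onto random Gaussian directions (ignores the data entirely)."""
    rng = np.random.RandomState(seed)
    R = rng.randn(X.shape[1], n_components) / np.sqrt(n_components)
    return X.dot(R)

rng = np.random.RandomState(0)
X_demo = rng.randn(300, 64)
X_demo[:, 0] *= 10.0          # one direction with much larger variance

X_pca = pca_project(X_demo)
X_rp = random_project(X_demo)
# PCA's first axis should line up with the high-variance direction
alignment = abs(np.corrcoef(X_pca[:, 0], X_demo[:, 0])[0, 1])
print(X_pca.shape, X_rp.shape, alignment)
```

PCA picks the directions of largest variance, whereas the random projection is cheap and data-independent, so it only preserves structure on average.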

In [25]:

# Helper function based on 
# http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html#example-manifold-plot-lle-digits-py
from matplotlib import offsetbox
def plot_embedding(X, title=None):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure(figsize=(10, 10))
    ax = plt.subplot(111)
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(digits.target[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 12})

    if hasattr(offsetbox, 'AnnotationBbox'):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # just something big
        for i in range(digits.data.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)

A random projection of the data to two dimensions already gives a result that is not too bad:

In [26]:

from sklearn import (manifold, decomposition, random_projection)
rp = random_projection.SparseRandomProjection(n_components=2, random_state=42)
stime = time.time()
X_projected = rp.fit_transform(X)
plot_embedding(X_projected, "Random Projection of the digits (time: %.3fs)" % (time.time() - stime))

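However, there is a well-known method that should generally do better, namely PCA (implemented here with TruncatedSVD, which does not require building the covariance matrix):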

In [27]:

stime = time.time()  # start the timer before the projection so the reported time covers it
X_pca = decomposition.TruncatedSVD(n_components=2).fit_transform(X)
plot_embedding(X_pca,
               "Principal Components projection of the digits (time: %.3fs)" % (time.time() - stime))

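PCA gives a better result, and on this dataset it is even faster. We can get better results still by allowing nonlinear transformations from the 64-dimensional input space to the 2D target space. There are many methods for this; we present only one here: t-SNE.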

In [28]:

tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
stime = time.time()
X_tsne = tsne.fit_transform(X)
plot_embedding(X_tsne,
               "t-SNE embedding of the digits (time: %.3fs)" % (time.time() - stime))

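This is a really nice embedding; it also suggests that it should be possible to separate these classes almost perfectly with a classifier (see the example http://scikit-learn.org/stable/auto_examples/plot_digits_classification.html). The only drawback of t-SNE is that it takes considerably longer to compute and therefore (in its current form) does not scale to large datasets.

Choice of the loss function

The choice of the loss function is also important. Here is an illustration of different loss functions:

In [29]: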

# adapted from http://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_loss_functions.html
xmin, xmax = -4, 4
xx = np.linspace(xmin, xmax, 100)
plt.plot([xmin, 0, 0, xmax], [1, 1, 0, 0], 'k-',
         label="Zero-one loss")
plt.plot(xx, np.where(xx < 1, 1 - xx, 0), 'g-',
         label="Hinge loss")
plt.plot(xx, np.log2(1 + np.exp(-xx)), 'r-',
         label="Log loss")
plt.plot(xx, np.exp(-xx), 'c-',
         label="Exponential loss")
plt.plot(xx, -np.minimum(xx, 0), 'm-',
         label="Perceptron loss")
# the balanced relative margin machine
#R = 2
#plt.plot(xx, np.where(xx < 1, 1 - xx, (np.where(xx > R, xx-R,0))), 'b-',
#         label="L1 Balanced Relative Margin Loss")
plt.ylim((0, 8))
plt.legend(loc="upper right")
plt.xlabel(r"Decision function $f(x)$")
plt.ylabel("$L(y, f(x))$")

Out[29]:

<matplotlib.text.Text at 0x7f6a2879cf90>

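The different loss functions have different advantages:

The zero-one loss is what you actually want in classification. Unfortunately, it is non-convex, so the resulting optimization problem becomes more or less intractable, which makes it impractical.
The hinge loss (used in support vector classification) leads to solutions that are sparse in the data (since it becomes zero for $f(x) > 1$) and is relatively robust to outliers (since it grows only linearly as $f(x) \to -\infty$). It does not provide well-calibrated probabilities.
The log loss (used, e.g., in logistic regression) leads to well-calibrated probabilities. It is therefore a good choice if you want not only a binary prediction but also the probability of the outcome. On the downside, its solutions are not sparse in the data space, and it is more affected by outliers than the hinge loss.
The exponential loss (used in AdaBoost) is very sensitive to outliers (since it grows rapidly as $f(x) \to -\infty$). It is mainly used in AdaBoost because it yields a simple and efficient boosting algorithm there.
The perceptron loss is essentially a shifted version of the hinge loss. The hinge loss also penalizes points that are on the correct side of the boundary but very close to it (the maximum-margin principle). The perceptron loss, in contrast, is satisfied as soon as a data point is on the correct side of the boundary, which leaves the boundary underdetermined if the data is linearly separable and leads to worse generalization than a maximum-margin boundary.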
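
For a side-by-side comparison in numbers rather than curves, the five losses can be written as functions of the margin m = y * f(x), which for a positive example (y = +1) is the quantity on the x-axis of the plot. The function names and the sample margins are illustrative choices; the formulas follow the plotting code, with the zero-one loss counting m = 0 as an error.

```python
import numpy as np

def zero_one_loss(m):
    return np.where(m > 0, 0.0, 1.0)

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)

def log_loss(m):
    return np.log2(1.0 + np.exp(-m))

def exponential_loss(m):
    return np.exp(-m)

def perceptron_loss(m):
    return np.maximum(0.0, -m)

# badly misclassified, on the boundary, inside the margin, safely correct
margins = np.array([-2.0, 0.0, 0.5, 2.0])
for name, loss in [("zero-one", zero_one_loss), ("hinge", hinge_loss),
                   ("log", log_loss), ("exponential", exponential_loss),
                   ("perceptron", perceptron_loss)]:
    print("%-12s %s" % (name, np.round(loss(margins), 3)))
```

At m = 0.5 the point is correctly classified, yet the hinge loss still charges 0.5 while the perceptron loss charges nothing, which is the margin effect described above. At m = -2 the exponential loss is already more than twice the hinge loss, which illustrates its sensitivity to outliers.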

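Summary

We have discussed some advice on how to get machine learning working on a new problem. We considered classification, but regression and clustering can be approached similarly. However, the focus on artificial datasets (chosen for ease of understanding) was a bit oversimplifying. In many real-world problems, collecting, organizing, and preprocessing the data is of utmost importance. See, for instance, this article on data wrangling. Pandas is a great tool for this.

Many application domains also come with specific requirements and with tools tailored to those requirements, for example:

Image processing with skimage
Biosignal analysis and general time-series processing with pySPACE
Financial data with pandas

We do not explore these areas in detail; however, finding a good preprocessing pipeline often takes considerably more effort than choosing an appropriate classifier. To get a first impression of a moderately complex signal processing pipeline, take a look at this example, which uses pySPACE to detect a specific event-related potential in EEG data:
https://github.com/pyspace/pyspace/blob/master/docs/examples/specs/node_chains/ref_P300_flow.yaml

The signal processing pipeline comprises data standardization, decimation, band-pass filtering, dimensionality reduction (xDAWN is a supervised dimensionality reduction method), feature extraction (local straight line features), and feature normalization. The following figure gives an overview of the pipeline components available in pySPACE before classification.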

In [30]:

Image(filename='algorithm_types_detailed.png', width=800, height=600)

Out[30]:

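A long-term goal of machine learning, which is pursued among others in the field of deep learning, is to learn large parts of such pipelines rather than hand-engineering them.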

In [31]:

%load_ext watermark
%watermark -a "Jan Hendrik Metzen" -d -v -m -p numpy,scikit-learn

Jan Hendrik Metzen 29/01/2015

CPython 2.7.9
IPython 2.1.0

numpy 1.9.1
scikit-learn 0.14.1

compiler : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
system : Linux
release : 3.16.0-28-generic
machine : x86_64
processor : x86_64
CPU cores : 4
interpreter: 64bit

This post is an IPython notebook. You can download the notebook.
