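- Ways of visualizing your data
- Choosing a machine learning method suitable for the problem at hand
- Identifying and dealing with overfitting and underfitting
- Dealing with large (read: not very small) datasets
- Pros and cons of different loss functions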
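This post is based on Andrew Ng's "Advice for applying Machine Learning". The purpose of this notebook is to illustrate these ideas in an interactive way. Some of the recommendations are debatable; take them as suggestions, not as strict rules.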
In [1]:
import time
import numpy as np
np.random.seed(0)
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [3]:
# Modified from http://scikit-learn.org/stable/auto_examples/plot_learning_curve.html
try:
    from sklearn.model_selection import learning_curve  # scikit-learn >= 0.18
except ImportError:
    from sklearn.learning_curve import learning_curve  # older releases


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=5,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and train learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features)
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 5).
        Specific cross-validation objects can be passed, see the
        sklearn.model_selection module (sklearn.cross_validation in older
        releases) for the list of possible objects.
    """
    plt.figure()
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=1, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # Shaded bands show one standard deviation across the CV folds.
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.legend(loc="best")
    plt.grid(True)
    if ylim:
        plt.ylim(ylim)
    plt.title(title)
In [4]:
from sklearn.datasets import make_classification
X, y = make_classification(1000, n_features=20, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=0)

from pandas import DataFrame
# list(...) keeps the column concatenation valid on both Python 2 and Python 3
df = DataFrame(np.hstack((X, y[:, None])),
               columns=list(range(20)) + ["class"])
In [5]:
df[:5]
           0          1          2          3          4          5          6          7          8          9  ...         11         12         13         14         15         16         17         18         19  class
0  -1.063780   0.676409   1.069356  -0.217580   0.460215  -0.399167  -0.079188   1.209385  -0.785315  -0.172186  ...  -0.993119   0.306935   0.064058  -1.054233  -0.527496  -0.074183  -0.355628   1.057214  -0.902592      0
1   0.070848  -1.695281   2.449449  -0.530494  -0.932962   2.865204   2.435729  -1.618500   1.300717   0.348402  ...   0.225324   0.605563  -0.192101  -0.068027   0.971681  -1.792048   0.017083  -0.375669  -0.623236      1
2   0.940284  -0.492146   0.677956  -0.227754   1.401753   1.231653  -0.777464   0.015616   1.331713   1.084773  ...  -0.050120   0.948386  -0.173428  -0.477672   0.760896   1.001158  -0.069464   1.359046  -1.189590      1
3  -0.299517   0.759890   0.182803  -1.550233   0.338218   0.363241  -2.100525  -0.438068  -0.166393  -0.340835  ...   1.178724   2.831480   0.142414  -0.202819   2.405715   0.313305   0.404356  -0.287546  -2.847803      1
4  -2.630627   0.231034   0.042463   0.478851   1.546742   1.637956  -1.532072  -0.734445   0.465855   0.473836  ...  -1.061194  -0.888880   1.238409  -0.572829  -1.275339   1.003007  -0.477128   0.098536   0.527804      0
In [6]:
# Note: the `size` argument was renamed to `height` in seaborn 0.9
_ = sns.pairplot(df[:50], vars=[8, 11, 12, 14, 19], hue="class", size=1.5)
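In [7]:
plt.figure(figsize=(12, 10))
# Note: corrplot was deprecated and later removed from seaborn;
# sns.heatmap(df.corr()) is the usual replacement in recent releases.
_ = sns.corrplot(df, annot=False)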
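Once we have explored the data using visualization, we can start applying machine learning. Because there are so many machine learning methods, it is often hard to decide which one to try first. This simple cheat sheet (credit goes to Andreas Müller and the sklearn team) can help you select an appropriate machine learning method for your problem (for an alternative cheat sheet, see …).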
In [8]:
from IPython.display import Image
Image(filename='ml_map.png', width=800, height=600)
Out[8]: [image: ml_map.png]
In [9]:
from sklearn.svm import LinearSVC

# Learning curve on 5% to 20% of the data (50 to 200 examples)
plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0)",
                    X, y, ylim=(0.8, 1.01),
                    train_sizes=np.linspace(.05, 0.2, 5))
- Increase the number of training examples
In [10]:
# Same model, now trained on 10% to 100% of the data
plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0)",
                    X, y, ylim=(0.8, 1.1),
                    train_sizes=np.linspace(.1, 1.0, 5))
- Reduce the number of features
In [11]:
# Keep only the two manually chosen features 11 and 14
plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0) Features: 11&14",
                    X[:, [11, 14]], y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))
In [12]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
# SelectKBest(f_classif, k=2) will select the k=2 best features according to their Anova F-value

plot_learning_curve(Pipeline([("fs", SelectKBest(f_classif, k=2)),  # select two features
                              ("svc", LinearSVC(C=10.0))]),
                    "SelectKBest(f_classif, k=2) + LinearSVC(C=10.0)",
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))
- Increase the regularization of the classifier
In [13]:
# A smaller C means stronger regularization for LinearSVC
plot_learning_curve(LinearSVC(C=0.1), "LinearSVC(C=0.1)",
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))
In [14]:
try:
    from sklearn.model_selection import GridSearchCV  # scikit-learn >= 0.18
except ImportError:
    from sklearn.grid_search import GridSearchCV  # older releases

# Let cross-validated grid search pick the regularization strength C
est = GridSearchCV(LinearSVC(),
                   param_grid={"C": [0.001, 0.01, 0.1, 1.0, 10.0]})
plot_learning_curve(est, "LinearSVC(C=AUTO)",
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))
print("Chosen parameter on 100 datapoints: %s" % est.fit(X[:100], y[:100]).best_params_)
Chosen parameter on 100 datapoints: {'C': 0.01}
In [15]:
# The L1 penalty requires the primal formulation (dual=False)
plot_learning_curve(LinearSVC(C=0.1, penalty='l1', dual=False),
                    "LinearSVC(C=0.1, penalty='l1')",
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))
In [16]:
est = LinearSVC(C=0.1, penalty='l1', dual=False)
est.fit(X[:150], y[:150])  # fit on 150 datapoints
print("Coefficients learned: %s" % est.coef_)
print("Non-zero coefficients: %s" % np.nonzero(est.coef_)[1])
Coefficients learned: [[ 0. 0. 0. 0. 0. 0.01857999
0. 0. 0. 0.004135 0. 1.05241369
0.01971419 0. 0. 0. 0. -0.05665314
0.14106505 0. ]]
Non-zero coefficients: [ 5 9 11 12 17 18]
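The indexing in `np.nonzero(est.coef_)[1]` can look cryptic. The sketch below replays it on the coefficient values printed above, copied by hand into a NumPy array (so no model is fitted here), and additionally ranks the surviving features by absolute weight. The names `coef`, `nonzero_features` and `order` are illustrative and not part of the original notebook.

```python
import numpy as np

# Coefficients copied from the printed output; for a binary problem
# LinearSVC stores them with shape (1, n_features).
coef = np.array([[0., 0., 0., 0., 0., 0.01857999,
                  0., 0., 0., 0.004135, 0., 1.05241369,
                  0.01971419, 0., 0., 0., 0., -0.05665314,
                  0.14106505, 0.]])

# np.nonzero returns one index array per axis: (row indices, column indices).
# The column indices are the feature indices, hence the trailing [1].
rows, cols = np.nonzero(coef)
nonzero_features = cols

# Rank the selected features by the magnitude of their weight (largest first).
order = nonzero_features[np.argsort(-np.abs(coef[0, nonzero_features]))]

print("Non-zero features:", nonzero_features)
print("Ranked by |weight|:", order)
```

Only 6 of the 20 weights are non-zero, and feature 11 carries by far the largest weight, which is the kind of sparsity an L1 penalty tends to produce.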
In [17]:
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=1000, random_state=2)
In [18]:
plot_learning_curve(LinearSVC(C=0.25), "LinearSVC(C=0.25)",
                    X, y, ylim=(0.5, 1.0),
                    train_sizes=np.linspace(.1, 1.0, 5))
In [19]:
# list(...) keeps the column concatenation valid on both Python 2 and Python 3
df = DataFrame(np.hstack((X, y[:, None])),
               columns=list(range(2)) + ["class"])
_ = sns.pairplot(df, vars=[0, 1], hue="class", size=3.5)
- Use more or better features (the distance to the origin should be useful!)
In [20]:
# add squared distance from origin as third feature
X_extra = np.hstack((X, X[:, [0]]**2 + X[:, [1]]**2))

plot_learning_curve(LinearSVC(C=0.25), "LinearSVC(C=0.25) + distance feature",
                    X_extra, y, ylim=(0.5, 1.0),
                    train_sizes=np.linspace(.1, 1.0, 5))
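To see why the squared distance helps, here is a minimal NumPy-only sketch. It builds two noise-free concentric circles by hand, with assumed radii 1.0 and 0.8 and the inner circle labelled 1 (intended to mimic the `make_circles` defaults, but generated independently), and shows that a single threshold on the squared distance separates the classes. All names (`X_demo`, `labels`, `threshold`) are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 200

# Assumed setup: label 1 lies on the inner circle (radius 0.8),
# label 0 on the outer circle (radius 1.0), no noise.
labels = rng.randint(0, 2, size=n)
angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
radii = np.where(labels == 1, 0.8, 1.0)
X_demo = np.column_stack((radii * np.cos(angles), radii * np.sin(angles)))

# The extra feature: squared distance from the origin.
sq_dist = X_demo[:, 0] ** 2 + X_demo[:, 1] ** 2

# Any cut between 0.8**2 and 1.0**2 works; take the midpoint.
threshold = (0.8 ** 2 + 1.0 ** 2) / 2.0
pred = (sq_dist < threshold).astype(int)
accuracy = np.mean(pred == labels)

print("Accuracy with one threshold on the squared distance:", accuracy)
```

In the original two coordinates no straight line can separate a ring from the ring around it, but in the augmented space the classes sit at two different values of the new feature, so a linear model has an easy job.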
- Use a more complex model
In [21]:
from sklearn.svm import SVC
# note: we use the original X without the extra feature
plot_learning_curve(SVC(C=2.5, kernel="rbf", gamma=1.0),
                    "SVC(C=2.5, kernel='rbf', gamma=1.0)",
                    X, y, ylim=(0.5, 1.0),
                    train_sizes=np.linspace(.1, 1.0, 5))
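The remedies listed in this section can be condensed into a rough rule of thumb for reading one point on a learning curve. The helper below is only a sketch: the function name `diagnose` and both thresholds (`target_score`, `max_gap`) are arbitrary assumptions chosen for illustration, not values taken from the notebook or from scikit-learn.

```python
def diagnose(train_score, validation_score, target_score=0.9, max_gap=0.05):
    """Return (diagnosis, suggested remedies) for a pair of scores.

    target_score and max_gap are illustrative thresholds; sensible values
    depend on the problem and on the noise in the scores.
    """
    gap = train_score - validation_score
    if gap > max_gap:
        # Training score far above validation score: the model fits noise.
        return "overfitting", [
            "increase the number of training examples",
            "reduce the number of features",
            "increase the regularization of the classifier",
        ]
    if train_score < target_score:
        # Both scores are low and close together: the model is too simple.
        return "underfitting", [
            "use more or better features",
            "use a more complex model",
        ]
    return "ok", []


for scores in [(1.00, 0.85), (0.60, 0.58), (0.95, 0.93)]:
    label, remedies = diagnose(*scores)
    print(scores, label, remedies)
```

A real diagnosis should look at the whole curve and its variance across folds rather than a single pair of numbers, so treat this as a mnemonic, not a test.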
In [22]:
X, y = make_classification(200000, n_features=200, n_informative=25,
                           n_redundant=0, n_classes=10, class_sep=2,
                           random_state=0)
In [23]:
from sklearn.linear_model import SGDClassifier
est = SGDClassifier(penalty="l2", alpha=0.001)
progressive_validation_score = []
train_score = []
for datapoint in range(0, 199000, 1000):
    X_batch = X[datapoint:datapoint+1000]
    y_batch = y[datapoint:datapoint+1000]
    if datapoint > 0:
        # score the new mini-batch before the model has been trained on it
        progressive_validation_score.append(est.score(X_batch, y_batch))
    est.partial_fit(X_batch, y_batch, classes=np.arange(10))
    if datapoint > 0:
        # score the same mini-batch again after training on it
        train_score.append(est.score(X_batch, y_batch))

plt.plot(train_score, label="train score")
plt.plot(progressive_validation_score, label="progressive validation score")
plt.xlabel("Mini-batch")
plt.ylabel("Score")
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x7f6a24e2dfd0>
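The loop above follows a test-then-train pattern, often called progressive validation: every mini-batch is first used to score the current model and only afterwards to update it, so each score is measured on data the model has not yet seen. The sketch below isolates that pattern with a deliberately tiny stand-in learner. `RunningCentroids` is a hypothetical nearest-centroid classifier written for this example (it is not a scikit-learn class), and the two Gaussian blobs are synthetic.

```python
import numpy as np


class RunningCentroids:
    """Toy online classifier: predicts the class with the nearest running mean."""

    def __init__(self, n_classes, n_features):
        self.sums = np.zeros((n_classes, n_features))
        self.counts = np.zeros(n_classes)

    def partial_fit(self, X, y):
        # Accumulate per-class sums and counts from this mini-batch.
        for c in range(len(self.counts)):
            mask = y == c
            self.sums[c] += X[mask].sum(axis=0)
            self.counts[c] += mask.sum()

    def score(self, X, y):
        centroids = self.sums / np.maximum(self.counts, 1)[:, None]
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return float(np.mean(dists.argmin(axis=1) == y))


rng = np.random.RandomState(0)
n, batch_size = 2000, 100
y_stream = rng.randint(0, 2, size=n)
X_stream = rng.normal(size=(n, 2)) + 3.0 * y_stream[:, None]  # two blobs

model = RunningCentroids(n_classes=2, n_features=2)
progressive_scores = []
for start in range(0, n, batch_size):
    X_batch = X_stream[start:start + batch_size]
    y_batch = y_stream[start:start + batch_size]
    if start > 0:
        # 1) test on the unseen batch ...
        progressive_scores.append(model.score(X_batch, y_batch))
    # 2) ... then train on it
    model.partial_fit(X_batch, y_batch)

print("Mean progressive validation score: %.3f" % np.mean(progressive_scores))
```

Because no batch is scored after it has been trained on, the average of these scores is a reasonable estimate of generalization performance without a separate hold-out set.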
In [24]:
from sklearn.datasets import load_digits
digits = load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
print("Dataset consist of %d samples with %d features each" % (n_samples, n_features))

# Plot images of the digits
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))

plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
_ = plt.title('A selection from the 8*8=64-dimensional digits dataset')
In [25]:
# Helper function based on
# http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html#example-manifold-plot-lle-digits-py
from matplotlib import offsetbox

def plot_embedding(X, title=None):
    # Rescale the embedding to the unit square
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure(figsize=(10, 10))
    ax = plt.subplot(111)
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(digits.target[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 12})

    if hasattr(offsetbox, 'AnnotationBbox'):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # just something big
        for i in range(digits.data.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)
In [26]:
from sklearn import (manifold, decomposition, random_projection)
rp = random_projection.SparseRandomProjection(n_components=2, random_state=42)

stime = time.time()
X_projected = rp.fit_transform(X)
plot_embedding(X_projected,
               "Random Projection of the digits (time: %.3fs)" % (time.time() - stime))
In [27]:
# Start the timer before fitting so the reported time includes the projection
stime = time.time()
X_pca = decomposition.TruncatedSVD(n_components=2).fit_transform(X)
plot_embedding(X_pca,
               "Principal Components projection of the digits (time: %.3fs)" % (time.time() - stime))
In [28]:
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)

stime = time.time()
X_tsne = tsne.fit_transform(X)
plot_embedding(X_tsne,
               "t-SNE embedding of the digits (time: %.3fs)" % (time.time() - stime))
In [29]:
# adapted from http://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_loss_functions.html
xmin, xmax = -4, 4
xx = np.linspace(xmin, xmax, 100)
plt.plot([xmin, 0, 0, xmax], [1, 1, 0, 0], 'k-',
         label="Zero-one loss")
plt.plot(xx, np.where(xx < 1, 1 - xx, 0), 'g-',
         label="Hinge loss")
plt.plot(xx, np.log2(1 + np.exp(-xx)), 'r-',
         label="Log loss")
plt.plot(xx, np.exp(-xx), 'c-',
         label="Exponential loss")
plt.plot(xx, -np.minimum(xx, 0), 'm-',
         label="Perceptron loss")

# the balanced relative margin machine
#R = 2
#plt.plot(xx, np.where(xx < 1, 1 - xx, (np.where(xx > R, xx-R,0))), 'b-',
#         label="L1 Balanced Relative Margin Loss")

plt.ylim((0, 8))
plt.legend(loc="upper right")
plt.xlabel(r"Decision function $f(x)$")
plt.ylabel("$L(y, f(x))$")
<matplotlib.text.Text at 0x7f6a2879cf90>
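- Zero-one loss is what you actually want in classification. Unfortunately, it is non-convex, which makes the optimization problem more or less intractable, so it is not practical.
- Hinge loss (used in support vector classification) leads to solutions that are sparse in the data (because it is zero for $f(x) > 1$) and is relatively robust to outliers (it grows only linearly as $f(x) \to -\infty$). It does not provide well-calibrated probabilities.
- Log loss (used, for example, in logistic regression) leads to well-calibrated probabilities. If you need not only binary predictions but also probabilities for the outcomes, log loss is therefore a good choice. On the downside, its solutions are not sparse in the data space, and it is more affected by outliers than hinge loss.
- Exponential loss (used in AdaBoost) is very susceptible to outliers (because it increases rapidly as $f(x) \to -\infty$). It is used primarily in AdaBoost, where it results in a simple and efficient boosting algorithm.
- Perceptron loss is basically a shifted version of hinge loss. Hinge loss also penalizes points that are on the correct side of the boundary but very close to it (max-margin principle). Perceptron loss, on the other hand, is satisfied as long as a data point is on the correct side of the boundary. This leaves the boundary under-determined if the data is linearly separable and results in worse generalization than a max-margin boundary.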
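The trade-offs above are easier to compare with concrete numbers. The sketch below writes each loss as a function of the margin $m = y \cdot f(x)$ with $y \in \{-1, +1\}$, an assumption that matches the formulas used in the plotting cell, where the x-axis plays the role of the margin. The function names are illustrative, and counting $m = 0$ as correct in the zero-one loss is an arbitrary convention.

```python
import math


def zero_one_loss(m):
    """1 for a misclassified point (negative margin), 0 otherwise."""
    return 1.0 if m < 0 else 0.0


def hinge_loss(m):
    """Zero once the margin exceeds 1, linear growth below that."""
    return max(0.0, 1.0 - m)


def log_loss(m):
    """Base-2 logistic loss, never exactly zero."""
    return math.log2(1.0 + math.exp(-m))


def exponential_loss(m):
    """Grows exponentially for badly misclassified points."""
    return math.exp(-m)


def perceptron_loss(m):
    """Hinge loss shifted by one: zero as soon as the point is on the correct side."""
    return max(0.0, -m)


losses = [zero_one_loss, hinge_loss, log_loss, exponential_loss, perceptron_loss]
for m in (-4.0, 0.5, 2.0):
    # m = -4: an outlier; m = 0.5: correct but inside the margin; m = 2: safely correct
    print("margin %+.1f:" % m,
          ", ".join("%s=%.3f" % (f.__name__, f(m)) for f in losses))
```

At a margin of 0.5 the hinge loss is still positive while the perceptron loss is already zero, which is the max-margin effect described in the last bullet. At a margin of -4 the exponential loss is roughly ten times the hinge loss, which illustrates its sensitivity to outliers.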
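We have discussed some recommendations on how to get machine learning working on a new problem. We looked at classification problems, but regression and clustering can be approached similarly. However, focusing on artificial datasets (for ease of understanding) is somewhat oversimplified. In many real problems, collecting, organizing, and preprocessing the data is of the utmost importance. See the referenced article on data wrangling for examples. pandas is a great tool for this.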
- Image processing with skimage
- Biosignal analysis and general time series processing with pySPACE
- Financial data with pandas
In [30]:
Image(filename='algorithm_types_detailed.png', width=800, height=600)
In [31]:
%load_ext watermark
%watermark -a "Jan Hendrik Metzen" -d -v -m -p numpy,scikit-learn
Jan Hendrik Metzen 29/01/2015
CPython 2.7.9
IPython 2.1.0
numpy 1.9.1
scikit-learn 0.14.1
compiler : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
system : Linux
release : 3.16.0-28-generic
machine : x86_64
processor : x86_64
CPU cores : 4
interpreter: 64bit