Machine Learning Practice in Python and Kaggle Competitions

Posted by bbzz2 on 2017-08-15

(Reposted from) https://mlnote.wordpress.com/2015/12/16/python%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E5%AE%9E%E8%B7%B5%E4%B8%8Ekaggle%E5%AE%9E%E6%88%98-machine-learning-for-kaggle-competition-in-python/

Author: Miao Fan (範淼), Ph.D. candidate on Computer Science.

Affiliation: Tsinghua University / New York University

Email: fanmiao.cslt.thu@gmail.com

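Note:

The material below consists of my notes on, and some extensions to, the book "Learning scikit-learn: Machine Learning in Python", written as I went along. I have followed the scikit-learn package for two years. It is developing quickly, and because its license permits commercial use, it looks promising.

For practitioners of machine learning, this is a good introductory book. There is currently no Chinese translation, so I am going first. In my opinion, anyone who is reasonably fluent with the mature models in scikit-learn and with hyperparameter optimization (which arguably matters even more in engineering practice) can reach the top 25% in most Kaggle competitions.

The code links in these notes currently point to local files; I will upload them to GitHub soon.

In fairness, these models only pay off in real competitions once you have accumulated experience using them. An understanding of hyperparameters and of each model's limitations, in particular, is probably not something a book can teach.

There are also other packages worth trying that have repeatedly won Kaggle competitions, such as XGBoost and gensim. Whether TensorFlow can win Kaggle prize money is something I still need time to find out.

I have also recently contributed to and proofread several chapters of the excellent new book "Deep Learning", and my exchanges with its three authors have been inspiring. Anyone interested is welcome to email me and join in writing the Chinese notes.

If you repost this, please credit the source. Thank you.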

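Platform: I recommend practicing on the Anaconda distribution (https://www.continuum.io/downloads). Other packages can be added on top of it, it installs on nearly all common operating systems, and it settles complicated configuration problems in one go.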

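After returning to China, as a newcomer who had never touched an Apple computer or its operating system, I also bought a 27-inch iMac as a treat :) (off topic).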

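Because the code below was written in an IPython environment, some cells rely on IPython echoing the value of the last expression instead of calling print; please keep this in mind. The In[*]/Out[*] markers are also specific to IPython. The code uses Python 2 syntax, in which print is a statement.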

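These notes cover four aspects of practical machine learning in Python: supervised learning, unsupervised learning, feature and model selection, and the use of several popular, powerful model packages.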

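I like to sum things up in a few sentences. For Kaggle tasks, I think the machine learning workflow comes down to a few fixed steps (not counting the decisive, task-specific analysis). Following this workflow with scikit-learn and pandas will generally land you in the top 25%: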

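1) Read the csv or tsv file with pandas (Kaggle data is generally clean).

2) With few features: impute missing values; feature_extraction (DictVectorizer, TfidfVectorizer, and so on, depending on the data type, since text, images, and audio are all handled differently); feature_selection; grid search for the best hyperparameters (model selection); ensemble learning (or otherwise combining the outputs of many learners); then predict or predict_proba (depending on whether the submission requires class labels or class probabilities, which makes a big difference).

3) With many features: impute missing values; feature_extraction (DictVectorizer, TfidfVectorizer, and so on, depending on the data type); dimensionality reduction (PCA, RBM, and so on); feature_selection (if still needed after the reduction); ensemble learning (or otherwise combining the outputs of many learners); then predict or predict_proba (again depending on whether the submission requires class labels or class probabilities).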
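The steps above can be sketched end to end as follows. This is a minimal illustration rather than the author's code: the toy table stands in for a file read with pd.read_csv, the column names and the grid of C values are made up, and it assumes Python 3 with a recent scikit-learn (where GridSearchCV lives in sklearn.model_selection).

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy table standing in for pd.read_csv('train.csv').
df = pd.DataFrame({
    'age':   [22.0, None, 35.0, 54.0, None, 4.0, 41.0, 29.0, 63.0, 18.0, 47.0, 8.0],
    'sex':   ['m', 'f', 'f', 'm', 'm', 'f', 'm', 'f', 'm', 'f', 'm', 'f'],
    'label': [0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Step 1: impute the missing numeric values with the column mean.
X = df[['age', 'sex']].copy()
X['age'] = X['age'].fillna(X['age'].mean())
y = df['label']

# Step 2: one dict per row, so DictVectorizer can one-hot encode the strings.
records = X.to_dict(orient='records')

# Steps 3-5: feature extraction, feature selection and the model in one pipeline,
# with the hyperparameter grid searched by cross-validation.
pipeline = Pipeline([
    ('vec', DictVectorizer(sparse=False)),
    ('select', SelectPercentile(chi2, percentile=100)),
    ('clf', LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(pipeline, {'clf__C': [0.1, 1.0, 10.0]}, cv=3)
search.fit(records, y)

# Step 6: hard labels or class probabilities, whichever the submission needs.
labels = search.predict(records)
probabilities = search.predict_proba(records)
print(search.best_params_)
```

Keeping every step inside one Pipeline means the grid search refits the vectorizer and the selector on each training fold, so nothing from the validation fold leaks into preprocessing.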

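1. Supervised Learning

~~~~~~~~~~~~~~~~~~~~~~~~~~~

1.1 Linear Classifiers

We practice linear classification on the Iris flower dataset that ships with scikit-learn. Among linear classifiers, Logistic Regression is widely used and is well suited to probability estimation, that is, assigning an estimated probability to each class. Kaggle competitions often require this. [Source Code]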

In [1]:

from sklearn.datasets import load_iris
# Note: sklearn.cross_validation was replaced by sklearn.model_selection in later scikit-learn releases
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing

# Load the data
iris = load_iris()

# Separate features and labels
X_iris, y_iris = iris.data, iris.target

# Use only the first two columns as features
X, y = X_iris[:, :2], y_iris

# Hold out 25% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 33)

# Standardize the features; this matters a lot but is often overlooked by competitors
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.linear_model import SGDClassifier

# Use the SGD classifier, which suits large-scale data and estimates parameters by stochastic gradient descent
clf = SGDClassifier()

clf.fit(X_train, y_train)

# Import the evaluation module
from sklearn import metrics

y_train_predict = clf.predict(X_train)

# In-sample check: accuracy on the training samples
print metrics.accuracy_score(y_train, y_train_predict)

# Standard out-of-sample check: accuracy on the test samples
y_predict = clf.predict(X_test)
print metrics.accuracy_score(y_test, y_predict)

0.660714285714
0.684210526316

In [2]:

# For a more detailed performance report (precision, recall, f1-score per class), use the following function.
print metrics.classification_report(y_test, y_predict, target_names = iris.target_names)

             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00         8
 versicolor       0.43      0.27      0.33        11
  virginica       0.65      0.79      0.71        19

avg / total       0.66      0.68      0.66        38

In [4]:

# To examine SGDClassifier's performance more thoroughly we should make full use of the data: split it into N parts and use each part once to test the model.

from sklearn.cross_validation import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# A Pipeline keeps model construction compact; before fit, data generally needs steps such as feature_extraction and preprocessing.
# Here we use the default parameter settings
clf = Pipeline([('scaler', StandardScaler()), ('sgd_classifier', SGDClassifier())])

# 5-fold cross-validation over the whole dataset
cv = KFold(X.shape[0], 5, shuffle=True, random_state = 33)

scores = cross_val_score(clf, X, y, cv=cv)
print scores

# Overall performance: mean accuracy and standard deviation
print scores.mean(), scores.std()

from scipy.stats import sem
import numpy as np

# The spread computed here is the standard error of the mean, which is not the same as the standard deviation; see
# http://www.graphpad.com/guides/prism/6/statistics/index.htm?stat_semandsdnotsame.htm
print np.mean(scores), sem(scores)

[ 0.56666667  0.73333333  0.83333333  0.76666667  0.8       ]
0.74 0.0928559218479
0.74 0.0464279609239
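The two spreads printed above differ because they are different statistics: numpy's std() divides by n by default, while scipy.stats.sem divides the sample standard deviation (n - 1 in the denominator) by sqrt(n). A small standard-library check on the rounded scores shown above (Python 3 syntax) reproduces both numbers:

```python
import math
import statistics

# The five fold accuracies printed above, rounded as shown.
scores = [0.56666667, 0.73333333, 0.83333333, 0.76666667, 0.8]

mean_score = statistics.mean(scores)
# Population standard deviation (divide by n), numpy's default for scores.std().
population_std = statistics.pstdev(scores)
# Standard error of the mean: sample standard deviation (divide by n - 1) over sqrt(n),
# which is what scipy.stats.sem computes by default.
standard_error = statistics.stdev(scores) / math.sqrt(len(scores))

print(round(mean_score, 4), round(population_std, 4), round(standard_error, 4))
# prints: 0.74 0.0929 0.0464
```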

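To summarize: there are several linear classifiers, and scikit-learn also implements logistic regression (LogisticRegression). Compared with the SGD classifier, it solves the optimization problem more precisely but takes more time. The SGD classifier gives a rough picture of how these linear classifiers perform, but because its parameters are estimated approximately, its results are not very stable, and reaching the model's best performance requires tuning its hyperparameters.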
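For comparison, the probability estimates mentioned at the start of this section can be obtained from LogisticRegression with predict_proba. The following is a sketch under assumptions, not the author's code: it uses Python 3, a recent scikit-learn (sklearn.model_selection), and the same two Iris features and split settings as above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
# Same setup as the SGD example: first two features, 25% held out.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data[:, :2], iris.target, test_size=0.25, random_state=33)

# Scaling and the classifier in one estimator, so the scaler is fit on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# One row per test sample, one column per class; each row sums to 1.
class_probabilities = model.predict_proba(X_test)
print(class_probabilities[:3].round(3))
print(model.score(X_test, y_test))
```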

~~~~~~~~~~~~~~~~~~~~~~~~~

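1.2 SVM Classifier

In this part we explore the support vector machine. It is a strong classifier that usually outperforms ordinary linear classifiers, and in its basic form it also rests on a linear assumption. With the kernel trick, however, features can be mapped into a higher-dimensional space, so that a linear separator there corresponds to a nonlinear boundary in the original space and the data becomes more separable. In addition, an SVM keeps only a small number of support vectors as the evidence that defines the separating hyperplane. So even with high-dimensional, nonlinear mappings it does not consume much memory; the cost lies in the CPU time needed to find those support vectors. Finally, this classifier is suited to direct classification rather than to estimating class probabilities. [Source Code]

Here we use the classic AT&T dataset of 400 face images: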

In [1]:

from sklearn.datasets import fetch_olivetti_faces

# This data is not bundled with the package; fetch_* functions like this one download it on demand
faces = fetch_olivetti_faces()

In [2]:

# This shows that the data is stored in a dict-like structure, the same format used by most of the bundled experimental datasets
faces.keys()

Out[2]:

['images', 'data', 'target', 'DESCR']

In [3]:

# Use the shape attribute to check the size of the data
print faces.data.shape
print faces.target.shape

(400L, 4096L)
(400L,)

In [4]:

from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC

# Split the data as before, with 25% held out for testing
X_train, X_test, y_train, y_test = train_test_split(faces.data, faces.target, test_size=0.25, random_state=0)

In [5]:

from sklearn.cross_validation import cross_val_score, KFold
from scipy.stats import sem

# Build a helper function for cross-validating a model
def evaluate_cross_validation(clf, X, y, K):
    # KFold takes: number of samples, number of folds, whether to shuffle
    cv = KFold(len(y), K, shuffle=True, random_state = 0)
    # Cross-validate with the split above; for classification the default score is accuracy, but it can be changed
    scores = cross_val_score(clf, X, y, cv=cv)
    print scores
    print 'Mean score: %.3f (+/-%.3f)' % (scores.mean(), sem(scores))

# SVC with a linear kernel (other kernels, discussed later, can give very different results)
svc_linear = SVC(kernel='linear')
# 5-fold cross-validation, K = 5
evaluate_cross_validation(svc_linear, X_train, y_train, 5)

[ 0.93333333  0.86666667  0.91666667  0.93333333  0.91666667]
Mean score: 0.913 (+/-0.012)
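The effect of the kernel trick described at the start of this section is easier to see on data that no straight line can separate. The sketch below is an illustration under assumptions, not part of the original notes: it uses Python 3, a recent scikit-learn, and a synthetic two-circles dataset from make_circles.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: the classes are not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Same classifier, two kernels, 5-fold cross-validated accuracy.
linear_scores = cross_val_score(SVC(kernel='linear'), X, y, cv=5)
rbf_scores = cross_val_score(SVC(kernel='rbf'), X, y, cv=5)
print('linear kernel: %.3f' % linear_scores.mean())
print('rbf kernel:    %.3f' % rbf_scores.mean())

# Only the support vectors are kept to define the decision boundary.
rbf_model = SVC(kernel='rbf').fit(X, y)
n_support = int(rbf_model.n_support_.sum())
print('support vectors: %d of %d samples' % (n_support, len(X)))
```

If class probabilities are required, SVC accepts probability=True, which fits an additional calibration step on top of the decision function at extra training cost.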

~~~~~~~~~~~~~~~~~~~~~~~~~

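1.3 Naive Bayes Classifier

In this part we look at the naive Bayes classifier. Extensive experiments have shown that it performs well on text classification. One possible reason is that, for tasks such as email filtering, the text features that distinguish the classes are fairly independent of one another, which happens to match the model's assumption of feature independence. [Source Code]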

In [1]:

from sklearn.datasets import fetch_20newsgroups

In [2]:

# Like the face dataset above, the 20 newsgroups data is downloaded on demand by a fetch function
news = fetch_20newsgroups(subset='all')

In [9]:

# Inspect the data; it again uses a dict-like format, with 18846 samples in total
print len(news.data), len(news.target)
print news.target

18846 18846
[10  3 17 ...,  3  1  7]

In [4]:

# Check the newsgroup categories and how many there are
print news.target_names
print news.target_names.__len__()

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
20

In [5]:

# Again, hold out 25% of the data to test the model
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25)

In [6]:

print X_train.__len__()
print y_train.__len__()
print X_test.__len__()

14134
14134
4712

In [13]:

# Much raw data cannot be fed to a classifier directly: images can use pixel values as they are, but text must first be converted to numeric features
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import *
from scipy.stats import sem
# With the naive Bayes classifier as the base, compare several feature extraction methods; Pipeline simplifies the training workflow
clf_1 = Pipeline([('count_vec', CountVectorizer()), ('mnb', MultinomialNB())])
# non_negative=True keeps the hashed features non-negative, which MultinomialNB requires
clf_2 = Pipeline([('hash_vec', HashingVectorizer(non_negative=True)), ('mnb', MultinomialNB())])
clf_3 = Pipeline([('tfidf_vec', TfidfVectorizer()), ('mnb', MultinomialNB())])

# Build a helper function for cross-validating a model
def evaluate_cross_validation(clf, X, y, K):
    # KFold takes: number of samples, K, whether to shuffle
    cv = KFold(len(y), K, shuffle=True, random_state = 0)
    # Cross-validate with the split above; for classification the default score is accuracy, but it can be changed
    scores = cross_val_score(clf, X, y, cv=cv)
    print scores
    print 'Mean score: %.3f (+/-%.3f)' % (scores.mean(), sem(scores))

In [14]:

clfs = [clf_1, clf_2, clf_3]
for clf in clfs:
    evaluate_cross_validation(clf, X_train, y_train, 5)

[ 0.83516095  0.83374602  0.84471171  0.83622214  0.83227176]
Mean score: 0.836 (+/-0.002)
[ 0.76052352  0.72727273  0.77538026  0.74778918  0.75194621]
Mean score: 0.753 (+/-0.008)
[ 0.84435798  0.83409975  0.85496993  0.84082066  0.83227176]
Mean score: 0.841 (+/-0.004)

In [15]:

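# The results above show that the two commonly used extractors (counts and TF-IDF) perform about equally well. We pick one of them and improve it further by filtering features more carefully (removing English stop words).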
clf_4 = Pipeline([('tfidf_vec_adv', TfidfVectorizer(stop_words='english')), ('mnb', MultinomialNB())])
evaluate_cross_validation(clf_4, X_train, y_train, 5)

[ 0.87053414  0.86664308  0.887867    0.87371772  0.86553432]
Mean score: 0.873 (+/-0.004)

In [16]:

# Tuning the naive Bayes smoothing parameter alpha may raise performance further.
clf_5 = Pipeline([('tfidf_vec_adv', TfidfVectorizer(stop_words='english')), ('mnb', MultinomialNB(alpha=0.01))])
evaluate_cross_validation(clf_5, X_train, y_train, 5)

[ 0.90060134  0.89741776  0.91651928  0.90909091  0.90410474]
Mean score: 0.906 (+/-0.003)
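Why a smaller alpha can help is visible in the additive smoothing formula that multinomial naive Bayes uses for the per-class word probabilities, (count + alpha) / (total + alpha * vocabulary_size). The counts below are hypothetical and chosen only to show the effect with a large vocabulary; the sketch uses Python 3 and the standard library only.

```python
def smoothed_probability(word_count, total_count, vocabulary_size, alpha):
    """Additive (Lidstone) smoothing of a word probability within one class."""
    return (word_count + alpha) / (total_count + alpha * vocabulary_size)

# Hypothetical class with 1000 word occurrences and a 5000-word vocabulary.
total_count, vocabulary_size = 1000, 5000

# A word seen 50 times: with alpha=1 the 5000 pseudo-counts swamp the real evidence.
seen_default = smoothed_probability(50, total_count, vocabulary_size, alpha=1.0)
seen_small = smoothed_probability(50, total_count, vocabulary_size, alpha=0.01)

# A word never seen in this class still gets a non-zero probability.
unseen_default = smoothed_probability(0, total_count, vocabulary_size, alpha=1.0)
unseen_small = smoothed_probability(0, total_count, vocabulary_size, alpha=0.01)

print(seen_default, seen_small)      # 0.0085 versus about 0.0476
print(unseen_default, unseen_small)  # about 1.7e-04 versus about 9.5e-06
```

With alpha=1 the seen and unseen words end up much closer together than the raw counts suggest; alpha=0.01 keeps the estimate near the observed frequency of 0.05 while still avoiding zero probabilities.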

~~~~~~~~~~~~~~~~~~~~~~~~~

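1.4 Decision Tree Classifier / Tree Ensembles

The classifiers so far mostly share the following shortcomings:

a) Linear classifiers assume a linear relationship between the features and the class, so nonlinear relationships are hard for them to capture. In the Titanic data, for example, if we assume that age is positively correlated with survival, then the older a passenger is, the more likely they are to survive. The data shows this assumption is wrong: the correlation is not positive, and it was the elderly and children who were more likely to survive. Here the linear assumption does not fully hold, so a nonlinear classifier is needed.

b) Even with a classifier like the SVM, it is hard to obtain an explicit statement of the basis for a classification. We cannot explain how the classifier works, least of all in terms of human logic. High dimensionality and lack of interpretability are both drawbacks.

Decision tree classifiers address both problems. We use the Titanic dataset (records of the Titanic's passengers and their rescue) to build a classifier that predicts whether a passenger was rescued. [Source Code]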

In [1]:

# To make data handling easier, we bring in a new package, pandas

import pandas as pd
import numpy as np

titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')

In [2]:

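# Take a look at the data: it contains all kinds of features (numeric, categorical, string), and some values are missing.
titanic.head()

# pandas loads the data into its own DataFrame format (a 2-D table); info() shows the basic properties of the data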
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1313 entries, 0 to 1312
Data columns (total 11 columns):
row.names    1313 non-null int64
pclass       1313 non-null object
survived     1313 non-null int64
name         1313 non-null object
age          633 non-null float64
embarked     821 non-null object
home.dest    754 non-null object
room         77 non-null object
ticket       69 non-null object
boat         347 non-null object
sex          1313 non-null object
dtypes: float64(1), int64(2), object(8)
memory usage: 123.1+ KB

In [4]:

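# This is real data on the Titanic's passengers and their boarding details, which lets us predict whether each passenger survived.
# There are 1313 records. Some features are complete (pclass, survived, name) and some have missing values; some are numeric (age: float64) and others are strings.
# A step that beginners tend to undervalue, and that is time-consuming but very important, is choosing features, which relies on background knowledge. From what we know about the disaster, sex, age, and pclass are all likely to be key factors in survival.

# we keep pclass, age, sex.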

X = titanic[['pclass', 'age', 'sex']]
y = titanic['survived']

In [5]:

X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
age       633 non-null float64
sex       1313 non-null object
dtypes: float64(1), object(2)
memory usage: 41.0+ KB

In [6]:

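# The data needs some processing:
# 1) the age column has only 633 non-null values and must be filled in
# 2) sex and pclass are categorical and must be converted to numeric 0/1 features

# First fill in age; the mean or the median are simple choices that tend to distort the model least
# (pandas warns below because X is a slice of titanic; taking X = titanic[['pclass', 'age', 'sex']].copy() avoids the warning)
X['age'].fillna(X['age'].mean(), inplace=True)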

C:\Anaconda2\lib\site-packages\pandas\core\generic.py:2748: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)

In [7]:

X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
age       1313 non-null float64
sex       1313 non-null object
dtypes: float64(1), object(2)
memory usage: 41.0+ KB

In [8]:

from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 33)

# Use feature_extraction from scikit-learn
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
print vec.feature_names_
# Each categorical feature is split out into one column per value, while numeric features are left unchanged

['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']
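What DictVectorizer does is easier to see on two hand-written rows. The rows below are made up for illustration (they only reuse the column names from above), and the sketch uses Python 3 syntax.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical passengers, one dict per row.
rows = [
    {'age': 29.0, 'pclass': '1st', 'sex': 'female'},
    {'age': 2.0, 'pclass': '3rd', 'sex': 'male'},
]

vec = DictVectorizer(sparse=False)
matrix = vec.fit_transform(rows)

# String-valued features become one 0/1 column per observed value; numbers pass through.
print(vec.feature_names_)
# ['age', 'pclass=1st', 'pclass=3rd', 'sex=female', 'sex=male']
print(matrix)
# [[29.  1.  0.  1.  0.]
#  [ 2.  0.  1.  0.  1.]]
```

Only values seen during fit get a column, which is why 'pclass=2nd' is absent here; values first seen at transform time are silently ignored.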

In [9]:

X_test = vec.transform(X_test.to_dict(orient='records'))
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
dtc.fit(X_train, y_train)
dtc.score(X_test, y_test)

Out[9]:

0.79331306990881456

In [10]:

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(max_depth=3, min_samples_leaf=5)
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)

Out[10]:

0.77203647416413379

In [11]:

from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(max_depth=3, min_samples_leaf=5)

gbc.fit(X_train, y_train)
gbc.score(X_test, y_test)

Out[11]:

0.79027355623100304

In [13]:

from sklearn.metrics import classification_report

y_predict = gbc.predict(X_test)
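# Note: classification_report expects (y_true, y_pred); the call below passes them in the opposite order,
# so in the table that follows precision and recall are swapped and support counts the predictions.
print classification_report(y_predict, y_test)
# This function produces a per-class performance report (precision, recall, f1-score).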

             precision    recall  f1-score   support

          0       0.93      0.78      0.84       241
          1       0.57      0.83      0.68        88

avg / total       0.83      0.79      0.80       329
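The two advantages claimed for decision trees at the start of this section, handling a non-monotonic relationship and producing readable rules, can be shown on synthetic data. Everything in this sketch is an assumption made for illustration: the ages are random, the rule "under 12 or over 65 survives" is invented, and the accuracies are measured on the training data only to show what each model can represent. It assumes Python 3 and a scikit-learn version that provides sklearn.tree.export_text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.RandomState(0)
age = rng.uniform(0, 80, size=400)
# Invented rule for illustration only: children and the elderly survive.
survived = ((age < 12) | (age > 65)).astype(int)
X = age.reshape(-1, 1)

# A linear model can only place a single threshold on age.
linear = LogisticRegression(max_iter=1000).fit(X, survived)
# A depth-2 tree can place two thresholds.
tree = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0).fit(X, survived)

linear_accuracy = linear.score(X, survived)
tree_accuracy = tree.score(X, survived)
print(linear_accuracy, tree_accuracy)

# The fitted tree can be printed as nested if/else rules.
rules = export_text(tree, feature_names=['age'])
print(rules)
```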

~~~~~~~~~~~~~~~~~~~~~~~~~

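1.5 Regression

Regression and classification both belong to supervised learning. The only difference is that a regression target lies in a continuous range of real numbers, such as a house price or a stock price, whereas classification predicts one of a finite set of discrete classes. The boundary between the two is not always sharp, and one can be converted into the other where appropriate. A classic example is predicting wine quality, graded on roughly 10 levels: the target can be viewed either way. The prediction can be continuous on the interval (0, 10] (regression), or it can be one of the 10 grades (classification).

Here we take the problem of predicting house prices in the Boston area of the United States, a classic regression problem, and try a series of regression models one by one to improve performance step by step. [Source Code]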

In [1]:

# First load the house price data
from sklearn.datasets import load_boston

boston = load_boston()

# Check the size of the data
print boston.data.shape

(506L, 13L)

In [2]:

# It is a good habit to understand what each feature means
print boston.feature_names
print boston.DESCR

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

In [3]:

# An extra step: check whether the target has been normalized; usually it has not
import numpy as np

print np.max(boston.target)
print np.min(boston.target)
print np.mean(boston.target)

50.0
5.0
22.5328063241

In [4]:

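from sklearn.cross_validation import train_test_split
# As before, split the data
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size = 0.25, random_state=33)

from sklearn.preprocessing import StandardScaler

# Standardizing keeps features with very different ranges from producing weights on very different scales
scalerX = StandardScaler().fit(X_train)
X_train = scalerX.transform(X_train)
X_test = scalerX.transform(X_test)

# StandardScaler expects 2-D input, so reshape the 1-D target to a column and flatten it back afterwards
scalery = StandardScaler().fit(y_train.reshape(-1, 1))
y_train = scalery.transform(y_train.reshape(-1, 1)).ravel()
y_test = scalery.transform(y_test.reshape(-1, 1)).ravel()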

In [5]:

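# Write the evaluation routine first. It again uses 5-fold cross-validation, but the metric is no longer accuracy; it is R^2 (the coefficient of determination), which roughly measures the fraction of the target's variance that the regressor explains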
from sklearn.cross_validation import *

def train_and_evaluate(clf, X_train, y_train):
    cv = KFold(X_train.shape[0], 5, shuffle=True, random_state=33)
    scores = cross_val_score(clf, X_train, y_train, cv=cv)
    print 'Average coefficient of determination using 5-fold cross validation:', np.mean(scores)
    
# Finally, let us see which regression models are available (there are in fact many more).
# Three of them are fairly representative
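The R^2 score used by the evaluation routine above is 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the targets from their mean. A hand-rolled version on made-up numbers (Python 3, standard library only) shows the two reference points: 1.0 for perfect predictions and 0.0 for always predicting the mean.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up targets with mean 6.0.
y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2_score_manual(y_true, y_true)                # 1.0
baseline = r2_score_manual(y_true, [6.0] * 4)            # 0.0, always predicting the mean
close = r2_score_manual(y_true, [2.5, 5.5, 7.5, 8.5])    # 0.95, each prediction off by 0.5

print(perfect, baseline, close)
```

A model that does worse than predicting the mean gets a negative score, so R^2 is not bounded below by zero.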

In [7]:

# Start with a linear model, SGDRegressor
from sklearn import linear_model
# There is a regularization option, penalty; with only 13 features it probably makes little difference
clf_sgd = linear_model.SGDRegressor(loss='squared_loss', penalty=None, random_state=42)
train_and_evaluate(clf_sgd, X_train, y_train)

Average coefficient of determination using 5-fold cross validation: 0.710809853468

In [8]:

# Switch SGDRegressor's penalty to l2; the result barely changes, since with so few features regularization matters little
clf_sgd_l2 = linear_model.SGDRegressor(loss='squared_loss', penalty='l2', random_state=42)
train_and_evaluate(clf_sgd_l2, X_train, y_train)

Average coefficient of determination using 5-fold cross validation: 0.71081206667

In [9]:

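# Next, try the SVM regressor (all default parameters)
from sklearn.svm import SVR
# The linear kernel brings no improvement, but since there are few features we can consider mapping to a higher dimension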
clf_svr = SVR(kernel='linear')
train_and_evaluate(clf_svr, X_train, y_train)

Average coefficient of determination using 5-fold cross validation: 0.707838419194

In [11]:

clf_svr_poly = SVR(kernel='poly')
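# Raising the dimensionality helps noticeably, but use this trick with care: with many features the CPU cost becomes prohibitive, although memory is a minor concern. By this point we can no longer directly interpret what the features mean.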
train_and_evaluate(clf_svr_poly, X_train, y_train)

Average coefficient of determination using 5-fold cross validation: 0.779288545488

In [12]:

clf_svr_rbf = SVR(kernel='rbf')
# The RBF (radial basis function) kernel does even better!
train_and_evaluate(clf_svr_rbf, X_train, y_train)

Average coefficient of determination using 5-fold cross validation: 0.833662221567

In [14]:

# Now something stronger: extremely randomized trees (ExtraTreesRegressor)!
from sklearn import ensemble
clf_et = ensemble.ExtraTreesRegressor()
train_and_evaluate(clf_et, X_train, y_train)

Average coefficient of determination using 5-fold cross validation: 0.853006383633

In [15]:

# Finally, check performance on the test set
clf_et.fit(X_train, y_train)
clf_et.score(X_test, y_test)

Out[15]:

0.83781467779895469

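To sum up, this example conveys the satisfaction of steady progress in machine learning, with model performance improving step by step. It also shows that tuning a hyperparameter (here, the SVR kernel) sometimes helps more than switching to a different model. This sets the stage for the practical topics of feature selection and model selection that follow.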

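2. Unsupervised Learning

~~~~~~~~~~~~~~~~~~~~~~~~~~~

Unsupervised learning differs from supervised learning in that there is no prediction or learning target.

Data for such problems is usually far more plentiful, largely because supervised learning often needs humans to do the labeling that "teaches" the computer to predict. Sometimes that labeling even requires specialists: for a dataset like Iris, a layperson could not tell the species apart, so experts really are needed.

If supervised learning has two basic types, classification and regression (in fact there are more complex problems such as sequence labeling), then unsupervised learning includes clustering, dimensionality reduction, and similar problems.

In supervised learning we train model parameters using the feedback provided by labels; unsupervised learning instead looks for commonalities, or patterns, among the data's features themselves. Clustering, for example, discovers groups in the data by looking for similar feature representations. Dimensionality reduction or compression picks out representative features, removing much of the redundancy and noise while preserving the data's variance, though the process may well lose some useful pattern information.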

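2.1 Principal Component Analysis (PCA for dimensionality reduction)

~~~~~~~~~~~~~~~~~~~~~

Let us start with a small example. This is entirely my own understanding, and one I often use to explain dimensionality reduction, information redundancy, and what PCA does. Suppose we have a 2 x 2 dataset [(1, 2), (2, 4)], and both points belong to the same class (classification) or the same cluster (clustering). If our learner is a linear model, these two points are really only worth one update of the weights, because they are linearly dependent: every feature value has simply been scaled by the same factor. In PCA terms, the rank of this matrix is 1; that is, in terms of variability the matrix has only one degree of freedom.

We can also think of PCA as a form of feature selection, though not in the usual sense: it first maps the original feature space to a new one whose axes are mutually orthogonal.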
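The mapping to an orthogonal feature space can be checked numerically with a few lines of NumPy. This is a sketch on synthetic data, not the author's code: two features are generated so that the second is roughly twice the first, mirroring the (1, 2), (2, 4) example, and PCA is computed directly from the singular value decomposition of the centered data (Python 3).

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=200)
# Two strongly correlated features: the second is about twice the first, plus a little noise.
data = np.column_stack([x, 2.0 * x + 0.1 * rng.normal(size=200)])

# PCA via SVD of the centered data; the rows of `components` are the principal directions.
centered = data - data.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Coordinates of every sample in the new, rotated feature space.
projected = centered @ components.T

# Share of the total variance carried by each new axis.
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
# Covariance between the new features: the off-diagonal entries vanish.
projected_covariance = np.cov(projected, rowvar=False)

print(components @ components.T)  # identity matrix: the new axes are orthonormal
print(explained_ratio)            # almost all variance lies on the first axis
print(projected_covariance)
```

Keeping only the first column of `projected` is the dimensionality reduction step: one number per sample retains nearly all of the variability of the original two features.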

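Now to the main topic: where PCA can be applied in practice. The first example is handwritten digit recognition (ultimately a supervised task, but PCA is used in the intermediate feature extraction step).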

In [1]:

import numpy as np
# A quick warm-up first
M = np.array([[1, 2], [2, 4]])
M

Out[1]:

array([[1, 2],
       [2, 4]])

In [2]:

np.linalg.matrix_rank(M, tol=None)
# The rank of M is 1

Out[2]:

1

In [3]:

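# Load the pixel data of the handwritten digit images. In image processing, apart from the various heuristics for extracting useful features,
# the most direct and common representation is the raw pixels; each pixel is a number that reflects its color (here, a gray level).
from sklearn.datasets import load_digits
digits = load_digits()
# These classic datasets are stored in a very uniform format. That is a good habit: a uniform interface makes them quick to use.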
digits

Out[3]:

{'DESCR': " Optical Recognition of Handwritten Digits Data Set\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 5620\n    :Number of Attributes: 64\n    :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n    :Missing Attribute Values: None\n    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n    :Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttp://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixels are counted in each block. This generates\nan input matrix of 8x8 where each element is an integer in the range\n0..16. This reduces dimensionality and gives invariance to small\ndistortions.\n\nFor info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\nT. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\nL. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n1994.\n\nReferences\n----------\n  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n    Graduate Studies in Science and Engineering, Bogazici University.\n  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n    Linear dimensionalityreduction using relevance weighted LDA. School of\n    Electrical and Electronic Engineering Nanyang Technological University.\n    2005.\n  - Claudio Gentile. 
A New Approximate Maximal Margin Classification\n    Algorithm. NIPS. 2000.\n",
 'data': array([[  0.,   0.,   5., ...,   0.,   0.,   0.],
        [  0.,   0.,   0., ...,  10.,   0.,   0.],
        [  0.,   0.,   0., ...,  16.,   9.,   0.],
        ..., 
        [  0.,   0.,   1., ...,   6.,   0.,   0.],
        [  0.,   0.,   2., ...,  12.,   0.,   0.],
        [  0.,   0.,  10., ...,  12.,   1.,   0.]]),
 'images': array([[[  0.,   0.,   5., ...,   1.,   0.,   0.],
         [  0.,   0.,  13., ...,  15.,   5.,   0.],
         [  0.,   3.,  15., ...,  11.,   8.,   0.],
         ..., 
         [  0.,   4.,  11., ...,  12.,   7.,   0.],
         [  0.,   2.,  14., ...,  12.,   0.,   0.],
         [  0.,   0.,   6., ...,   0.,   0.,   0.]],
 
        [[  0.,   0.,   0., ...,   5.,   0.,   0.],
         [  0.,   0.,   0., ...,   9.,   0.,   0.],
         [  0.,   0.,   3., ...,   6.,   0.,   0.],
         ..., 
         [  0.,   0.,   1., ...,   6.,   0.,   0.],
         [  0.,   0.,   1., ...,   6.,   0.,   0.],
         [  0.,   0.,   0., ...,  10.,   0.,   0.]],
 
        [[  0.,   0.,   0., ...,  12.,   0.,   0.],
         [  0.,   0.,   3., ...,  14.,   0.,   0.],
         [  0.,   0.,   8., ...,  16.,   0.,   0.],
         ..., 
         [  0.,   9.,  16., ...,   0.,   0.,   0.],
         [  0.,   3.,  13., ...,  11.,   5.,   0.],
         [  0.,   0.,   0., ...,  16.,   9.,   0.]],
 
        ..., 
        [[  0.,   0.,   1., ...,   1.,   0.,   0.],
         [  0.,   0.,  13., ...,   2.,   1.,   0.],
         [  0.,   0.,  16., ...,  16.,   5.,   0.],
         ..., 
         [  0.,   0.,  16., ...,  15.,   0.,   0.],
         [  0.,   0.,  15., ...,  16.,   0.,   0.],
         [  0.,   0.,   2., ...,   6.,   0.,   0.]],
 
        [[  0.,   0.,   2., ...,   0.,   0.,   0.],
         [  0.,   0.,  14., ...,  15.,   1.,   0.],
         [  0.,   4.,  16., ...,  16.,   7.,   0.],
         ..., 
         [  0.,   0.,   0., ...,  16.,   2.,   0.],
         [  0.,   0.,   4., ...,  16.,   2.,   0.],
         [  0.,   0.,   5., ...,  12.,   0.,   0.]],
 
        [[  0.,   0.,  10., ...,   1.,   0.,   0.],
         [  0.,   2.,  16., ...,   1.,   0.,   0.],
         [  0.,   0.,  15., ...,  15.,   0.,   0.],
         ..., 
         [  0.,   4.,  16., ...,  16.,   6.,   0.],
         [  0.,   8.,  16., ...,  16.,   8.,   0.],
         [  0.,   1.,   8., ...,  12.,   1.,   0.]]]),
 'target': array([0, 1, 2, ..., 8, 9, 8]),
 'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])}

In [4]:

# The usual routine
X_digits, y_digits = digits.data, digits.target

In [11]:

from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
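# The key parameter is n_components; here we keep 2 principal components

estimator = PCA(n_components=2)

X_pca = estimator.fit_transform(X_digits)
# The scikit-learn interface is designed very consistently.

# Clustering problems often call for a visual view of the data, and that is one direct purpose of dimensionality reduction; so we plot the result here.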

def plot_pca_scatter():
    colors = ['black', 'blue', 'purple', 'yellow', 'white', 'red', 'lime', 'cyan', 'orange', 'gray']
    for i in xrange(len(colors)):
        px = X_pca[:, 0][y_digits == i]
        py = X_pca[:, 1][y_digits == i]
        plt.scatter(px, py, c=colors[i])
    plt.legend(digits.target_names)
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
    plt.show()
    
plot_pca_scatter()
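A two-component projection throws away most of the 64 pixel dimensions, so it is worth checking how much of the variance the scatter plot actually retains. The sketch below is an addition for illustration (Python 3); it only relies on the explained_variance_ratio_ attribute of a fitted PCA.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA(n_components=2).fit(digits.data)

# Fraction of the total variance carried by each of the two components.
print(pca.explained_variance_ratio_)
# Fraction of the total variance that survives in the 2-D plot.
retained = pca.explained_variance_ratio_.sum()
print(retained)
```

A low retained fraction means that classes overlapping in the plot may still be separable in the full feature space.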

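2.2 Clustering Algorithms (K-means, etc.)

~~~~~~~~~~~~~~~~~~~~~~~~~~~

Clustering means finding "similar" items, but defining similarity takes two ingredients: a feature representation and a distance measure. Both affect the conclusion about what counts as similar.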
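How much the distance measure matters can be seen with three made-up points: under Euclidean distance the query is closest to one candidate, under cosine distance to the other. The points and names are hypothetical, and the sketch uses Python 3 with the standard library only.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: depends only on direction, not on length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

query = (1.0, 1.0)
candidates = {
    'same_direction_far': (10.0, 10.0),
    'different_direction_near': (2.0, 0.0),
}

nearest_euclidean = min(candidates, key=lambda name: euclidean_distance(query, candidates[name]))
nearest_cosine = min(candidates, key=lambda name: cosine_distance(query, candidates[name]))

print(nearest_euclidean)  # different_direction_near
print(nearest_cosine)     # same_direction_far
```

For text represented as word counts, for instance, cosine distance treats a long and a short document on the same topic as similar, while Euclidean distance does not.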

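2.3 RBM

~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because deep learning is so popular, scikit-learn has added a related model, the RBM (restricted Boltzmann machine). I have heard that later versions will also bring deep supervised models such as DBNs, which I am looking forward to.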
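The original notes give no code for this subsection. As a hedged sketch of how the RBM can serve as an unsupervised feature extractor, the following uses scikit-learn's BernoulliRBM on the digits data; the number of hidden units, learning rate and iteration count are arbitrary choices made for illustration (Python 3).

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM

digits = load_digits()
# BernoulliRBM expects inputs in [0, 1]; the pixel values range from 0 to 16.
X = digits.data / 16.0

# Arbitrary illustrative settings: 32 hidden units, 5 passes over the data.
rbm = BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=5, random_state=0)

# fit_transform learns the weights without labels and returns the hidden-unit
# activation probabilities, which can be used as features for a classifier.
hidden = rbm.fit_transform(X)
print(hidden.shape)
```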

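3. Feature and Model Selection (Advanced Topics)

~~~~~~~~~~~~~~~~~~~~~~~~~~~

3.1 Feature Selection (feature_selection)

Here we continue with the Titanic dataset, this time focusing on selecting the few features that contribute most to the model's discriminative power.

Poor features drag down a model's accuracy; redundant features may not hurt accuracy, but they waste CPU time.

As I understand it, this kind of feature selection differs somewhat from PCA-style compression into principal components: after PCA, we can no longer interpret what the transformed features mean.

[Source Code]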

In [1]:

# This code has the same effect as Chapter 4 of the original book, but making full use of pandas is more concise, so I rewrote it to be clearer and shorter.
import pandas as pd
import numpy as np

titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')

print titanic.info()
# The same data as before
titanic.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1313 entries, 0 to 1312
Data columns (total 11 columns):
row.names    1313 non-null int64
pclass       1313 non-null object
survived     1313 non-null int64
name         1313 non-null object
age          633 non-null float64
embarked     821 non-null object
home.dest    754 non-null object
room         77 non-null object
ticket       69 non-null object
boat         347 non-null object
sex          1313 non-null object
dtypes: float64(1), int64(2), object(8)
memory usage: 123.1+ KB
None

In [2]:

# Drop the columns that are too specific to individual passengers to reveal common patterns (row.names, name), and split off the target column.

y = titanic['survived']
X = titanic.drop(['row.names', 'name', 'survived'], axis = 1)

In [3]:

# Fill missing values of the continuous numeric feature with its mean
X['age'].fillna(X['age'].mean(), inplace=True)

X.fillna('UNKNOWN', inplace=True)

In [4]:

# The remaining categorical columns are vectorized directly; this way a missing value in a column is also treated as a feature of its own

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))

In [5]:

print len(vec.feature_names_)

In [6]:

X_train.toarray()

Out[6]:

array([[ 31.19418104,   0.        ,   0.        , ...,   0.        ,
          0.        ,   1.        ],
       [ 31.19418104,   0.        ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [ 31.19418104,   0.        ,   0.        , ...,   0.        ,
          0.        ,   1.        ],
       ..., 
       [ 12.        ,   0.        ,   0.        , ...,   0.        ,
          0.        ,   1.        ],
       [ 18.        ,   0.        ,   0.        , ...,   0.        ,
          0.        ,   1.        ],
       [ 31.19418104,   0.        ,   0.        , ...,   0.        ,
          0.        ,   1.        ]])

In [7]:

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(X_train, y_train)
dt.score(X_test, y_test)
# Test accuracy using all features

Out[7]:

0.81762917933130697

In [8]:

from sklearn import feature_selection
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)

X_train_fs = fs.fit_transform(X_train, y_train)
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
dt.score(X_test_fs, y_test)
# Test accuracy using the 20% most predictive features

Out[8]:

0.82370820668693012

In [9]:

from sklearn.cross_validation import cross_val_score
percentiles = range(1, 100, 2)

results = []

for i in percentiles:
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile = i)
    X_train_fs = fs.fit_transform(X_train, y_train)
    scores = cross_val_score(dt, X_train_fs, y_train, cv=5)
    results = np.append(results, scores.mean())
print results

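# np.where returns an array of indices; take the first one to get a scalar index
opt = np.where(results == results.max())[0][0]
# Note: the value printed is the best percentile of features, not a feature count
print 'Optimal number of features %d' %percentiles[opt]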
import pylab as pl

pl.plot(percentiles, results)
pl.show()

[ 0.85063904  0.85673057  0.87501546  0.88622964  0.86590394  0.87097506
  0.87303649  0.86997526  0.87097506  0.87300557  0.86997526  0.86893424
  0.87098536  0.86490414  0.86385281  0.86791383  0.86488353  0.86892393
  0.86791383  0.86284271  0.86487322  0.86792414  0.86894455  0.87303649
  0.86892393  0.86998557  0.86689342  0.86488353  0.86895485  0.86689342
  0.87198516  0.8638322   0.86488353  0.87402597  0.87299526  0.87098536
  0.86997526  0.86892393  0.86794475  0.86486291  0.87096475  0.86587302
  0.86387343  0.86083282  0.86589363  0.8608019   0.86492476  0.85774067
  0.8608122   0.85779221]
Optimal number of features 7

In [10]:

from sklearn import feature_selection
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=7)

X_train_fs = fs.fit_transform(X_train, y_train)
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
dt.score(X_test_fs, y_test)
# Test accuracy using the best feature percentile found by the search

Out[10]:

0.8571428571428571

In [ ]:

# As this shows, the technique is very helpful for improving accuracy in practice.
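One caveat about the search loop above: the selector is fit on the whole training set before cross_val_score runs, so each validation fold has already influenced which features were kept. A common alternative, sketched here under assumptions rather than taken from the original notes, is to put the selector and the classifier in one Pipeline and let GridSearchCV choose the percentile, so that selection is redone inside every fold. The sketch uses Python 3, a recent scikit-learn, the bundled digits data (to avoid a download), and an arbitrary percentile grid.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
# A subset keeps the example fast; chi2 requires non-negative features, which pixels are.
X, y = digits.data[:600], digits.target[:600]

pipeline = Pipeline([
    ('select', SelectPercentile(chi2)),
    ('tree', DecisionTreeClassifier(criterion='entropy', random_state=33)),
])

# The percentile is searched like any other hyperparameter; the selector is
# refit on the training part of each fold, never on the validation part.
search = GridSearchCV(pipeline, {'select__percentile': [10, 30, 50, 70, 100]}, cv=5)
search.fit(X, y)

best_percentile = search.best_params_['select__percentile']
print(best_percentile, search.best_score_)
```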

~~~~~~~~~~~~~~~~~~~~~~~~~

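3.2 Model (Hyperparameter) Selection

Because the hyperparameter space is unbounded, any configuration we settle on is only a better one, never provably the best. In practice we usually rely on grid search (GridSearch), a brute-force sweep over a hyperparameter space discretized at fixed steps; each combination of hyperparameters is plugged into the learner and treated as a new model. To compare these models, each one is evaluated on the same training and development data, usually by cross-validation. The process is therefore very time-consuming, but once a good set of parameters is found it can be kept in use for some time, so the effort pays off. Fortunately, the cross-validation runs of the different candidate models are independent of one another, so the search can be parallelized across multiple cores or even distributed computing resources (Parallel Grid Search). [Source Code]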
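The cost of a grid search is simply the number of grid points times the number of cross-validation folds. The following standard-library sketch (Python 3) enumerates the same gamma and C values that the cell below searches and counts the resulting fits; the variable names are made up for illustration.

```python
import itertools

# Same grid as np.logspace(-2, 1, 4) and np.logspace(-1, 1, 3) in the cell below.
gammas = [10.0 ** exponent for exponent in range(-2, 2)]  # 0.01, 0.1, 1.0, 10.0
costs = [10.0 ** exponent for exponent in range(-1, 2)]   # 0.1, 1.0, 10.0
cv_folds = 3

# Every (C, gamma) pair is one candidate model.
candidates = [{'svc__gamma': gamma, 'svc__C': cost}
              for cost, gamma in itertools.product(costs, gammas)]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 12 36
```

Because the 36 fits are independent, GridSearchCV can run them in parallel through its n_jobs parameter (for example n_jobs=-1 to use all available cores).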

In [1]:

from sklearn.datasets import fetch_20newsgroups
import numpy as np
news = fetch_20newsgroups(subset='all')

In [2]:

# First we use the single-core version of grid_search
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(news.data[:3000], news.target[:3000], test_size=0.25, random_state=33)


from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', analyzer='word')), ('svc', SVC())])

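# The two hyperparameters take 4 and 3 candidate values respectively; svc__gamma ranges over 10^-2, 10^-1, 10^0, 10^1
# This gives 12 hyperparameter combinations, that is, 12 differently configured models
parameters = {'svc__gamma': np.logspace(-2, 1, 4), 'svc__C': np.logspace(-1, 1, 3)}

# Each model is cross-validated 3 times, so 36 fits are needed in total. According to the log below, each fit takes about 5 seconds on a single thread.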
gs = GridSearchCV(clf, parameters, verbose=2, refit=True, cv=3)

%time _=gs.fit(X_train, y_train)
gs.best_params_, gs.best_score_

print(gs.score(X_test, y_test))

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] svc__gamma=0.01, svc__C=0.1 .....................................
[CV] ............................ svc__gamma=0.01, svc__C=0.1 -   5.1s
[CV] svc__gamma=0.01, svc__C=0.1 .....................................
[CV] ............................ svc__gamma=0.01, svc__C=0.1 -   5.3s
[CV] svc__gamma=0.01, svc__C=0.1 .....................................
[CV] ............................ svc__gamma=0.01, svc__C=0.1 -   5.2s
[CV] svc__gamma=0.1, svc__C=0.1 ......................................
[CV] ............................. svc__gamma=0.1, svc__C=0.1 -   5.1s
[CV] svc__gamma=0.1, svc__C=0.1 ......................................
[CV] ............................. svc__gamma=0.1, svc__C=0.1 -   5.2s
[CV] svc__gamma=0.1, svc__C=0.1 ......................................
[CV] ............................. svc__gamma=0.1, svc__C=0.1 -   5.3s
[CV] svc__gamma=1.0, svc__C=0.1 ......................................
[CV] ............................. svc__gamma=1.0, svc__C=0.1 -   5.7s
[CV] svc__gamma=1.0, svc__C=0.1 ......................................
[CV] ............................. svc__gamma=1.0, svc__C=0.1 -   5.8s
[CV] svc__gamma=1.0, svc__C=0.1 ......................................
[CV] ............................. svc__gamma=1.0, svc__C=0.1 -   5.9s
[CV] svc__gamma=10.0, svc__C=0.1 .....................................
[CV] ............................ svc__gamma=10.0, svc__C=0.1 -   5.4s
[CV] svc__gamma=10.0, svc__C=0.1 .....................................
[CV] ............................ svc__gamma=10.0, svc__C=0.1 -   5.5s
[CV] svc__gamma=10.0, svc__C=0.1 .....................................
[CV] ............................ svc__gamma=10.0, svc__C=0.1 -   5.5s
[CV] svc__gamma=0.01, svc__C=1.0 .....................................
[CV] ............................ svc__gamma=0.01, svc__C=1.0 -   5.2s
[CV] svc__gamma=0.01, svc__C=1.0 .....................................
[CV] ............................ svc__gamma=0.01, svc__C=1.0 -   5.3s
[CV] svc__gamma=0.01, svc__C=1.0 .....................................
[CV] ............................ svc__gamma=0.01, svc__C=1.0 -   5.3s
[CV] svc__gamma=0.1, svc__C=1.0 ......................................
[CV] ............................. svc__gamma=0.1, svc__C=1.0 -   5.2s
[CV] svc__gamma=0.1, svc__C=1.0 ......................................
[CV] ............................. svc__gamma=0.1, svc__C=1.0 -   5.3s
[CV] svc__gamma=0.1, svc__C=1.0 ......................................
[CV] ............................. svc__gamma=0.1, svc__C=1.0 -   5.4s
[CV] svc__gamma=1.0, svc__C=1.0 ......................................
[CV] ............................. svc__gamma=1.0, svc__C=1.0 -   5.3s
[CV] svc__gamma=1.0, svc__C=1.0 ......................................
[CV] ............................. svc__gamma=1.0, svc__C=1.0 -   5.4s
[CV] svc__gamma=1.0, svc__C=1.0 ......................................
[CV] ............................. svc__gamma=1.0, svc__C=1.0 -   5.5s
[CV] svc__gamma=10.0, svc__C=1.0 .....................................
[CV] ............................ svc__gamma=10.0, svc__C=1.0 -   5.4s
[CV] svc__gamma=10.0, svc__C=1.0 .....................................
[CV] ............................ svc__gamma=10.0, svc__C=1.0 -   5.3s
[CV] svc__gamma=10.0, svc__C=1.0 .....................................
[CV] ............................ svc__gamma=10.0, svc__C=1.0 -   5.4s
[CV] svc__gamma=0.01, svc__C=10.0 ....................................
[CV] ........................... svc__gamma=0.01, svc__C=10.0 -   5.2s
[CV] svc__gamma=0.01, svc__C=10.0 ....................................
[CV] ........................... svc__gamma=0.01, svc__C=10.0 -   5.2s
[CV] svc__gamma=0.01, svc__C=10.0 ....................................
[CV] ........................... svc__gamma=0.01, svc__C=10.0 -   5.3s
[CV] svc__gamma=0.1, svc__C=10.0 .....................................
[CV] ............................ svc__gamma=0.1, svc__C=10.0 -   5.3s
[CV] svc__gamma=0.1, svc__C=10.0 .....................................
[CV] ............................ svc__gamma=0.1, svc__C=10.0 -   5.4s
[CV] svc__gamma=0.1, svc__C=10.0 .....................................
[CV] ............................ svc__gamma=0.1, svc__C=10.0 -   5.4s
[CV] svc__gamma=1.0, svc__C=10.0 .....................................
[CV] ............................ svc__gamma=1.0, svc__C=10.0 -   5.3s
[CV] svc__gamma=1.0, svc__C=10.0 .....................................
[CV] ............................ svc__gamma=1.0, svc__C=10.0 -   5.5s
[CV] svc__gamma=1.0, svc__C=10.0 .....................................
[CV] ............................ svc__gamma=1.0, svc__C=10.0 -   5.7s
[CV] svc__gamma=10.0, svc__C=10.0 ....................................
[CV] ........................... svc__gamma=10.0, svc__C=10.0 -   5.6s
[CV] svc__gamma=10.0, svc__C=10.0 ....................................
[CV] ........................... svc__gamma=10.0, svc__C=10.0 -   5.6s
[CV] svc__gamma=10.0, svc__C=10.0 ....................................
[CV] ........................... svc__gamma=10.0, svc__C=10.0 -   5.9s

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    5.1s
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  3.3min finished

Wall time: 3min 27s
0.822666666667

In [3]:

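# Next, run the search in parallel on all CPU cores (n_jobs=-1) and observe the improvement in wall time.
# (scikit-learn releases before 0.18 exposed these in sklearn.cross_validation and sklearn.grid_search.)

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV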
X_train, X_test, y_train, y_test = train_test_split(news.data[:3000], news.target[:3000], test_size=0.25, random_state=33)


from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', analyzer='word')), ('svc', SVC())])

parameters = {'svc__gamma': np.logspace(-2, 1, 4), 'svc__C': np.logspace(-1, 1, 3)}


gs = GridSearchCV(clf, parameters, verbose=2, refit=True, cv=3, n_jobs=-1)

%time _=gs.fit(X_train, y_train)
gs.best_params_, gs.best_score_
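print(gs.score(X_test, y_test))
# The parallel search finds the same best hyperparameter configuration, while the training time drops roughly in proportion to the number of CPU cores.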

[Parallel(n_jobs=-1)]: Done   1 jobs       | elapsed:    8.4s
[Parallel(n_jobs=-1)]: Done  22 out of  36 | elapsed:   30.3s remaining:   19.2s
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:   46.8s finished

Fitting 3 folds for each of 12 candidates, totalling 36 fits
Wall time: 56.5 s
0.822666666667
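For readers who want to try the parallel search without downloading the 20 newsgroups corpus, the following is a small self-contained sketch. It is not the experiment above: the bundled digits dataset, the MinMaxScaler step and the pipeline step names are assumptions made for illustration; only the grid shape, cv=3 and n_jobs=-1 mirror the cells above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in data (assumption): the bundled digits set, so nothing is downloaded.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# Scale pixels to [0, 1] so the gamma range below is sensible for an RBF SVM.
clf = Pipeline([("scale", MinMaxScaler()), ("svc", SVC())])
parameters = {
    "svc__gamma": np.logspace(-2, 1, 4),
    "svc__C": np.logspace(-1, 1, 3),
}

# n_jobs=-1 asks joblib to use every available core.
gs = GridSearchCV(clf, parameters, cv=3, refit=True, n_jobs=-1)
gs.fit(X_train, y_train)

print(gs.best_params_, gs.best_score_)
test_score = gs.score(X_test, y_test)
print(test_score)
```

Because refit=True, the best configuration is retrained on the whole training split after the search, which is why gs.score can be called directly on the held-out test data.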

In [ ]:

# For reference, here is the configuration of the machine that produced these results, to put the parallel timings in context.
'''
CPU: quad-core i7, 2.4 GHz
Memory: DDR3 1600, 32 GB

'''
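The two wall times reported above can be turned into a speedup and a parallel efficiency with a few lines of arithmetic. This is a rough sketch: it assumes the 4 physical cores listed in the machine configuration are the right denominator and ignores any hyper-threading.

```python
serial_seconds = 3 * 60 + 27   # "Wall time: 3min 27s" with n_jobs=1
parallel_seconds = 56.5        # "Wall time: 56.5 s" with n_jobs=-1
n_cores = 4                    # quad-core i7 (assumption: physical cores only)

speedup = serial_seconds / parallel_seconds
efficiency = speedup / n_cores

print("speedup: %.2fx, efficiency: %.0f%%" % (speedup, efficiency * 100))
# prints "speedup: 3.66x, efficiency: 92%"
```

A speedup of about 3.66x on four cores is close to linear; the remainder is plausibly process start-up and data transfer overhead.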

4. Trying powerful (popular) model packages (advanced topic)

~~~~~~~~~~~~~~~~~~~~~~~~~~~

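This topic has several independent parts. The experiments with Xgboost and Tensorflow require a Linux environment; I will try them on the iMac once I am back in China :).

Still, there is some slightly more advanced NLP material worth discussing, including a Kaggle project in which Word2Vec helps a sentiment analysis task. Let us analyze that one first.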

https://www.kaggle.com/c/word2vec-nlp-tutorial

4.1. How word vectors help NLP-related tasks
