An Introduction to Machine Learning with Python and Scikit-Learn

Posted by xeonqq on 2015-07-21

Hello, %username%!

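My name is Alex. I work on machine learning and network graph analysis (mostly theory), and I also build big data products for a Russian mobile operator. This is my first article online, so please go easy on me.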

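These days many people want to build efficient algorithms and take part in machine learning competitions, so they come to me and ask: "Where do I start?" Some time ago I led the development of big data analysis tools for media and social networks at an organization under the Russian federal government. I still have some of the documentation my team used, and I am happy to share it. It assumes the reader already has a solid grounding in mathematics and machine learning (my team consisted mainly of graduates of MIPT, the Moscow Institute of Physics and Technology, and of the School of Data Analysis).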

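This article is a short introduction to data science, a field that has become very popular lately. Machine learning competitions are multiplying too (for example Kaggle and TunedIT), and the prize money is often substantial.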

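R and Python are the two tools data scientists use most. Each has its strengths and weaknesses, but Python has lately been pulling ahead on most fronts (just my humble opinion, and I use both). Much of this is thanks to the Scikit-Learn library, which combines thorough documentation with a rich set of machine learning algorithms.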

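Note that this article focuses on machine learning algorithms. Initial data analysis is usually better done with the Pandas package, and that is easy to pick up on your own. So let's concentrate on implementation. To be concrete, we assume the input is an object-feature matrix stored in a *.csv file.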

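Loading the data

First, the data has to be loaded into memory before we can work with it. Scikit-Learn uses NumPy arrays in its implementation, so we will load the *.csv file with NumPy. Let's download one of the datasets from the UCI Machine Learning Repository.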

import numpy as np
# Python 3 location of urlopen (Python 2 used urllib.urlopen)
from urllib.request import urlopen
# url with dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file
raw_data = urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the 8 feature columns from the target in the last column
X = dataset[:,0:8]
y = dataset[:,8]

We will use this dataset in all the examples below, that is, the feature matrix X and the target values y.
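
If the download is not available, the same split into features and target can be tried offline. The sketch below is only an illustration: the four rows are typed in by hand in the same layout (8 numeric columns followed by a 0/1 label), and the names csv_text, X_sample and y_sample are made up for this example.

```python
import io
import numpy as np

# Hand-typed illustrative rows: 8 numeric features, then a 0/1 label.
# They stand in for the real file so the sketch runs without a network.
csv_text = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
"""

# np.loadtxt accepts any file-like object, not only a path
sample = np.loadtxt(io.StringIO(csv_text), delimiter=",")

X_sample = sample[:, :-1]  # every column except the last one
y_sample = sample[:, -1]   # the last column is the target

print(X_sample.shape, y_sample.shape)
```

Slicing with :-1 and -1 avoids hard-coding the number of feature columns.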

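Data normalization

Most gradient-based methods (and a great many machine learning algorithms rely on them) are sensitive to the scale of the data. So before running an algorithm we usually normalize or standardize the data. Normalization in the min-max sense rescales every feature into the range from 0 to 1; note that the preprocessing.normalize function used below does something different and rescales each sample (row) to unit norm. Standardization preprocesses the data so that every feature has mean 0 and variance 1. Scikit-Learn already provides functions for this.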

from sklearn import preprocessing
# normalize: rescale each sample (row) to unit L2 norm
normalized_X = preprocessing.normalize(X)
# standardize: give each feature (column) zero mean and unit variance
standardized_X = preprocessing.scale(X)
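
To make the difference between these transformations visible, here is a small sketch on a made-up 3x2 matrix (X_demo is not the diabetes data). It also shows MinMaxScaler, which is the Scikit-Learn class that maps each feature onto the range from 0 to 1.

```python
import numpy as np
from sklearn import preprocessing

# Tiny made-up matrix: 3 samples (rows), 2 features (columns)
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# normalize: each ROW is rescaled to unit L2 norm
row_normalized = preprocessing.normalize(X_demo)
print(np.linalg.norm(row_normalized, axis=1))

# scale: each COLUMN ends up with mean 0 and standard deviation 1
standardized = preprocessing.scale(X_demo)
print(standardized.mean(axis=0), standardized.std(axis=0))

# MinMaxScaler: each COLUMN is mapped onto the range [0, 1]
min_max = preprocessing.MinMaxScaler().fit_transform(X_demo)
print(min_max.min(axis=0), min_max.max(axis=0))
```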

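Feature selection

Without doubt, the most important part of solving a problem is the ability to select, and even create, the right features. These activities are called feature selection and feature engineering. Feature engineering is a fairly creative process that leans on intuition and domain expertise, whereas for feature selection plenty of ready-made algorithms exist. Tree-based algorithms, for example, can compute how informative each feature is.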

from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)
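
The raw importance array is easier to read once it is sorted. A minimal sketch, assuming synthetic data from make_classification in place of X and y (the names X_demo, y_demo and forest are made up here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for X and y: 8 features, 3 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Feature indices sorted from most to least important
order = np.argsort(forest.feature_importances_)[::-1]
for idx in order:
    print("feature %d: %.3f" % (idx, forest.feature_importances_[idx]))
```

The importances are normalized so that they sum to 1, which makes them comparable across features of one model but not across different models.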

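The other methods rely on an efficient search over subsets of features to find the best one, meaning the subset on which the fitted model achieves the best quality. Recursive Feature Elimination (RFE) is one of these search algorithms, and Scikit-Learn provides it as well.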

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(X, y)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
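
Once RFE has been fitted, the selector can also cut the matrix down to the chosen columns. A sketch on synthetic data (X_demo, y_demo and selector are illustrative names; max_iter=1000 is just a generous setting chosen here so the solver converges):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_demo, y_demo)

# Keep only the selected columns
X_reduced = selector.transform(X_demo)

print(selector.support_)   # boolean mask, True for the kept features
print(selector.ranking_)   # rank 1 marks a kept feature
print(X_reduced.shape)
```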

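Algorithm development

As I said, Scikit-Learn already implements all the basic machine learning algorithms. Let's take a look at some of them.

Logistic regression

It is most often used for classification problems (binary classification), but multi-class classification (the so-called one-vs-all method) works too. The advantage of this algorithm is that for every object it outputs the probability of belonging to each class.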

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
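
The class probabilities mentioned above are available through predict_proba. A small sketch on synthetic data (the names X_demo, y_demo and clf are made up for this example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem standing in for X and y
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_demo, y_demo)

# One row per sample, one column per class; columns follow clf.classes_
proba = clf.predict_proba(X_demo[:5])
print(clf.classes_)
print(proba)
```

Each row of proba sums to 1, and predict simply returns the class with the largest probability.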

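Naive Bayes

This is also one of the best-known machine learning algorithms. Its main task is to recover the density of the data distribution from the training sample. The method often performs well on multi-class classification problems.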

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
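
The listings in this article score each model on the same data it was fitted on, which tends to give an optimistic picture. One common alternative is to hold out part of the data. A sketch with synthetic data (the 25% test share and all variable names are choices made for this illustration):

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for X and y
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     random_state=0)

# Keep a quarter of the samples aside for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)

# Accuracy on data the model has not seen during fitting
test_accuracy = metrics.accuracy_score(y_test, nb.predict(X_test))
print(test_accuracy)
```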

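k-nearest neighbors

The kNN (k-nearest neighbors) method is often used as one component of a more complex classification algorithm. For example, its estimate can serve as a feature of an object. Sometimes a plain kNN on well-chosen features performs very well. With well-tuned parameters (mainly the distance metric), the algorithm often gives good quality on regression problems.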

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
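
Since the distance metric and the number of neighbors matter so much for kNN, it can help to compare a few settings with cross-validation. A sketch on synthetic data; the candidate metrics and values of k are arbitrary picks for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=0)

# Mean 5-fold accuracy for each (metric, k) combination
scores = {}
for metric in ("euclidean", "manhattan"):
    for k in (3, 5, 11):
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
        scores[(metric, k)] = cross_val_score(knn, X_demo, y_demo, cv=5).mean()

best = max(scores, key=scores.get)
print(best, scores[best])
```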

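Decision trees

Classification and Regression Trees (CART) are often used in problems where objects have categorical features, and they handle both regression and classification. Decision trees are well suited to multi-class classification.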

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
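
As an illustration of the multi-class case, the sketch below fits a shallow tree on the three-class iris dataset that ships with Scikit-Learn and prints the learned rules. The depth limit of 2 is an arbitrary choice to keep the printout short, and export_text assumes a reasonably recent Scikit-Learn release.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris has three classes and is bundled with Scikit-Learn (no download)
iris = load_iris()

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Text rendering of the split rules and the class predicted in each leaf
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```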

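Support vector machines

SVM (support vector machine) is one of the most popular machine learning algorithms, used mainly for classification. Like logistic regression, SVM can perform multi-class classification with the help of the one-vs-all method.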

from sklearn import metrics
from sklearn.svm import SVC
# fit a SVM model to the data
model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
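
A note on the one-vs-all remark: Scikit-Learn's SVC handles multi-class input internally with a one-vs-one scheme, so an explicit one-vs-all (one-vs-rest) setup needs a wrapper. A sketch using OneVsRestClassifier on the bundled iris data:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

iris = load_iris()

# One binary SVC is trained per class: "this class" against all the others
ovr = OneVsRestClassifier(SVC())
ovr.fit(iris.data, iris.target)

print(len(ovr.estimators_))        # number of underlying binary classifiers
print(ovr.predict(iris.data[:3]))  # predicted class labels
```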

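Beyond classification and regression, Scikit-Learn offers a large number of more sophisticated algorithms, including clustering, as well as techniques for building composite models such as Bagging and Boosting.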
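
To give a feel for the composite techniques, here is a sketch that cross-validates one bagging and one boosting model on synthetic data. The particular classes (BaggingClassifier over decision trees, GradientBoostingClassifier) and their settings are picked for illustration only; Scikit-Learn ships several other ensemble estimators.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=0)

ensembles = {
    # Bagging: many trees, each fitted on a bootstrap sample, then averaged
    "bagging": BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=25, random_state=0),
    # Boosting: trees added one by one, each correcting its predecessors
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, ensemble in ensembles.items():
    results[name] = cross_val_score(ensemble, X_demo, y_demo, cv=5).mean()
    print(name, results[name])
```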

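How to optimize algorithm parameters

One of the hardest steps in building an effective algorithm is choosing the right parameters. It usually gets easier with experience, but one way or another a search is needed. Fortunately, Scikit-Learn provides a number of functions for this.

As an example, let's look at choosing the regularization parameter by trying several values one after another: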

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# prepare a range of alpha values to test
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)
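
Besides the single best value, the fitted search object keeps the score of every candidate in cv_results_. A sketch on synthetic regression data, assuming a Scikit-Learn version where GridSearchCV lives in sklearn.model_selection (the names X_demo, y_demo and search are made up):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for X and y
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=0)

alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001])
search = GridSearchCV(Ridge(), param_grid={"alpha": alphas}, cv=5)
search.fit(X_demo, y_demo)

# Mean cross-validated R^2 for each candidate alpha
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print("alpha=%g  mean R^2=%.4f" % (params["alpha"], score))

print(search.best_params_)
```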

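Sometimes it is more efficient to draw parameter values at random from a given range many times, estimate the quality of the algorithm for each draw, and keep the best one.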

import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)
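
A regularization strength often varies over several orders of magnitude, so sampling it on a log scale is a common variation on the uniform distribution used above. The sketch below assumes a SciPy version that provides scipy.stats.loguniform; the range from 1e-4 to 10 and the 20 iterations are arbitrary choices for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem standing in for X and y
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=0)

# Draw alpha so that each order of magnitude is equally likely
search = RandomizedSearchCV(Ridge(),
                            param_distributions={"alpha": loguniform(1e-4, 1e1)},
                            n_iter=20, cv=5, random_state=0)
search.fit(X_demo, y_demo)

print(search.best_params_["alpha"], search.best_score_)
```

Fixing random_state makes the sampled candidates repeatable between runs.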

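That covers the whole process of working with Scikit-Learn, except for writing the results back out to a file. I'll leave that as an exercise for you; one of Python's big advantages over R is its excellent documentation.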
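
For readers who want a starting point for that exercise, one possible approach is NumPy's savetxt. This is only a sketch: the file name predictions.csv, the temporary directory and the synthetic data are all assumptions made for the example.

```python
import os
import tempfile
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for X and y, plus a fitted model to predict with
X_demo, y_demo = make_classification(n_samples=100, n_features=8,
                                     random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
predicted = clf.predict(X_demo)

# Hypothetical output path; a temporary directory avoids clutter
out_path = os.path.join(tempfile.mkdtemp(), "predictions.csv")

# One predicted label per line, written as integers
np.savetxt(out_path, predicted, fmt="%d", delimiter=",")

# Read the file back to check the round trip
restored = np.loadtxt(out_path, delimiter=",")
print(out_path, restored[:10])
```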

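In the next article we will dig deeper into other topics. In particular we will touch on something very important: feature construction. I sincerely hope this material helps novice data scientists start solving practical machine learning problems as soon as possible. Finally, to those who are just beginning to take part in machine learning competitions, I wish you patience and every success!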
