Get started with sklearn in ten minutes: installation, getting data, and data preprocessing

Posted by weixin_34107955 on 2018-02-06

Original URL: https://blog.csdn.net/weixin_34107955/article/details/86943431

More practical write-ups are on my personal blog, http://blackblog.tech. Feel free to follow!

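sklearn (scikit-learn) is a widely used third-party Python module for machine learning that wraps the common machine learning algorithms.
It covers:
1. Classification
2. Regression
3. Clustering
4. Dimensionality reduction
5. Model selection
6. Preprocessing
This article starts with installing sklearn and then works up step by step from the basics.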

sklearn website: http://scikit-learn.org/stable/index.html
sklearn API: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

Installing sklearn

At the time of writing, the current sklearn release is 0.19.1.
Dependencies:
Python (>=2.6 or >=3.3)
NumPy (>=1.6.1)
SciPy (>=0.9)

To install with pip, run this in a terminal:

pip install -U scikit-learn

To install with Anaconda (recommended, because it already bundles NumPy, SciPy, and other common tools):

conda install scikit-learn

After installation, check the version from Python. If import sklearn raises no error, the installation succeeded.

>>> import sklearn
>>> sklearn.__version__
'0.19.1'
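
As a slightly fuller check, the sketch below prints the installed versions of sklearn and its two dependencies so they can be compared with the minimums listed above. It assumes all three packages import under their usual names; the versions printed depend on your environment.

```python
import numpy
import scipy
import sklearn

# Collect the version string reported by each package.
versions = {
    "scikit-learn": sklearn.__version__,
    "NumPy": numpy.__version__,
    "SciPy": scipy.__version__,
}
for name, version in versions.items():
    print(name, version)
```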

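Getting data

Machine learning algorithms usually need a lot of data. In sklearn there are two usual ways to get it: load one of the bundled datasets, or generate a dataset.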

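Loading a bundled dataset

sklearn ships with a number of datasets that can be used to test and analyze algorithms, which saves you the trouble of hunting for data yourself.
They include:
Iris dataset: load_iris()
Handwritten digits dataset: load_digits()
Diabetes dataset: load_diabetes()
Breast cancer dataset: load_breast_cancer()
Boston house prices dataset: load_boston() (deprecated in scikit-learn 1.0 and removed in 1.2)
Linnerud physical exercise dataset: load_linnerud()

Here we load the iris dataset as an example.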

# Load one of sklearn's bundled datasets
import sklearn.datasets as sk_datasets
iris = sk_datasets.load_iris()
iris_X = iris.data  # feature matrix
iris_y = iris.target  # labels
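
It can help to look at what a loader returns before using it. The sketch below inspects the iris and digits datasets; the shapes in the comments are those of the datasets as bundled with sklearn.

```python
from sklearn.datasets import load_digits, load_iris

iris = load_iris()
print(iris.data.shape)             # (150, 4): 150 samples, 4 features
print(iris.target.shape)           # (150,): one class label per sample
print(iris.feature_names)          # names of the four measurements
print(iris.target_names.tolist())  # ['setosa', 'versicolor', 'virginica']

digits = load_digits()
print(digits.data.shape)           # (1797, 64): each 8x8 image flattened to 64 values
```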

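Creating a dataset

sklearn's samples generator can create synthetic data. The generator functions are available directly from sklearn.datasets; the sklearn.datasets.samples_generator module that held them in 0.19 was deprecated in 0.22 and removed in 0.24.

Here we create sample data for a classification problem.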

# make_classification is available directly from sklearn.datasets
import sklearn.datasets as sk_datasets
X, y = sk_datasets.make_classification(n_samples=6, n_features=5, n_informative=2, n_redundant=3, n_classes=2, n_clusters_per_class=2, scale=1, random_state=20)
# Print each sample as "label: features"
for x_, y_ in zip(X, y):
    print(y_, end=": ")
    print(x_)

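Parameters:
n_features: total number of features; n_informative + n_redundant + n_repeated must not exceed it, and any remaining features are filled with random noise
n_informative: number of informative features
n_redundant: number of redundant features, generated as random linear combinations of the informative features
n_repeated: number of duplicated features, drawn at random from the informative and redundant features
n_classes: number of classes
n_clusters_per_class: number of clusters that make up each class
random_state: random seed, which makes the experiment repeatable
Constraint: n_classes * n_clusters_per_class must be less than or equal to 2^n_informative

Printed output: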

0: [ 0.64459602  0.92767918 -1.32091378 -1.25725859 -0.74386837]
0: [ 1.66098845  2.22206181 -2.86249859 -3.28323172 -1.62389676]
0: [ 0.27019475 -0.12572907  1.1003977  -0.6600737   0.58334745]
1: [-0.77182836 -1.03692724  1.34422289  1.52452016  0.76221055]
1: [-0.1407289   0.32675611 -1.41296696  0.4113583  -0.75833145]
1: [-0.76656634 -0.35589955 -0.83132182  1.68841011 -0.4153836 ]
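
The constraint in the parameter notes, n_classes * n_clusters_per_class <= 2^n_informative, is enforced by make_classification itself. The sketch below, with a sample count chosen only for illustration, generates a valid dataset and then shows that a configuration violating the constraint is rejected with a ValueError.

```python
from sklearn.datasets import make_classification

# Valid: 2 classes * 2 clusters = 4 <= 2**2 = 4
X, y = make_classification(n_samples=100, n_features=5, n_informative=2,
                           n_redundant=3, n_classes=2,
                           n_clusters_per_class=2, random_state=20)
print(X.shape, y.shape)  # (100, 5) (100,)

# Invalid: 3 classes * 2 clusters = 6 > 2**2 = 4
rejected = False
try:
    make_classification(n_samples=100, n_features=5, n_informative=2,
                        n_redundant=3, n_classes=3,
                        n_clusters_per_class=2, random_state=20)
except ValueError as err:
    rejected = True
    print("rejected:", err)
```

Raising n_informative to 3 (and lowering n_redundant to 2) would make the second call valid, since 2^3 = 8 >= 6.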

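Splitting the dataset

Machine learning workflows usually need the dataset to be split, typically into a training set and a test set. sklearn's model_selection module provides a function for this.
We split the iris dataset as an example.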

import sklearn.model_selection as sk_model_selection
X_train,X_test,y_train,y_test = sk_model_selection.train_test_split(iris_X,iris_y,train_size=0.3,random_state=20)

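Parameters:
arrays: the arrays to split, here the feature matrix and the labels
test_size:
    float - proportion of the samples to put in the test set (0.25 by default when train_size is also unset)
    int - absolute number of test samples
train_size: same as test_size, but for the training set
random_state: int - random seed (with a fixed seed, the experiment is reproducible)
shuffle: whether to shuffle the data before splitting (True by default)

The models we train later all use this split.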
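
As a quick sanity check on how the parameters behave, the standalone sketch below holds out 20% of the iris data and prints the resulting sizes. The variable names are illustrative, and the stratify argument is an addition not used in this article: it asks train_test_split to keep the class proportions the same in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target,
    test_size=0.2,         # hold out 20% of the 150 samples
    random_state=20,       # fixed seed for a reproducible split
    stratify=iris.target,  # keep class proportions equal in both parts
)
print(X_tr.shape, X_te.shape)  # (120, 4) (30, 4)
print(np.bincount(y_te))       # [10 10 10]
```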

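Data preprocessing

Why preprocess data at all?
Data collected in the real world usually contains a lot of useless information and sometimes outright errors, and machine learning has a saying: "Garbage in, garbage out." The health of the data has a huge effect on what an algorithm produces. Data preprocessing turns redundant, messy source data into data that meets the needs of the application.
Of course, preprocessing methods alone could fill an article of several thousand words, so only a few basic ones are covered here.
sklearn provides a package for data preprocessing, preprocessing, which we can import directly: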

import sklearn.preprocessing as sk_preprocessing

The examples below use [[1, -1, 2], [0, 2, -1], [0, 1, -2]] as the initial data.

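Normalizing the data

Standardization based on mean and std

X = [[1., -1., 2.], [0., 2., -1.], [0., 1., -2.]]  # the initial data given above
scaler = sk_preprocessing.StandardScaler().fit(X)  # learn each column's mean and std
new_X = scaler.transform(X)  # subtract the mean, divide by the std
print('Standardization based on mean and std:', new_X)

Printed output:

Standardization based on mean and std: [[ 1.41421356 -1.33630621  1.37281295]
 [-0.70710678  1.06904497 -0.39223227]
 [-0.70710678  0.26726124 -0.98058068]]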
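
The numbers above can be reproduced by hand. StandardScaler subtracts each column's mean and divides by its population standard deviation (ddof=0), so the NumPy expression in this sketch should match its output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1., -1., 2.], [0., 2., -1.], [0., 1., -2.]])

# Column-wise z-score: (x - mean) / std, using the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))  # True
print(scaled.mean(axis=0))          # approximately 0 for every column
print(scaled.std(axis=0))           # 1 for every column, up to rounding
```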

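Rescaling to a given range; feature_range sets the target range

X = [[1., -1., 2.], [0., 2., -1.], [0., 1., -2.]]  # same initial data as above
scaler = sk_preprocessing.MinMaxScaler(feature_range=(0, 1)).fit(X)
new_X = scaler.transform(X)
print('Rescaled to the given range:', new_X)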
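
Min-max scaling to (0, 1) maps each column with (x - min) / (max - min). The sketch below applies that formula by hand to the sample data and compares it with MinMaxScaler; the values in the final comment follow from the column minimums [0, -1, -2] and maximums [1, 2, 2].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1., -1., 2.], [0., 2., -1.], [0., 1., -2.]])

# Column-wise min-max formula: (x - min) / (max - min)
col_min, col_max = X.min(axis=0), X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(np.allclose(manual, scaled))  # True
print(scaled)
# rows, rounded: [1, 0, 1], [0, 1, 0.25], [0, 0.667, 0]
```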