Python Machine Learning Crash Course | 1 | Data Import

Posted by 夜神moon on 2018-10-17

Original URL: https://flycode.co/archives/166939

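Main tasks:
① Carry out the common data import operations, including loading the data and filling in missing values
② Carry out the common data preparation steps for machine learning, including binarizing (one-hot encoding) features and splitting the data into a training set and a test set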

# -*- coding: utf-8 -*-
"""
Created on Wed Oct 17 00:26:22 2018

@author: Administrator
"""
%reset -f
%clear
# In[*]
## Step 1: Import the libraries
#Day 1: Data Preprocessing

#Step 1: Importing the libraries
import numpy as np
import pandas as pd
import os
os.chdir("E:multimlcoad")
# In[*]
#Step 2: Importing dataset
dataset = pd.read_csv("coad_messa.csv",header=0,index_col=0)

X = dataset.iloc[ : , :-1].values
Y = dataset.iloc[ : , 6].values
# In[*]
print("Step 2: Importing dataset")
print("X")
print(X)
print("Y")
print(Y)

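This step loads the data. The first 6 columns are the input data used for prediction, including gender, stage and so on, and we assign them to X. The output data, that is, the prediction target, is a characteristic of the patient, such as tumor or normal, and we assign it to Y.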

 Step 2: Importing dataset
X
[[61.  0.  1.  1.  1.  1.]
 [67.  1.  3.  1.  2.  3.]
 [42.  0.  2.  2.  1.  1.]
 ...
 [44.  0.  2.  1.  2.  1.]
 [82.  1.  2.  1.  2.  1.]
 [52.  0.  2.  2.  1.  1.]]
Y
[0. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1.
 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0.
 0. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1.
 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0.
 1. 1. 0. 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 1.
 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1.
 1. 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 1. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0.
 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1.
 1. 1. 0.]
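
The feature/target split from Step 2 can be tried on a small made-up table before running it on the real file. This is only a sketch: the column names and values below are assumptions for illustration and do not come from coad_messa.csv.

```python
import pandas as pd

# Hypothetical toy table standing in for the real CSV; the column names
# and values are made up for illustration.
toy = pd.DataFrame(
    {
        "age": [61, 67, 42, 44],
        "gender": [0, 1, 0, 0],
        "stage": [1, 3, 2, 2],
        "label": [0, 0, 1, 1],
    },
    index=["p1", "p2", "p3", "p4"],
)

# Every column except the last is a feature; the last column is the target.
X_toy = toy.iloc[:, :-1].values
Y_toy = toy.iloc[:, -1].values

print(X_toy.shape)  # (4, 3)
print(Y_toy.shape)  # (4,)
```

Selecting the target with -1 instead of a fixed position keeps the two slices consistent if a column is later added or removed.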

# In[*]
#Step 3: Handling the missing data
from sklearn.impute import SimpleImputer
#Imputer was removed in scikit-learn 0.22; SimpleImputer fills each NaN with its column mean
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
imputer = imputer.fit(X[ : , 1:3])
X[ : , 1:3] = imputer.transform(X[ : , 1:3])
# In[*]
print("---------------------")
print("Step 3: Handling the missing data")
print("step2")
print("X")
print(X)
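
To see what mean imputation does in Step 3, it may help to run it on a tiny array where the result can be checked by hand. The sketch below uses SimpleImputer from sklearn.impute, the class that recent scikit-learn releases provide for this; the array values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up values; np.nan marks the missing entries.
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])

# strategy="mean" replaces each NaN with the mean of its own column,
# computed from the values that are present.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_toy)

print(X_filled)
```

Here the missing entry in the first column becomes 2.0 (the mean of 1 and 3) and the one in the second column becomes 15.0 (the mean of 10 and 20).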
# In[*]
#Step 4: Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[ : , 2] = labelencoder_X.fit_transform(X[ : , 2])
# In[*]
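#Creating a dummy variable (categorical_features was removed from OneHotEncoder in scikit-learn 0.22)
from sklearn.compose import ColumnTransformer
onehotencoder = ColumnTransformer([("onehot", OneHotEncoder(), [2])],
                                  remainder = "passthrough", sparse_threshold = 0)
X = onehotencoder.fit_transform(X)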
labelencoder_Y = LabelEncoder()
Y =  labelencoder_Y.fit_transform(Y)
# In[*]
print("---------------------")
print("Step 4: Encoding categorical data")
print("X")
print(X)
print("Y")
print(Y)

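This step mainly binarizes (one-hot encodes) the categorical data. Our data includes gender, which is either male or female. We could simply code it as 0 and 1, or as 1 and 2, but that is not always a reliable choice.

In feature engineering, LabelEncoder and OneHotEncoder are sometimes used for this. For example, in Kaggle datasets the sex attribute usually takes two values, male and female. The unreliable approach is to use 0 for male and 1 for female directly, because a model may read integer codes as ordered quantities, which matters most for features with more than two categories. One-hot encoding avoids this. First we use LabelEncoder to represent the discrete values in the sex column as numbers, which is the process shown above: string-valued attributes such as male and female are converted to numbers.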
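
The two-stage encoding described above can be sketched on a made-up sex column. The data and variable names are assumptions for illustration; the calls are the standard LabelEncoder and OneHotEncoder from sklearn.preprocessing.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up sex column with string categories.
sex = np.array(["male", "female", "female", "male"])

# LabelEncoder assigns integer codes in sorted order of the class names.
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(sex)
print(label_encoder.classes_)  # ['female' 'male']
print(codes)                   # [1 0 0 1]

# OneHotEncoder expects a 2-D array with one column per feature,
# so the 1-D codes are reshaped into a single column first.
onehot_encoder = OneHotEncoder()
onehot = onehot_encoder.fit_transform(codes.reshape(-1, 1)).toarray()
print(onehot)
```

Each row of the result has a single 1 in the column of its category, so no category is numerically larger than another.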

# In[*]
#Step 5: Splitting the datasets into training sets and Test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( X , Y ,
                                                    test_size = 0.2, 
                                                    random_state = 0)

# In[*]
print("---------------------")
print("Step 5: Splitting the datasets into training sets and Test sets")
print("X_train")
print(X_train)
print("X_test")
print(X_test)
print("Y_train")
print(Y_train)
print("Y_test")
print(Y_test)
# In[*]
#Step 6: Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# In[*]
print("---------------------")
print("Step 6: Feature Scaling")
print("X_train")
print(X_train)
print("X_test")
print(X_test)

In the end, we split the data into a training set (80%) and a test set (20%).
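
The 80/20 split and the scaling step can be checked on a small made-up array. This is a sketch under assumed data, not the article's dataset; it shows the resulting set sizes and that the scaler is fitted on the training set only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Ten made-up samples with two features each, plus made-up labels.
X_toy = np.arange(20, dtype=float).reshape(10, 2)
Y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 holds out 20% of the samples; random_state fixes the shuffle.
X_train, X_test, Y_train, Y_test = train_test_split(
    X_toy, Y_toy, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)

# Fit the scaler on the training set only, then reuse its mean and
# standard deviation to transform the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately [0. 0.]
```

Calling transform rather than fit_transform on the test set keeps test-set statistics out of the preprocessing, so the test set stays a fair stand-in for unseen data.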
