Machine Learning Basics Day 09

Posted by ThankCAT on 2023-03-31

Classification Algorithms: Logistic Regression

Logistic regression (Logistic Regression, LR for short) maps a set of input features to the probability of belonging to one of two classes, 0 or 1. In general, regression is not used for classification problems, because regression is a continuous model and is strongly affected by noise. When a regression-style model must be applied to classification, logistic regression is the one to use. Having studied linear regression first makes logistic regression much easier to understand.

Advantages: low computational cost; easy to understand and implement

Disadvantages: prone to underfitting; classification accuracy may not be high

Applicable data: numerical and nominal

Logistic Regression

Regression problems are covered in more detail later; in essence, logistic regression is linear regression with one extra layer of function mapping inserted between the features and the result. The features are first summed linearly, and the resulting value z is then passed through a function g(z), which serves as the hypothesis function for prediction. g(z) maps continuous values into the interval between 0 and 1. Logistic regression is therefore used for 0/1 problems, i.e., binary classification where the predicted result belongs to class 0 or class 1.

The mapping function is the sigmoid:

g(z) = 1 / (1 + e^(-z)), where z is the linear combination of the features

(Figure: the S-shaped sigmoid curve, which squashes any real input into the interval (0, 1).)
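A minimal sketch of this mapping in plain NumPy (the helper name sigmoid is our own, not part of the original post):

import numpy as np

def sigmoid(z):
    # Map any real value into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# A linear score z = w^T x + b becomes an interpretable probability:
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))  # approximately [0.0067 0.2689 0.5 0.7311 0.9933]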

sklearn.linear_model.LogisticRegression

The logistic regression class

class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
  """
  :param C: float,預設值:1.0

  :param penalty: 特徵選擇的方式

  :param tol: 公差停止標準
  """

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
# Load an example dataset (handwritten digits)
X, y = load_digits(return_X_y=True)
# liblinear is a solver that supports the 'l1' penalty
LR = LogisticRegression(C=1.0, penalty='l1', tol=0.01, solver='liblinear')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
LR.fit(X_train, y_train)
LR.predict(X_test)
LR.score(X_test, y_test)
0.96464646464646464
# with C=100.0
0.96801346801346799
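The second score above comes from refitting with C=100.0; a small loop like the following (our own sketch, continuing from the variables above) reproduces the comparison:

for C in (1.0, 100.0):
    model = LogisticRegression(C=C, penalty='l1', tol=0.01, solver='liblinear')
    model.fit(X_train, y_train)
    print(C, model.score(X_test, y_test))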

Attributes

coef_

Coefficients of the features in the decision function

Cs_

Array of C values, i.e., the inverses of the regularization parameter values used for cross-validation (note that this attribute is exposed by LogisticRegressionCV, the cross-validated variant, rather than by plain LogisticRegression)
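A quick way to see both attributes in action, as a minimal sketch on a built-in dataset (LogisticRegressionCV is the variant that actually carries Cs_):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegressionCV(Cs=5, cv=3, max_iter=5000).fit(X, y)
print(clf.coef_)  # feature coefficients of the decision function
print(clf.Cs_)    # the grid of C values tried during cross-validation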

Characteristics

Linear classifiers are arguably the most basic and most commonly used machine learning models. Although they are constrained by the assumption of a linear relationship between the data features and the classification target, their performance can still serve as a baseline in scientific research and engineering practice.

Predicting Benign vs. Malignant Tumors

# Import modules
from sklearn.linear_model import LogisticRegression    # logistic regression
from sklearn.preprocessing import StandardScaler       # standardization
from sklearn.metrics import classification_report      # precision/recall report
from sklearn.model_selection import train_test_split   # data splitting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Build the list of feature names
feature_names = ["Sample code number", "Clump Thickness",
                 "Uniformity of Cell Size", "Uniformity of Cell Shape", "Marginal Adhesion", "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class"]
# Read the data
data = pd.read_csv("./breast-cancer-wisconsin.data", names=feature_names)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Sample code number           699 non-null    int64 
 1   Clump Thickness              699 non-null    int64 
 2   Uniformity of Cell Size      699 non-null    int64 
 3   Uniformity of Cell Shape     699 non-null    int64 
 4   Marginal Adhesion            699 non-null    int64 
 5   Single Epithelial Cell Size  699 non-null    int64 
 6   Bare Nuclei                  699 non-null    object
 7   Bland Chromatin              699 non-null    int64 
 8   Normal Nucleoli              699 non-null    int64 
 9   Mitoses                      699 non-null    int64 
 10  Class                        699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB
# Data cleaning
# Replace every "?" with NaN
data = data.replace(to_replace="?", value=np.nan)
# Convert the Bare Nuclei column to float
data["Bare Nuclei"] = data["Bare Nuclei"].astype("float")
# Fill every NaN with the mean of its column
data = data.fillna(data.mean())
# Check whether any NaN remains
data.isna().sum()
Sample code number             0
Clump Thickness                0
Uniformity of Cell Size        0
Uniformity of Cell Shape       0
Marginal Adhesion              0
Single Epithelial Cell Size    0
Bare Nuclei                    0
Bland Chromatin                0
Normal Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64
# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(data.iloc[:, 1:10], data.loc[:, "Class"])
# Standardize the features (fit the scaler on the training set only)
std = StandardScaler().fit(x_train)
x_train = std.transform(x_train)
x_test = std.transform(x_test)
# Predict tumor class with logistic regression
# Instantiate; params: C, the inverse of the regularization strength; penalty, the regularization norm
log = LogisticRegression(penalty="l2", C=1)
# Fit the model on the training set
log.fit(x_train, y_train)
LogisticRegression(C=1)
# View the accuracy score on the test set
score = log.score(x_test, y_test)
score
0.9542857142857143
# View the predicted classes
y_predict = log.predict(x_test)
y_predict
array([2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 4, 2, 4, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 4, 2, 4, 4, 2, 4, 2, 4, 2, 2,
       2, 2, 4, 4, 4, 2, 2, 4, 2, 2, 4, 4, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 4, 2, 2, 4, 4, 2, 2, 4, 2, 2, 2, 2, 2, 4, 2, 4, 2, 2, 2,
       2, 2, 2, 2, 2, 4, 2, 4, 2, 4, 2, 2, 2, 4, 4, 2, 2, 2, 2, 4, 2, 4,
       2, 2, 2, 2, 4, 4, 4, 2, 2, 2, 4, 4, 2, 4, 4, 2, 2, 4, 2, 2, 2, 4,
       2, 2, 2, 4, 2, 4, 2, 2, 2, 2, 4, 2, 2, 2, 4, 4, 2, 4, 2, 2, 4, 4,
       2, 4, 2, 4, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 4, 4, 4, 2, 2, 2, 2],
      dtype=int64)
# View the regression coefficients
log.coef_
array([[ 1.5859245 , -0.1730868 ,  0.79379743,  1.02638845,  0.00220408,
         1.44509346,  1.03614595,  0.6169744 ,  1.10281087]])
# View per-class precision and recall
recall = classification_report(y_test, log.predict(x_test), labels=[2, 4], target_names=["benign tumor", "malignant tumor"])
print(recall)
                 precision    recall  f1-score   support

   benign tumor       0.98      0.96      0.97       124
malignant tumor       0.91      0.94      0.92        51

       accuracy                           0.95       175
      macro avg       0.94      0.95      0.95       175
   weighted avg       0.96      0.95      0.95       175
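To see exactly where the remaining errors fall, a confusion matrix is a natural companion to this report (a minimal sketch continuing from the variables above):

from sklearn.metrics import confusion_matrix

# Rows are true classes (2 = benign, 4 = malignant), columns are predicted classes
print(confusion_matrix(y_test, y_predict, labels=[2, 4]))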

Unsupervised Learning: k-means

K-means, commonly known as Lloyd's algorithm, is the most classic and one of the easiest-to-understand models in data clustering. The algorithm runs in four stages (a from-scratch sketch follows below):

  • 1. First, randomly place K points in feature space as the initial cluster centers.
  • 2. Then, for each sample's feature vector, find the nearest of the K cluster centers and assign the sample to that center.
  • 3. Next, once every sample has been assigned to a center, recompute each of the K cluster centers as the mean of all samples currently assigned to it.
  • 4. Finally, compare the old and new centers: if no sample's assigned cluster changed from the previous iteration, stop; otherwise return to step 2 and repeat.

K-means is equivalent to the expectation-maximization algorithm with a small, all-equal, diagonal covariance matrix.
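A minimal from-scratch sketch of those four steps in NumPy (function and variable names are our own):

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Step 2: assign each sample to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once no assignment changed from the previous iteration
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its assigned samples
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

centers, labels = lloyd_kmeans(np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float), k=2)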

sklearn.cluster.KMeans

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')
  """
  :param n_clusters:要形成的聚類數以及生成的質心數

  :param init:初始化方法,預設為'k-means ++',以智慧方式選擇k-均值聚類的初始聚類中心,以加速收斂;random,從初始質心資料中隨機選擇k個觀察值(行

  :param n_init:int,預設值:10使用不同質心種子執行k-means演算法的時間。最終結果將是n_init連續執行在慣性方面的最佳輸出。

  :param n_jobs:int用於計算的作業數量。這可以透過平行計算每個執行的n_init。如果-1使用所有CPU。如果給出1,則不使用任何平行計算程式碼,這對除錯很有用。對於-1以下的n_jobs,使用(n_cpus + 1 + n_jobs)。因此,對於n_jobs = -2,所有CPU都使用一個。

  :param random_state:隨機數種子,預設為全域性numpy隨機數生成器
  """
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],[4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0)

Methods

fit(X,y=None)

Fit the model using X as the training data

kmeans.fit(X)

predict(X)

Predict which cluster each new sample belongs to

kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)

Attributes

cluster_centers_

Coordinates of the cluster centers

kmeans.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])

labels_

The cluster label of each point

kmeans.labels_

Clustering Users by Type

# Import modules
import pandas as pd
import matplotlib.pyplot as plt            # plotting (used for the final scatter plot)
from sklearn.cluster import KMeans         # clustering algorithm
from sklearn.decomposition import PCA      # dimensionality reduction
# products.csv                  product information
# order_products__prior.csv     order-to-product information
# orders.csv                    users' order information
# aisles.csv                    the concrete aisle (category) each product belongs to
# Read the data
products = pd.read_csv("../data/products.csv")
order_products__prior = pd.read_csv("../data/order_products__prior.csv")
orders = pd.read_csv("../data/orders.csv")
aisles = pd.read_csv("../data/aisles.csv")
products.columns
Index(['product_id', 'product_name', 'aisle_id', 'department_id'], dtype='object')
order_products__prior.columns
Index(['order_id', 'product_id', 'add_to_cart_order', 'reordered'], dtype='object')
orders.columns
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='object')
aisles.columns
Index(['aisle_id', 'aisle'], dtype='object')
data = pd.merge(products, order_products__prior, on="product_id")
data = pd.merge(data, orders, on="order_id")
data = pd.merge(data, aisles, on="aisle_id")
data.head()
   product_id                product_name  aisle_id  department_id  order_id  add_to_cart_order  reordered  user_id eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order          aisle
0           1  Chocolate Sandwich Cookies        61             19      1107                  7          0    38259    prior             2          1                 11                     7.0  cookies cakes
1           1  Chocolate Sandwich Cookies        61             19      5319                  3          1   196224    prior            65          1                 14                     1.0  cookies cakes
2           1  Chocolate Sandwich Cookies        61             19      7540                  4          1   138499    prior             8          0                 14                     7.0  cookies cakes
3           1  Chocolate Sandwich Cookies        61             19      9228                  2          0    79603    prior             2          2                 10                    30.0  cookies cakes
4           1  Chocolate Sandwich Cookies        61             19      9273                 30          0    50005    prior             1          1                 15                     NaN  cookies cakes
# Cross-tabulate purchase counts: one row per user, one column per aisle
cross = pd.crosstab(data["user_id"], data["aisle"])
cross

aisle air fresheners candles asian foods baby accessories baby bath body care baby food formula bakery desserts baking ingredients baking supplies decor beauty beers coolers ... spreads tea tofu meat alternatives tortillas flat bread trail mix snack mix trash bags liners vitamins supplements water seltzer sparkling water white wines yogurt
user_id
1 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 1
2 0 3 0 0 0 0 2 0 0 0 ... 3 1 1 0 0 0 0 2 0 42
3 0 0 0 0 0 0 0 0 0 0 ... 4 1 0 0 0 0 0 2 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 1 0 0
5 0 2 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
206205 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 5
206206 0 4 0 0 0 0 4 1 0 0 ... 1 0 0 0 0 1 0 1 0 0
206207 0 0 0 0 1 0 0 0 0 0 ... 3 4 0 2 1 0 0 11 0 15
206208 0 3 0 0 3 0 4 0 0 0 ... 5 0 0 7 0 0 0 0 0 33
206209 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 3

206209 rows × 134 columns

# Reduce dimensionality, keeping enough components to explain 90% of the variance
pca = PCA(n_components=0.9)
data_pca = pca.fit_transform(cross)
data_pca
array([[-2.42156587e+01,  2.42942720e+00, -2.46636975e+00, ...,
         6.86800336e-01,  1.69439402e+00, -2.34323022e+00],
       [ 6.46320807e+00,  3.67511165e+01,  8.38255336e+00, ...,
         4.12121252e+00,  2.44689740e+00, -4.28348478e+00],
       [-7.99030162e+00,  2.40438257e+00, -1.10300641e+01, ...,
         1.77534453e+00, -4.44194030e-01,  7.86665571e-01],
       ...,
       [ 8.61143331e+00,  7.70129866e+00,  7.95240226e+00, ...,
        -2.74252456e+00,  1.07112531e+00, -6.31925661e-02],
       [ 8.40862199e+01,  2.04187340e+01,  8.05410372e+00, ...,
         7.27554259e-01,  3.51339470e+00, -1.79079914e+01],
       [-1.39534562e+01,  6.64621821e+00, -5.23030367e+00, ...,
         8.25329076e-01,  1.38230701e+00, -2.41942061e+00]])
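To check what n_components=0.9 actually retained, the fitted PCA object can be inspected (a small sketch using standard sklearn attributes):

print(pca.n_components_)                    # how many components were kept
print(pca.explained_variance_ratio_.sum())  # total variance explained, >= 0.9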
# Instantiate KMeans with 4 clusters
km = KMeans(n_clusters=4)
# Fit on the dimensionality-reduced data
km.fit(data_pca)
D:\DeveloperTools\Anaconda\lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
KMeans(n_clusters=4)
# View the clustering result
km_predict = km.predict(data_pca)
km_predict
array([0, 3, 0, ..., 3, 1, 0])
# Visualize the clusters
plt.figure(figsize=(10, 10))
colored = ['orange', 'green', 'blue', 'purple']
colors = [colored[i] for i in km_predict]
# Scatter two of the principal components, colored by assigned cluster
plt.scatter(data_pca[:, 1], data_pca[:, 20], color=colors)
plt.show()
