Machine Learning (6): Regression Analysis — Multivariate Regression on Iris, Three-Class Logistic Regression with NumPy Only (sigmoid), and a RANSAC Line-Fitting Implementation

Published by 孤飛 on 2023-04-13

[Experiment 1: Regression Analysis]

I. Preliminaries

  1. Solving multivariate regression problems with gradient descent

  2. The dataset

The Iris dataset is a classic dataset, frequently used as an example in statistical learning and machine learning. It contains 150 records across 3 classes, 50 per class, and each record has 4 features: sepal length, sepal width, petal length, and petal width. From these 4 features one can predict which of the species (iris-setosa, iris-versicolour, iris-virginica) a given flower belongs to.


Features: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm)

Species: setosa, versicolor, virginica


We use 100 records as training data (iris-train.txt) and 50 records as test data (iris-test.txt).

Label encoding: 'setosa' → 0, 'versicolor' → 1, 'virginica' → 2
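
This mapping matches the copy of the dataset bundled with scikit-learn, so (assuming scikit-learn is installed) a quick sanity check is:

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # (150, 4): 150 records, 4 features
print(iris.target.shape)  # (150,): integer labels 0, 1, 2
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']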

II. Objectives

  1. Understand the principle and design of gradient descent;

  2. Learn to apply gradient descent to regression problems.

III. Experiment Content

Data loading (the first 4 columns are the features x, the 5th column is the label y):

import numpy as np

train_data = np.loadtxt('iris-train.txt', delimiter='\t')
X_train = train_data[:, :4]  # first four columns: features
y_train = train_data[:, 4]   # fifth column: label

  1. Design a multivariate regression method and solve for the parameters using the training data iris-train.txt

Approach: first determine the hypothesis h(x). Since there are 4 features, we take 5 parameters (one per feature plus a bias):

\[h_\theta(x)=\theta_0+\theta_1 x_1+\theta_2 x_2+\theta_3 x_3+\theta_4 x_4 \]

  2. Design a logistic regression method and solve for the parameters using the training data iris-train.txt

Approach: first determine h(x). Again with 4 features we take 5 parameters, passed through the sigmoid function:

\[h_\theta(x)=g(\theta^T x)=\frac{1}{1+e^{-\theta^T x}} \]

  3. Using the test data iris-test.txt, compute the logistic regression classification accuracy (ACC), Precision, and Recall (definitions just below).
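
For a given class, with TP, FP, FN, TN counted from the confusion matrix, the three metrics are defined as:

\[ACC=\frac{TP+TN}{TP+TN+FP+FN},\qquad Precision=\frac{TP}{TP+FP},\qquad Recall=\frac{TP}{TP+FN} \]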

IV. Methods and Procedure

1. For the multivariate regression:

For the squared-error cost used below, the gradient with respect to θ in matrix form is

\[\frac{\partial}{\partial \theta}L(\theta_0,\theta_1,\dots,\theta_n)=\frac{1}{m}X^T(X\theta-y) \]

so each gradient descent step updates

\[\theta := \theta-\frac{\alpha}{m}X^T(X\theta-y) \]

(1) Implement the cost function, h(x), and the gradient descent routine:

def hypothesis(X, theta):
    # Linear hypothesis h(x) = X @ theta (X already includes the bias column)
    return np.dot(X, theta)

def cost_function(X, y, theta):
    # Mean squared error cost J = (1/2m) * sum((h(x) - y)^2)
    m = len(y)
    J = np.sum((hypothesis(X, theta) - y) ** 2) / (2 * m)
    return J

def gradient_descent(X, y, theta, alpha, num_iters):
    m = len(y)
    J_history = np.zeros(num_iters)
    for i in range(num_iters):
        # theta := theta - (alpha/m) * X^T (X theta - y)
        theta = theta - (alpha / m) * np.dot(X.T, (hypothesis(X, theta) - y))
        J_history[i] = cost_function(X, y, theta)
    return theta, J_history

(2) Prepend the bias column to the input X and run the computation:

X_train_hat = np.hstack((np.ones((X_train.shape[0], 1)), X_train))  # add bias column of ones
X_train = X_train_hat
theta = np.ones(X_train.shape[1])

alpha = 0.0005
num_iters = 120

theta, J_history = gradient_descent(X_train, y_train, theta, alpha, num_iters)

print("Parameter vector: ", theta)
print("Final cost: ", J_history[-1])

2. For the logistic regression:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

train_data = np.loadtxt("iris/iris-train.txt", delimiter="\t")
X_train = train_data[:, 0:4]
y_train = train_data[:, 4]
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)  # Convert class labels to 0, 1, 2
X_train_hat = np.hstack((np.ones((X_train.shape[0], 1)), X_train))  # add bias column
X_train = X_train_hat

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_function(X, y, theta):
    m = len(y)
    epsilon = 1e-5  # guard against log(0)
    h = sigmoid(np.dot(X, theta))
    J = (-1 / m) * np.sum(y * np.log(h + epsilon) + (1 - y) * np.log(1 - h + epsilon))
    grad = (1 / m) * np.dot(X.T, (h - y))  # gradient of the cross-entropy cost
    return J, grad

def gradient_descent(X, y, theta, alpha, num_iters):
    m = len(y)
    J_history = np.zeros(num_iters)
    for i in range(num_iters):
        J, grad = cost_function(X, y, theta)
        theta = theta - (alpha * grad)
        J_history[i] = J
    return theta, J_history

def logistic_regression(X, y, alpha, num_iters):
    # One-vs-Rest: train one binary classifier per class
    theta = np.ones((X.shape[1], 3))
    for i in range(3):
        y_i = (y == i).astype(int)  # current class vs. the rest
        theta_i, J_history_i = gradient_descent(X, y_i, theta[:, i], alpha, num_iters)

        plt.plot(J_history_i)
        plt.xlabel('Iterations')
        plt.ylabel('Cost')
        plt.title('Cost function ' + str(i))
        plt.show()

        theta[:, i] = theta_i
        print(f"Class {i} parameter vector: {theta_i}")
        print(f"Class {i} final cost: {J_history_i[-1]}")
    return theta

def predict(test_data, theta):
    X_test = test_data[:, 0:4]
    X_test_hat = np.hstack((np.ones((X_test.shape[0], 1)), X_test))  # add bias column
    X_test = X_test_hat
    y_test = test_data[:, 4]
    y_test = encoder.transform(y_test)  # Convert class labels to 0, 1, 2
    y_pred = np.zeros(X_test.shape[0])
    for i in range(3):
        sigmoid_outputs = sigmoid(np.dot(X_test, theta[:, i]))
        # Use each class's median sigmoid output as its threshold instead of a
        # fixed 0.5 (note: this peeks at the test labels)
        threshold = np.median(sigmoid_outputs[y_test == i])
        y_pred_i = (sigmoid_outputs >= threshold).astype(int)
        y_pred[y_pred_i == 1] = i
    y_pred = encoder.inverse_transform(y_pred.astype(int))  # Convert labels back to original values
    print(classification_report(y_test, y_pred))

test_data = np.loadtxt("iris/iris-test.txt", delimiter="\t")
theta = logistic_regression(X_train, y_train, 0.0005, 5000)
predict(test_data, theta)

Experimental Results and Analysis

1. Results of the multivariate regression:

[figure: printed parameter vector and final cost]

Cost curve:

[figure: cost curve over iterations]

The fitted function is therefore:

\[y=-0.306x_1+0.0653x_2+0.280x_3+0.697x_4-0.713 \]

2. Results of the logistic regression

Cost curves:

[figures: cost curves for the three one-vs-rest classifiers]

Using the test data iris-test.txt, the logistic regression ACC, Precision, and Recall on the test set are reported below.

Logistic regression via the sigmoid function only handles binary classification, so I implemented the three-class problem as three separate binary classifiers. [Conjecture: when predicting class 1, lumping classes 0 and 2 into a single "rest" class may mislead the model.]

Since the implementation is One-vs-Rest, the results are not great. Moreover, the numbers below were obtained only after replacing the usual 0.5 threshold when post-processing the sigmoid outputs, which arguably leaks information from the test set.

Instead of a single fixed threshold, the median of each class's h(x) outputs is used as that class's threshold.
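
A more standard alternative, sketched below (not what this write-up used), is to score each test point under all three classifiers and pick the class with the largest sigmoid output. This needs no threshold at all and never looks at the test labels; it reuses sigmoid() and theta, with X_test being the bias-augmented test matrix built inside predict():

# Hedged sketch: argmax over the three one-vs-rest sigmoid scores
def predict_argmax(X_test, theta):
    scores = sigmoid(np.dot(X_test, theta))  # shape (n_samples, 3), one column per class
    return np.argmax(scores, axis=1)         # highest-scoring class wins

Unlike per-class thresholding, argmax assigns every point to exactly one class, so no point ends up in several classes or in none.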

[figure: classification_report output table]

V. Exercise

Give a Python implementation of RANSAC line fitting.

import random
import numpy as np
from matplotlib import pyplot as plt


def ransac_line_fit(data, n_iterations, threshold):
    """
    RANSAC algorithm for linear regression.

    Parameters:
    data (list): list of tuples representing the data points
    n_iterations (int): number of iterations to run RANSAC
    threshold (float): maximum distance a point can be from the line to be considered an inlier

    Returns:
    tuple: slope and y-intercept of the best fit line
    """
    best_slope, best_intercept = None, None
    best_inliers = []

    for i in range(n_iterations):
        # Randomly select two distinct points from the data
        sample = random.sample(data, 2)
        x1, y1 = sample[0]
        x2, y2 = sample[1]
        if x1 == x2:
            continue  # slope undefined for a vertical pair; resample

        # Calculate slope and y-intercept of the line through the two points
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1

        # Find inliers within threshold distance of the line
        inliers = []
        outliers = []
        for point in data:
            x, y = point
            # Perpendicular distance from the point to the line
            distance = abs(y - (slope * x + intercept)) / np.sqrt(slope ** 2 + 1)
            if distance <= threshold:
                inliers.append(point)
            else:
                outliers.append(point)

        # If the number of inliers is greater than the current best, update the best fit line
        if len(inliers) > len(best_inliers):
            best_slope = slope
            best_intercept = intercept
            best_inliers = inliers

    outliers = [point for point in data if point not in best_inliers]
    # Plot inliers, outliers, the best-fit line, and the threshold band
    fig, ax = plt.subplots()
    ax.scatter([x for x, y in best_inliers], [y for x, y in best_inliers], color='green')
    ax.scatter([x for x, y in outliers], [y for x, y in outliers], color='black')
    x_vals = np.array([-5, 5])
    y_vals = best_slope * x_vals + best_intercept
    ax.plot(x_vals, y_vals, '-', color='red')
    # Band at vertical offset threshold*sqrt(slope^2+1) marks perpendicular distance = threshold
    band = threshold * np.sqrt(best_slope ** 2 + 1)
    ax.plot(x_vals, y_vals + band, '--', color='blue')
    ax.plot(x_vals, y_vals - band, '--', color='blue')
    ax.set_ylim([-6, 6])
    plt.show()
    plt.show()

    return best_slope, best_intercept


# Generate 10 random points with x and y values between -5 and 5
data = [(x, y) for x, y in zip(np.random.uniform(-5, 5, 10), np.random.uniform(-5, 5, 10))]

print(data)

# Fit a line to the data using RANSAC
slope, intercept = ransac_line_fit(data, 10000, 1)
print(slope, intercept)
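
For reference, the inlier test above uses the standard perpendicular distance from a point (x0, y0) to the line y = mx + b:

\[d=\frac{\lvert y_0-(m x_0+b)\rvert}{\sqrt{m^2+1}} \]

which is why the dashed band is drawn at a vertical offset of threshold * sqrt(slope² + 1): the points inside the band are exactly those within perpendicular distance threshold of the fitted line.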

Result: [figure: inliers (green), outliers (black), fitted line (red), threshold band (dashed blue)]
