Using Python's sklearn to Run the Logistic Regression Algorithm

Posted by bigface1234fdfg on 2015-01-21


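    Let's start with how to do it. I used to be unclear about how importing a library relates to its classes and methods, but now I understand it.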


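from sklearn.datasets import load_iris     # import the bundled datasets

# load the dataset: iris
iris = load_iris()
samples = iris.data      # feature matrix, one row per flower
# print(samples)
target = iris.target     # class label of each row

# import the LogisticRegression class
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()  # instantiate the class; every parameter keeps its default
classifier.fit(samples, target)    # learn from the training data; the return value is not needed

# predict expects a 2D array of shape (n_samples, n_features),
# so the single test sample is wrapped in an outer list
x = classifier.predict([[5, 3, 5, 2.5]])  # test data; returns the predicted class label

print(x)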

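# What we actually import is a class from sklearn.linear_model: LogisticRegression. It has many methods.
# The most commonly used are fit (train the classification model) and predict (predict the labels of test samples).

# No method returns the weight vector w learned by the LR model, which felt like a shortcoming to me.
# (The learned weights are stored in the coef_ attribute described in the documentation quoted further down.)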


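    The call used above,


classifier = LogisticRegression()  # instantiate the class; every parameter keeps its default

relies entirely on defaults: every parameter takes its default value. In practice we can set many of them ourselves. To do so we need the official parameter documentation, reproduced here: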

sklearn.linear_model.LogisticRegression

class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None)

Logistic Regression (aka logit, MaxEnt) classifier.

In the multiclass case, the training algorithm uses a one-vs.-all (OvA) scheme, rather than the “true” multinomial LR.
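The one-vs.-all scheme mentioned above can be made explicit. As a rough sketch, assuming a recent scikit-learn release: the `OneVsRestClassifier` wrapper from `sklearn.multiclass` (not mentioned in the original post) trains one binary logistic regression per class. The `max_iter=1000` setting is my own assumption so that the default solver converges on iris.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()

# One binary LR model per class: "this class" vs. "all the others".
# max_iter=1000 is an assumed setting, not taken from the post.
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ova.fit(iris.data, iris.target)

n_binary_models = len(ova.estimators_)             # one fitted estimator per class
per_class_shape = ova.estimators_[0].coef_.shape   # each binary model has a single weight row

print(n_binary_models, per_class_shape)
```

Iris has three classes and four features, so this gives three binary models, each with one row of four weights.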

This class implements L1 and L2 regularized logistic regression using the liblinear library. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).

Parameters:

penalty : string, ‘l1’ or ‘l2’    (the type of penalty term)

Used to specify the norm used in the penalization.

dual : boolean

Dual or primal formulation. Dual formulation is only implemented for l2 penalty. Prefer dual=False when n_samples > n_features.

C : float, optional (default=1.0)

Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

fit_intercept : bool, default: True

Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

intercept_scaling : float, default: 1

When self.fit_intercept is True, the instance vector x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic feature weight. Note: the synthetic feature weight is subject to l1/l2 regularization like all other features. To lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept), intercept_scaling has to be increased.

class_weight : {dict, ‘auto’}, optional    (accounts for class imbalance, similar to cost-sensitive learning)

Over-/undersamples the samples of each class according to the given weights. If not given, all classes are supposed to have weight one. The ‘auto’ mode selects weights inversely proportional to class frequencies in the training set.

random_state : int seed, RandomState instance, or None (default)

The seed of the pseudo random number generator to use when shuffling the data.

tol : float, optional

Tolerance for stopping criteria.
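To illustrate setting parameters by hand instead of relying on the defaults, here is a small sketch, assuming a recent scikit-learn release. It sets only `C`, `tol` and `fit_intercept` from the list above. `max_iter=5000` is an extra setting of mine (it is not in the quoted parameter list) so that the solver has room to converge. It compares a strongly regularized model (small `C`) with a weakly regularized one (large `C`).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Smaller C means stronger regularization, so the learned weights shrink.
# max_iter=5000 is an assumed setting, not part of the quoted parameter list.
strong = LogisticRegression(C=0.01, tol=1e-4, fit_intercept=True, max_iter=5000)
weak = LogisticRegression(C=100.0, tol=1e-4, fit_intercept=True, max_iter=5000)
strong.fit(X, y)
weak.fit(X, y)

strong_norm = np.linalg.norm(strong.coef_)   # overall size of the weights, C=0.01
weak_norm = np.linalg.norm(weak.coef_)       # overall size of the weights, C=100
print(strong_norm, weak_norm)
```

The first number should come out smaller than the second, which matches the statement that smaller values of C specify stronger regularization.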

Attributes:

`coef_` : array, shape = [n_classes, n_features]

Coefficient of the features in the decision function.

coef_ is a read-only property derived from raw_coef_ that follows the internal memory layout of liblinear.

`intercept_` : array, shape = [n_classes]

Intercept (a.k.a. bias) added to the decision function. If fit_intercept is set to False, the intercept is set to zero.
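The two attributes above are where the fitted model keeps what it has learned. A minimal sketch of reading them after `fit`, assuming a recent scikit-learn release (`max_iter=1000` and the variable names are my own choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression(max_iter=1000)   # max_iter is an assumed setting
clf.fit(iris.data, iris.target)

weights = clf.coef_        # one row of feature weights per class
bias = clf.intercept_      # one intercept per class
print(weights.shape, bias.shape)
print(weights)
```

For iris (three classes, four features) the weight array has three rows of four values, and there are three intercepts.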


The LogisticRegression class provides the following methods. The ones we use most often are fit and predict.


Methods

decision_function(X)              Predict confidence scores for samples.
densify()                         Convert coefficient matrix to dense array format.
fit(X, y)                         Fit the model according to the given training data. (Trains the LR classifier; X holds the training samples and y is the corresponding label vector.)
fit_transform(X[, y])             Fit to data, then transform it.
get_params([deep])                Get parameters for this estimator.
predict(X)                        Predict class labels for samples in X. (Predicts the labels of test samples, i.e. classifies them; X is the set of test samples.)
predict_log_proba(X)              Log of probability estimates.
predict_proba(X)                  Probability estimates.
score(X, y[, sample_weight])      Returns the mean accuracy on the given test data and labels.
set_params(**params)              Set the parameters of this estimator.
sparsify()                        Convert coefficient matrix to sparse format.
transform(X[, threshold])         Reduce X to its most important features.
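A few of the methods listed above can be tried on the same iris data. This is only a sketch, assuming a recent scikit-learn release. It uses `predict`, `predict_proba`, `decision_function` and `score`; the variable names and `max_iter=1000` are my own choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target
clf = LogisticRegression(max_iter=1000).fit(X, y)   # max_iter is an assumed setting

sample = [[5, 3, 5, 2.5]]                 # the same test point used earlier in the post
label = clf.predict(sample)               # predicted class label
proba = clf.predict_proba(sample)         # one probability per class, summing to 1
scores = clf.decision_function(sample)    # raw confidence score per class
accuracy = clf.score(X, y)                # mean accuracy, here on the training data

print(label, proba, scores, accuracy)
```

The predicted label corresponds to the class with the highest probability. Note that `score` is evaluated on the training data here, so it is an optimistic figure.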

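predict returns the label vector for the test samples. Personally, I felt the class should also expose the most important learned quantity of an LR classifier: the weight vector, whose size should equal the number of features. I could not find a method for this (the weights are in fact stored in the coef_ attribute listed above, not returned by a method), which gave me the idea of implementing the LR algorithm myself so that I could output the weight vector directly.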
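As a rough illustration of the idea of implementing LR by hand so that the weight vector is directly available, here is a minimal sketch of binary logistic regression trained by batch gradient descent with NumPy. It is not the author's implementation. The function names `sigmoid` and `fit_logistic`, the learning rate, the iteration count and the toy data are all hypothetical choices made for illustration, and the sketch leaves out the regularization that sklearn applies.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the mean log-loss; returns (w, b)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)   # one weight per feature
    b = 0.0                    # intercept
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)                 # predicted P(y=1 | x)
        grad_w = X.T @ (p - y) / n_samples     # gradient of the loss w.r.t. w
        grad_b = np.mean(p - y)                # gradient of the loss w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy, hand-made data: class 1 has larger feature values than class 0.
X = np.array([[0.0, 0.5], [0.5, 0.0], [1.0, 1.0],
              [3.0, 3.5], [3.5, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = fit_logistic(X, y)
pred = (sigmoid(X @ w + b) >= 0.5).astype(int)   # threshold probabilities at 0.5
print("weights:", w, "intercept:", b)
```

The returned `w` has one entry per feature, which is the size the weight vector is expected to have.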


Reference links:


http://www.cnblogs.com/xupeizhi/archive/2013/07/05/3174703.html


http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression





