Copyright notice: This technical column is the author's (Qin Kaixin) summary and distillation of his day-to-day work. The cases are drawn from real business environments, together with tuning advice for commercial applications and cluster capacity planning. Please keep following this blog. Looking forward to joining the most combat-ready team of the IoT era. QQ email: 1120746959@qq.com; feel free to contact me for any academic exchange.
1 Data Preprocessing
- Dataset introduction
```python
import pandas  # ipython notebook

titanic = pandas.read_csv("C:\\ML\\MLData\\titanic_train.csv")

# Columns after loading (pandas also adds a row index 0, 1, 2 ...):
# PassengerId  passenger ID
# Survived     survival (0 = no, 1 = yes)
# Pclass       cabin class (socio-economic status)
# Name         name
# Sex          sex
# Age          age
# SibSp        number of siblings / spouses aboard
# Parch        number of parents / children aboard
# Ticket       ticket number
# Fare         fare
# Cabin        cabin number
# Embarked     port of embarkation
titanic.head(3)
```
- The count for Age is 714, smaller than 891, which means the column has missing values.
```python
print(titanic.describe())
```

```
       PassengerId    Survived      Pclass         Age       SibSp
count   891.000000  891.000000  891.000000  714.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008
std     257.353842    0.486592    0.836071   14.526497    1.102743
min       1.000000    0.000000    1.000000    0.420000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000
50%     446.000000    0.000000    3.000000   28.000000    0.000000
75%     668.500000    1.000000    3.000000   38.000000    1.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000

            Parch        Fare
count  891.000000  891.000000
mean     0.381594   32.204208
std      0.806057   49.693429
min      0.000000    0.000000
25%      0.000000    7.910400
50%      0.000000   14.454200
75%      0.000000   31.000000
max      6.000000  512.329200
```
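A quicker way to see the same thing (a small sketch, assuming the `titanic` DataFrame loaded above) is to count the missing entries per column directly:

```python
# Age should show 891 - 714 = 177 missing values; Cabin and Embarked also have gaps
print(titanic.isnull().sum())
```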
- In the data, Sex is a categorical string, either male or female. To make this column easier for the model to work with, we convert it to a numeric value.
```python
# Numericize with the map method and change the dtype of the pandas column
titanic['Sex'] = titanic['Sex'].map({'female': 1, 'male': 0}).astype(int)
```
- Fill in missing values (using the median)
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median()) print (titanic.describe()) PassengerId Survived Pclass Age SibSp \ count 891.000000 891.000000 891.000000 891.000000 891.000000 mean 446.000000 0.383838 2.308642 29.361582 0.523008 std 257.353842 0.486592 0.836071 13.019697 1.102743 min 1.000000 0.000000 1.000000 0.420000 0.000000 25% 223.500000 0.000000 2.000000 22.000000 0.000000 50% 446.000000 0.000000 3.000000 28.000000 0.000000 75% 668.500000 1.000000 3.000000 35.000000 1.000000 max 891.000000 1.000000 3.000000 80.000000 8.000000 Parch Fare count 891.000000 891.000000 mean 0.381594 32.204208 std 0.806057 49.693429 min 0.000000 0.000000 25% 0.000000 7.910400 50% 0.000000 14.454200 75% 0.000000 31.000000 max 6.000000 512.329200 複製程式碼
- Converting the Sex string values (locate the matching rows, then replace)
print (titanic["Sex"].unique()) # Replace all the occurences of male with the number 0. titanic.loc[titanic["Sex"] == "male", "Sex"] = 0 titanic.loc[titanic["Sex"] == "female", "Sex"] = 1 複製程式碼
- Converting the Embarked (port of embarkation) string values (locate the matching rows, then replace)
print (titanic["Embarked"].unique()) titanic["Embarked"] = titanic["Embarked"].fillna('S') titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0 titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1 titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2 複製程式碼
- Linear regression test
```python
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation
from sklearn.model_selection import KFold
import numpy as np

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset. It returns the row indices
# corresponding to train and test. We set random_state to ensure we get the same splits
# every time we run this.
kf = KFold(n_splits=3, random_state=1, shuffle=False)

predictions = []
for train, test in kf.split(titanic):
    # The predictors we're using to train the algorithm. Note how we only take the rows in the train folds.
    train_predictors = titanic[predictors].iloc[train, :]
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)

# The predictions are in three separate numpy arrays. Concatenate them into one.
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0

# Accuracy is the fraction of predictions that match the true labels.
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)
print(accuracy)
```
- Linear regression cross-validation test
```python
from sklearn.model_selection import cross_val_score

# Note: for a regressor, cross_val_score reports R^2 by default, not classification accuracy.
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores)
```
- Logistic regression test
```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
# => 0.7878787878787877
```
- Random forest test
```python
import pandas  # ipython notebook
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

titanic_test = pandas.read_csv("C:\\ML\\MLData\\titanic_train.csv")
titanic_test["Age"] = titanic_test["Age"].fillna(titanic_test["Age"].median())
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")
titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm with the default parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=2, min_samples_leaf=1)
# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)
kf = KFold(n_splits=3, random_state=1, shuffle=False)
scores = cross_val_score(alg, titanic_test[predictors], titanic_test["Survived"], cv=kf)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
# => 0.7901234567901234
```
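To see which of the seven columns the forest actually relies on, a fitted `RandomForestClassifier` exposes `feature_importances_`. A small sketch (not part of the original notebook) that refits on the full frame, since `cross_val_score` does not keep its fitted estimators:

```python
# Refit on the full training frame just to inspect the importances
alg.fit(titanic_test[predictors], titanic_test["Survived"])
for name, importance in zip(predictors, alg.feature_importances_):
    print("{:10s} {:.3f}".format(name, importance))
```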
- Feature engineering (adding FamilySize, NameLength and Title)
```python
# Generating a familysize column
titanic_test["FamilySize"] = titanic_test["SibSp"] + titanic_test["Parch"]

# The .apply method generates a new series
titanic_test["NameLength"] = titanic_test["Name"].apply(lambda x: len(x))

import re

# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title. Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles and print how often each one occurs.
titles = titanic_test["Name"].apply(get_title)
print(pandas.value_counts(titles))

# Map each title to an integer. Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k, v in title_mapping.items():
    titles[titles == k] = v

# Verify that we converted everything.
print(pandas.value_counts(titles))

# Add in the title column.
titanic_test["Title"] = titles
```

```
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Major         2
Col           2
Mlle          2
Don           1
Capt          1
Ms            1
Jonkheer      1
Countess      1
Sir           1
Mme           1
Lady          1
```
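A quick sanity check on why Title is worth keeping (a sketch, assuming the `titanic_test` frame built above): the survival rate differs sharply between the title codes.

```python
# Mean survival rate per title code (1 = Mr, 2 = Miss/Ms, 3 = Mrs, 4 = Master, ...)
print(titanic_test.groupby("Title")["Survived"].mean())
```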
- Feature correlation analysis (feature correlations)

The correlation coefficient can be computed with three methods (pearson, kendall, spearman). We won't expand on them here; any of them measures how strongly two columns are related. The helper below draws the correlation matrix as a heat map with the coefficient printed in each cell.
```python
import numpy as np
import matplotlib.pyplot as plt

def plot_corr(df, size=10):
    '''Function plots a graphical correlation matrix for each pair of columns in the dataframe.

    Input:
        df: pandas DataFrame
        size: vertical and horizontal size of the plot'''
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    for (i, j), z in np.ndenumerate(corr):
        ax.text(j, i, '{:.2f}'.format(z), ha='center', va='center')
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)

# Feature correlation chart (df is the numeric DataFrame you want to inspect)
plot_corr(df)
```
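`df.corr()` uses Pearson by default; the other two coefficients are selected with the `method` argument. A short usage sketch, assuming `df` is built from the numeric Titanic columns prepared earlier:

```python
# Correlation of each predictor with the label, under the three methods
df = titanic_test[predictors + ["Survived"]].astype(float)
print(df.corr(method="pearson")["Survived"])   # linear correlation
print(df.corr(method="spearman")["Survived"])  # rank correlation
print(df.corr(method="kendall")["Survived"])   # ordinal association
```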
- Multi-feature random forest test (adding training features)
```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "NameLength"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic_test[predictors], titanic_test["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]

# Initialize our algorithm with the default parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=2, min_samples_leaf=1)
# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)
kf = KFold(n_splits=3, random_state=1, shuffle=False)
scores = cross_val_score(alg, titanic_test[predictors], titanic_test["Survived"], cv=kf)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
# => 0.7979797979797979
```
2 Mathematical Principles (which feature gets to be the root?)
- Decision tree example
- Probability distributions of the different features
- Total entropy before any split: with 9 samples that play and 5 that don't, the overall entropy is 0.94.
- Once outlook is known, each branch has its own entropy (sunny = 0.971, overcast = 0, rainy = 0.971); only after combining the branches do we get the entropy of outlook as a whole.
- Each outlook branch also occurs with a probability (sunny = 5/14, overcast = 4/14, rainy = 5/14); weighting the branch entropies by these probabilities gives the conditional entropy of outlook.
- Choose the feature whose split reduces the entropy the most, i.e. the one with the largest information gain (a worked calculation follows below).
- Drawback of ID3's information gain: it is biased toward features with many distinct values, even when each branch then carries only a few samples and little real variation.
- The C4.5 algorithm handles continuous values better (and uses the information gain ratio rather than the raw gain).
- Pre-pruning: the evaluation function C_α(T) = C(T) + α·|T_leaf| says that the more leaf nodes a tree has, the larger the loss, so we try to keep the number of leaves small.
- Random forest
- Decision tree parameter tuning
```python
from sklearn.tree import DecisionTreeClassifier

# 1. criterion: gini or entropy (split quality measured by the Gini index or by entropy).
# 2. splitter: best or random. "best" searches all features for the best split point,
#    "random" searches only a subset of features (useful when the data set is large).
# 3. max_features: None (all), log2, sqrt or an integer N. With fewer than ~50 features,
#    using all of them is usually fine.
# 4. max_depth: can be left unset when the data or feature count is small; with many samples
#    and many features, try limiting the depth.
# 5. min_samples_split: a node with fewer samples than this will not be split further.
#    Not important for small data sets; increase it when the sample size is very large.
# 6. min_samples_leaf: the minimum number of samples in a leaf. A leaf with fewer samples is
#    pruned together with its sibling. Ignore it for small data sets; for ~100k samples try 5.
# 7. min_weight_fraction_leaf: the minimum weighted fraction of samples required in a leaf,
#    otherwise the leaf is pruned with its sibling. Default 0, i.e. weights are ignored.
#    Worth setting when many samples have missing values or the class distribution is very
#    skewed, since sample weights come into play in those cases.
# 8. max_leaf_nodes: limiting the maximum number of leaves prevents overfitting. Default
#    "None" (no limit). With a limit, the algorithm builds the best tree within that budget.
#    Not needed with few features; consider it when there are many, and pick the value by
#    cross validation.
# 9. class_weight: per-class sample weights, mainly to keep classes with many samples from
#    dominating the tree. You can set the weights yourself, or pass "balanced" to let the
#    algorithm compute them (classes with fewer samples get higher weights).
# 10. min_impurity_split: limits tree growth. A node whose impurity (Gini index, information
#     gain, mean squared error or mean absolute error) is below this threshold is not split
#     further and becomes a leaf.
decision_tree_classifier = DecisionTreeClassifier()

# Train the classifier on the training set
# (training_inputs/training_classes and testing_inputs/testing_classes stand for a
#  pre-split feature matrix and label vector)
decision_tree_classifier.fit(training_inputs, training_classes)

# Validate the classifier on the testing set using classification accuracy
decision_tree_classifier.score(testing_inputs, testing_classes)
```
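To pick these values instead of guessing, the usual tool is a grid search with cross validation. A sketch (not part of the original notebook), assuming the `titanic_test` frame and `predictors` list prepared earlier:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate values for the parameters discussed above
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 10, 50],
    "min_samples_leaf": [1, 5, 10],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=3)
grid.fit(titanic_test[predictors], titanic_test["Survived"])
print(grid.best_params_)
print(grid.best_score_)
```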
- Ensemble prediction (averaging a gradient boosting model and a logistic regression)
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np

# The algorithms we want to ensemble.
# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=50, max_depth=5),
     ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title"]],
    [LogisticRegression(random_state=1),
     ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

# Initialize the cross validation folds
kf = KFold(n_splits=3, random_state=1, shuffle=False)

predictions = []
for train, test in kf.split(titanic_test):
    train_target = titanic_test["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic_test[predictors].iloc[train, :], train_target)
        # Select and predict on the test fold.
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic_test[predictors].iloc[test, :].astype(float))[:, 1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)

# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)

# Compute accuracy by comparing to the training data.
accuracy = sum(predictions == titanic_test["Survived"]) / len(predictions)
print(accuracy)
```
Summary
sklearn has changed quite a lot recently, which caused a KFold incompatibility problem in the linear regression test. It has not been resolved yet and needs continued attention.
Qin Kaixin, Shenzhen, 201812090216