Introduction to Decision Trees
An Overview of Random Forest
The Clash of Random Forest and Decision Tree (in Code!)
Why Did Random Forest Outperform a Decision Tree?
Decision Tree vs. Random Forest: When Should You Choose Which Algorithm?
Tree-Based Algorithms: A Complete Tutorial from Scratch (in R & Python)
https://www.analyticsvidhya.com/blog/2016/04/tree-based-algorithms-complete-tutorial-scratch-in-python/?utm_source=blog&utm_medium=decision-tree-vs-random-forest-algorithm
Getting Started with Decision Trees (Free Course)
https://courses.analyticsvidhya.com/courses/getting-started-with-decision-trees?utm_source=blog&utm_medium=decision-tree-vs-random-forest-algorithm
Note: The idea behind this article is to compare decision trees and random forests. Therefore, I will not go into the details of the basic concepts, but I will provide the relevant links so you can explore them further.
Building a Random Forest from Scratch & Understanding Real-World Data Products
https://www.analyticsvidhya.com/blog/2018/12/building-a-random-forest-from-scratch-understanding-real-world-data-products-ml-for-programmers-part-3/?utm_source=blog&utm_medium=decision-tree-vs-random-forest-algorithm
Random Forest Hyperparameter Tuning: A Beginner's Guide
https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/?utm_source=blog&utm_medium=decision-tree-vs-random-forest-algorithm
A Comprehensive Guide to Ensemble Learning (with Python codes)
https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/?utm_source=blog&utm_medium=decision-tree-vs-random-forest-algorithm
How to Build Ensemble Models in Machine Learning? (with R code)
Note: You can head over to the DataHack platform (https://datahack.analyticsvidhya.com/contest/all/?utm_source=blog&utm_medium=decision-tree-vs-random-forest-algorithm), compete against others in various online machine learning competitions, and get a chance to win exciting prizes.
Practical Guide on Data Preprocessing in Python using Scikit-Learn
https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/?utm_source=blog&utm_medium=decision-tree-vs-random-forest-algorithm
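The preprocessing code below operates on a dataframe called df; the loading step does not appear in this excerpt. Here is a minimal sketch, assuming the loan prediction dataset has been downloaded as a CSV (the file name train.csv is only a placeholder):

import pandas as pd

# Load the loan prediction dataset into a dataframe named df
# ('train.csv' is a placeholder path; point it at wherever you saved the data)
df = pd.read_csv('train.csv')

# A quick look at the data and its missing values before preprocessing
print(df.shape)
print(df.isnull().sum())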
# Data Preprocessing and null values imputation
# Label Encoding of the categorical variables
df['Gender']=df['Gender'].map({'Male':1,'Female':0})
df['Married']=df['Married'].map({'Yes':1,'No':0})
df['Education']=df['Education'].map({'Graduate':1,'Not Graduate':0})
df['Dependents'].replace('3+',3,inplace=True)
df['Self_Employed']=df['Self_Employed'].map({'Yes':1,'No':0})
df['Property_Area']=df['Property_Area'].map({'Semiurban':1,'Urban':2,'Rural':3})
df['Loan_Status']=df['Loan_Status'].map({'Y':1,'N':0})

# Null Value Imputation: fill categorical columns with the mode
# and numerical columns with the mean
rev_null=['Gender','Married','Dependents','Self_Employed','Credit_History','LoanAmount','Loan_Amount_Term']
df[rev_null]=df[rev_null].fillna({'Gender':df['Gender'].mode()[0],
                                  'Married':df['Married'].mode()[0],
                                  'Dependents':df['Dependents'].mode()[0],
                                  'Self_Employed':df['Self_Employed'].mode()[0],
                                  'Credit_History':df['Credit_History'].mode()[0],
                                  'LoanAmount':df['LoanAmount'].mean(),
                                  'Loan_Amount_Term':df['Loan_Amount_Term'].mean()})
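As a quick sanity check (not part of the original snippet), we can confirm that the imputed columns no longer contain missing values:

# Verify that the imputation left no nulls behind in these columns
print(df[rev_null].isnull().sum())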
Step 3: Create the Train and Test Sets
Now, let's split the dataset into training and test sets in an 80:20 ratio:
# Separate the features (X) and the target (Y)
X=df.drop(columns=['Loan_ID','Loan_Status']).values
Y=df['Loan_Status'].values

# 80:20 train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)
Let's take a quick look at the shape of the training and test sets we just created:
print('Shape of X_train=>',X_train.shape)
print('Shape of X_test=>',X_test.shape)
print('Shape of Y_train=>',Y_train.shape)
print('Shape of Y_test=>',Y_test.shape)
# Building Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion = 'entropy', random_state = 42)
dt.fit(X_train, Y_train)
dt_pred_train = dt.predict(X_train)
11 Important Model Evaluation Error Metrics
https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/?utm_source=blog&utm_medium=decision-tree-vs-random-forest-algorithm
# Evaluation on Training set
from sklearn.metrics import f1_score
dt_pred_train = dt.predict(X_train)
print('Training Set Evaluation F1-Score=>',f1_score(Y_train,dt_pred_train))
# Evaluating on Test set
dt_pred_test = dt.predict(X_test)
print('Testing Set Evaluation F1-Score=>',f1_score(Y_test,dt_pred_test))
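A training F1-score that is much higher than the test F1-score is the classic sign of overfitting. One way to see why the unconstrained tree overfits is to check how large it has grown. The sketch below is not part of the original code and continues from the objects defined above; max_depth=3 is an arbitrary value chosen purely for illustration:

# Inspect how large the fully grown tree is
print('Tree depth =>', dt.get_depth())
print('Number of leaves =>', dt.get_n_leaves())

# A depth-limited tree usually gives up some training accuracy
# in exchange for better generalization (max_depth=3 is arbitrary here)
dt_shallow = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42)
dt_shallow.fit(X_train, Y_train)
print('Shallow Tree - Training F1 =>', f1_score(Y_train, dt_shallow.predict(X_train)))
print('Shallow Tree - Testing F1 =>', f1_score(Y_test, dt_shallow.predict(X_test)))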
# Building Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(criterion = 'entropy', random_state = 42)
rfc.fit(X_train, Y_train)
# Evaluating on Training set
rfc_pred_train = rfc.predict(X_train)
print('Training Set Evaluation F1-Score=>',f1_score(Y_train,rfc_pred_train))
# Evaluating on Test set
rfc_pred_test = rfc.predict(X_test)
print('Testing Set Evaluation F1-Score=>',f1_score(Y_test,rfc_pred_test))
Why Did Our Random Forest Model Outperform the Decision Tree?
# Comparing the feature importances of the two models
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

feature_importance=pd.DataFrame({
    'rfc':rfc.feature_importances_,
    'dt':dt.feature_importances_
},index=df.drop(columns=['Loan_ID','Loan_Status']).columns)
feature_importance.sort_values(by='rfc',ascending=True,inplace=True)

# Grouped horizontal bar chart: one bar per model for each feature
index = np.arange(len(feature_importance))
fig, ax = plt.subplots(figsize=(18,8))
rfc_feature=ax.barh(index,feature_importance['rfc'],0.4,color='purple',label='Random Forest')
dt_feature=ax.barh(index+0.4,feature_importance['dt'],0.4,color='lightgreen',label='Decision Tree')
ax.set(yticks=index+0.4,yticklabels=feature_importance.index)
ax.legend()
plt.show()
sklearn.ensemble.BaggingClassifier (scikit-learn documentation)
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
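The link above points to scikit-learn's BaggingClassifier. Conceptually, a random forest is a bagged ensemble of decision trees with one extra twist: each tree is trained on a bootstrap sample of the data and only considers a random subset of features at every split, which lets the ensemble average away much of the variance of individual, overfit trees. As a point of comparison, the sketch below (not from the original article) builds a plain bagged ensemble of trees on the same data:

# BaggingClassifier uses decision trees as its base estimator by default,
# so this trains 100 trees, each on a bootstrap sample of the training set
# (unlike RandomForestClassifier, there is no per-split feature subsampling)
from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(n_estimators=100, random_state=42)
bag.fit(X_train, Y_train)
print('Bagged Trees - Testing F1 =>', f1_score(Y_test, bag.predict(X_test)))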
Decoding the Black Box: A Step-by-Step Guide to Interpretable Machine Learning Models in Python
https://www.analyticsvidhya.com/blog/2019/08/decoding-black-box-step-by-step-guide-interpretable-machine-learning-models-python/?utm_source=blog&utm_medium=decision-tree-vs-random-forest-algorithm
Original title:
Decision Tree vs. Random Forest – Which Algorithm Should you Use?
Original link:
https://www.analyticsvidhya.com/blog/2020/05/decision-tree-vs-random-forest-algorithm/