智算之道: 2020 Artificial Intelligence Application Challenge (Preliminary Round), Disease Prediction on Structured Data

Posted by 驚蟄Jingz on 2020-10-17

智算之道 (Preliminary Round) Disease Prediction (voting across three models: CatBoost, lgb, XGB)

I. Problem Description

Using structured data, predict and analyze whether a patient has hepatitis.

II. Models

  1. CatBoost
model1 = CatBoostClassifier(iterations=200
			   ,learning_rate=0.1
			   ,loss_function='Logloss')
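  2. XGB
model2 = XGBClassifier(learning_rate=0.1
		      ,n_estimators=1000 # number of trees
		      ,max_depth=6 # tree depth
		      ,min_child_weight=1 # minimum weight of a leaf node
		      ,gamma=0.2 # coefficient on the number of leaves in the penalty term
		      ,subsample=0.8 # randomly sample 80% of the rows to build each tree
		      ,colsample_bytree=0.8 # randomly sample 80% of the features to build each tree
		      ,objective='binary:logistic'
		      ,scale_pos_weight=1 # handles class imbalance
		      ,random_state=27 # random seed
		      )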
  3. lgb
model3 = lgb.LGBMClassifier(objective='binary' # binary classification; 'regression' does not suit a classifier
			   ,learning_rate=0.1
			   ,n_estimators=20
			   ,subsample=0.8
			   ,colsample_bytree=0.8
			   ,num_leaves=22
			   ,max_depth=5
			   ,min_child_samples=20
			   ,min_child_weight=0.00005
			   ,feature_fraction=0.6 # LightGBM alias of colsample_bytree
			   ,bagging_fraction=0.7 # LightGBM alias of subsample
			   )
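
Both `loss_function='Logloss'` (CatBoost) and `objective='binary:logistic'` (XGB) train on binary log loss applied to a sigmoid of the raw tree score. The block below is a minimal, dependency-free sketch of that computation, not code from either library; the names `sigmoid` and `log_loss` and the sample scores are illustrative assumptions.

```python
import math

def sigmoid(raw_score):
    """Map a raw additive tree score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw_score))

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; eps clips probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical raw scores for three patients and their true labels.
raw_scores = [2.0, -1.0, 0.0]
labels = [1, 0, 1]
probs = [sigmoid(s) for s in raw_scores]
print(probs, log_loss(labels, probs))
```

A confident wrong prediction is penalized much more heavily than an uncertain one, which is why the averaged probabilities, rather than hard labels, are what gets submitted later on.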

III. Code

## Show cell execution time
%load_ext klab-autotime
## Import packages (some are unused)
import os
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import warnings
from itertools import combinations
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import StratifiedKFold
from tqdm import tqdm
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
## Data processing

%matplotlib inline
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)

path = '/home/kesci/data/competition_A/'
train_df = pd.read_csv(path+'train_set.csv') 
test_df  = pd.read_csv(path+'test_set.csv') 

submission1  =  pd.read_csv(path+'submission_example.csv') 
submission2  =  pd.read_csv(path+'submission_example.csv') 
submission3  =  pd.read_csv(path+'submission_example.csv') 
submission  =  pd.read_csv(path+'submission_example.csv') 

print('Train Shape:{}\nTest Shape:{}'.format(train_df.shape,test_df.shape))
train_df.head()

num_columns = ['年齡','體重','身高','體重指數', '腰圍', '最高血壓', '最低血壓',
		'好膽固醇', '壞膽固醇', '總膽固醇','收入']
zero_to_one_columns = ['肥胖腰圍','血脂異常','PVD']
str_columns = ['性別','區域','體育活動','教育','未婚',
	'護理來源','視力不佳','飲酒','高血壓', '家庭高血壓', '糖尿病', '家族糖尿病','家族肝炎', '慢性疲勞','ALF']

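# Label-encode the string columns
for i in tqdm(str_columns):
	lbl = LabelEncoder()
	# Fit on train and test together so both sets share one mapping
	lbl.fit(pd.concat([train_df[i], test_df[i]]).astype(str))
	train_df[i] = lbl.transform(train_df[i].astype(str))
	test_df[i] = lbl.transform(test_df[i].astype(str))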

# Min-max normalization (fit on the training set, reuse the same scaling on the test set)
scaler = MinMaxScaler()
train_df[num_columns] = scaler.fit_transform(train_df[num_columns])
test_df[num_columns]  = scaler.transform(test_df[num_columns])

# Fill missing values
train_df.fillna(0,inplace=True)
test_df.fillna(0,inplace=True)

# Prepare the datasets
all_columns = [i for i in train_df.columns if i not in ['肝炎','ID']]
train_x,train_y = train_df[all_columns].values,train_df['肝炎'].values
test_x  = test_df[all_columns].values
submission1['hepatitis'] =0
submission2['hepatitis'] =0
submission3['hepatitis'] =0
kfold = StratifiedKFold(n_splits=5, shuffle=False)
## Training

model1 = CatBoostClassifier(iterations=200
      ,learning_rate=0.1
      ,loss_function='Logloss')

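model2 = XGBClassifier(learning_rate=0.1
        ,n_estimators=1000 # number of trees
        ,max_depth=6 # tree depth
        ,min_child_weight=1 # minimum weight of a leaf node
        ,gamma=0.2 # coefficient on the number of leaves in the penalty term
        ,subsample=0.8 # randomly sample 80% of the rows to build each tree
        ,colsample_bytree=0.8 # randomly sample 80% of the features to build each tree
        ,objective='binary:logistic'
        ,scale_pos_weight=1 # handles class imbalance
        ,random_state=27 # random seed
        )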
     
model3 = lgb.LGBMClassifier(objective='binary' # binary classification; 'regression' does not suit a classifier
			   ,learning_rate=0.1
			   ,n_estimators=20
			   ,subsample=0.8
			   ,colsample_bytree=0.8
			   ,num_leaves=22
			   ,max_depth=5
			   ,min_child_samples=20
			   ,min_child_weight=0.00005
			   ,feature_fraction=0.6 # LightGBM alias of colsample_bytree
			   ,bagging_fraction=0.7 # LightGBM alias of subsample
			   )

param_grid = {'learning_rate': [0.01, 0.1, 1], 'n_estimators': [20, 40]}
for train, valid in kfold.split(train_x, train_y):
    X_train, Y_train = train_x[train], train_y[train]
    X_valid, Y_valid = train_x[valid], train_y[valid]
    model1.fit(X_train, Y_train, eval_set=(X_valid, Y_valid), use_best_model=True)
    model2.fit(X_train, Y_train, eval_set=[(X_valid, Y_valid)], early_stopping_rounds=100, verbose=True)
    model3.fit(X_train, Y_train, eval_set=[(X_valid, Y_valid)], eval_metric='l1', early_stopping_rounds=5)

    Y_valid_pred_prob1 = model1.predict_proba(X_valid)
    Y_valid_pred_prob2 = model2.predict_proba(X_valid)

    # Accumulate each model's positive-class probability, averaged over the 5 folds
    submission1['hepatitis'] += model1.predict_proba(test_x)[:,1] / 5
    submission2['hepatitis'] += model2.predict_proba(test_x)[:,1] / 5
    # Assumption: model3 feeds submission3 the same way; the original line
    # used the undefined names submission5 and model
    submission3['hepatitis'] += model3.predict_proba(test_x)[:,1] / 5

    # Grid search for parameter tuning (kept in its own variable so model3 is not overwritten)
    estimator = lgb.LGBMRegressor(num_leaves=31)
    grid = GridSearchCV(estimator, param_grid)
    grid.fit(X_train, Y_train)
    print('Best parameters found by grid search are:', grid.best_params_)
## Voting across the three models
submission['hepatitis'] =0
submission['hepatitis'] = (submission1['hepatitis']+submission2['hepatitis']+submission3['hepatitis'])/3

# Write to file
submission.to_csv('pingfen.csv',index=False)

# Submit
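
The "voting" in the listing above is soft voting: each model's test-set probabilities are averaged over the folds (the `/ 5`), and the per-model averages are then combined with equal weight (the `/ 3`). The block below is a minimal sketch of that arithmetic using made-up numbers; `fold_average`, `soft_vote`, and all probabilities are hypothetical and only illustrate the idea.

```python
def fold_average(fold_predictions):
    """Average per-sample probabilities across cross-validation folds."""
    n_folds = len(fold_predictions)
    return [sum(col) / n_folds for col in zip(*fold_predictions)]

def soft_vote(model_predictions):
    """Equal-weight average of per-sample probabilities across models."""
    n_models = len(model_predictions)
    return [sum(col) / n_models for col in zip(*model_predictions)]

# Hypothetical data: 2 folds x 2 test samples for each of three models.
catboost_p = fold_average([[0.9, 0.2], [0.7, 0.4]])
xgb_p = fold_average([[0.8, 0.1], [0.6, 0.3]])
lgb_p = fold_average([[0.7, 0.3], [0.5, 0.5]])

final = soft_vote([catboost_p, xgb_p, lgb_p])
print(final)
```

Averaging probabilities keeps each model's confidence in the result, whereas majority voting on hard 0/1 labels would discard it.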

IV. Tuning Summary

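  1. I initially used 5 folds (KFold=5); I later saw in other people's posts that 10 folds (KFold=10) can improve accuracy.
  2. During early training I hand-picked the features I believed mattered most for hepatitis, but in the end feeding all features into training gave higher accuracy.
  3. In my tests, learning rate=0.1 worked best for these models.
  4. When a single model's accuracy stops improving, consider voting across several models that each perform reasonably well.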
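
On point 1: with k folds, each fold's model trains on (k-1)/k of the rows, so moving from 5 to 10 folds raises the training share from 80% to 90%. That may account for part of the gain, though this is an assumption and was not measured here. The sketch below deals rows to folds round-robin within each class, which approximates stratified splitting; `stratified_folds` is an illustrative helper, not the scikit-learn implementation, and the labels are made up.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Return a fold index per row, dealing each class round-robin across folds."""
    fold_of = [0] * len(labels)
    seen = defaultdict(int)  # rows seen so far, per class
    for row, label in enumerate(labels):
        fold_of[row] = seen[label] % n_splits
        seen[label] += 1
    return fold_of

# Hypothetical imbalanced labels: 20 negatives, 10 positives.
labels = [0] * 20 + [1] * 10
for k in (5, 10):
    folds = stratified_folds(labels, k)
    valid = [i for i, f in enumerate(folds) if f == 0]  # rows held out in fold 0
    positives = sum(labels[i] for i in valid)
    train_share = 1 - len(valid) / len(labels)
    print(k, len(valid), positives, round(train_share, 2))
```

Each held-out fold keeps the 2:1 class ratio of the full set, and the training share grows as the number of folds increases.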

The code is adapted from this article: https://blog.csdn.net/qq_44574333/article/details/108964488
