Tianchi [Beginner Competition]: Industrial Steam Volume Prediction Modeling, Part 2
Previously we tried linear models, DNNs, and CNNs for this prediction task; the results were mediocre.
Data processing
Reading the data
import pandas as pd
import numpy as np

train_data_path = "data/zhengqi_train.txt"
test_data_path = "data/zhengqi_test.txt"
source = pd.read_table(train_data_path, sep='\t').values
test_data = pd.read_table(test_data_path, sep='\t').values
source_data = source[:, 0:-1]   # feature columns
source_target = source[:, -1]   # last column is the target
Normalization
# Min-max normalization; the normalized columns are appended to the raw features
source_mean = (source_data - np.min(source_data, axis=0)) / (np.max(source_data, axis=0) - np.min(source_data, axis=0))
source_mean = np.column_stack((source_data, source_mean))
test_mean = (test_data - np.min(test_data, axis=0)) / (np.max(test_data, axis=0) - np.min(test_data, axis=0))
test_mean = np.column_stack((test_data, test_mean))
PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
We are not yet sure whether the normalization and PCA steps are really necessary.
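The PCA-transformed matrices used in the data split below (`source_mean_pca`) are not shown explicitly in the post; a minimal sketch of how they could be produced, assuming the PCA is fit on the normalized training features and then reused for the test features (the variable name `test_mean_pca` is an assumption):

# Hypothetical completion of the PCA step: fit on the normalized training
# features and apply the same transform to the test features.
source_mean_pca = pca.fit_transform(source_mean)
test_mean_pca = pca.transform(test_mean)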
- Data split
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(source_data, source_target, test_size=0.2, random_state=40)
X_train_mean, X_test_mean, Y_train, Y_test = train_test_split(source_mean, source_target, test_size=0.2, random_state=40)
X_train_mean_pca, X_test_mean_pca, Y_train, Y_test = train_test_split(source_mean_pca, source_target, test_size=0.2, random_state=40)
XGB
Grid search tuning
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

xgb_params = {'learning_rate': 0.1, 'n_estimators': 500,
              'max_depth': 5, 'min_child_weight': 1,
              'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8,
              'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
back_params = {
    'n_estimators': [i for i in range(500, 700, 10)],
    'learning_rate': np.linspace(0.01, 0.2, 20),
    'max_depth': [i for i in range(3, 11)],
    'min_child_weight': [i for i in range(1, 7)],
    'gamma': np.linspace(0, 1, 11),
    'subsample': np.linspace(0.1, 0.9, 9),
    'colsample_bytree': np.linspace(0.1, 0.9, 9),
    'reg_alpha': np.linspace(0.1, 3, 30),
    'reg_lambda': np.linspace(0.1, 3, 30),
}
# Tune one parameter at a time, carrying the best value forward into xgb_params
for param in back_params:
    temp_param = {param: back_params[param]}
    estimator = xgb.XGBRegressor(**xgb_params)
    optimized_XGB = GridSearchCV(estimator, param_grid=temp_param, scoring='neg_mean_squared_error',
                                 cv=5, verbose=False, n_jobs=10)
    # optimized_XGB.fit(train, train_target)
    # before adding the normalized columns: 0.10844018858214664
    # after adding the normalized columns:
    optimized_XGB.fit(X_train_mean, Y_train)
    xgb_params.update(optimized_XGB.best_params_)
    print('Best value for the parameter: {0}'.format(optimized_XGB.best_params_))
    print('Best model score: {0}'.format(-optimized_XGB.best_score_))
print(xgb_params)
Tuning log
Best value for the parameter: {'n_estimators': 690}
Best model score: 0.1248743341047175
Best value for the parameter: {'learning_rate': 0.02}
Best model score: 0.11986767956401062
Best value for the parameter: {'max_depth': 4}
Best model score: 0.11970729128001024
Best value for the parameter: {'min_child_weight': 4}
Best model score: 0.1177363914351903
Best value for the parameter: {'gamma': 0.0}
Best model score: 0.1177363914351903
Best value for the parameter: {'subsample': 0.4}
Best model score: 0.1162276478005415
Best value for the parameter: {'colsample_bytree': 0.6}
Best model score: 0.1155322491989233
Best value for the parameter: {'reg_alpha': 0.5}
Best model score: 0.1149495825651894
Best value for the parameter: {'reg_lambda': 0.1}
Best model score: 0.11459993910716135
Resulting parameters
xgb_params_without_mean = {'learning_rate': 0.02, 'n_estimators': 660, 'max_depth': 7, 'min_child_weight': 1, 'seed': 0, 'subsample': 0.1, 'colsample_bytree': 0.2, 'gamma': 0.0, 'reg_alpha': 0.2, 'reg_lambda': 0.3}
xgb_params_with_mean = {'learning_rate': 0.02, 'n_estimators': 690, 'max_depth': 4, 'min_child_weight': 4, 'seed': 0, 'subsample': 0.4, 'colsample_bytree': 0.6, 'gamma': 0.0, 'reg_alpha': 0.5, 'reg_lambda': 0.1}
- Training the model
from sklearn.metrics import mean_squared_error

def train_xgb(params, X_train, Y_train, X_test, Y_test):
    xgb_model = xgb.XGBRegressor(**params)
    xgb_model.fit(X_train, Y_train)
    Y_pred = xgb_model.predict(X_test)
    return mean_squared_error(Y_test, Y_pred)
- Results on the held-out split
print("Baseline XGB: {}".format(
    train_xgb(xgb_params_without_mean, X_train, Y_train, X_test, Y_test)))
print("XGB with normalized columns: {}".format(
    train_xgb(xgb_params_with_mean, X_train_mean, Y_train, X_test_mean, Y_test)))
print("XGB with normalization + PCA: {}".format(
    train_xgb(xgb_params_with_mean, X_train_mean_pca, Y_train, X_test_mean_pca, Y_test)))
The PCA parameters were not tuned: PCA is meant to guard against overfitting, but our tests show it actually hurts performance here.
Baseline XGB: 0.09825279843671066
XGB with normalized columns: 0.09208487929988615
XGB with normalization + PCA: 0.10937159141681482
LightGBM
Grid search tuning
from lightgbm import LGBMRegressor

LGBM_params = {'num_leaves': 50, 'max_depth': 13, 'learning_rate': 0.1,
               'n_estimators': 400, 'min_child_weight': 1, 'subsample': 0.8,
               'colsample_bytree': 0.8, 'nthread': 7, 'objective': 'regression'}
back_params = {
    'n_estimators': [i for i in range(400, 900, 25)],
    'num_leaves': [i for i in range(10, 45, 5)],
    'max_depth': [i for i in range(3, 11)],
    'learning_rate': np.linspace(0.01, 0.2, 20),
    'min_child_weight': [i for i in range(1, 7)],
    'subsample': np.linspace(0.1, 0.9, 9),
    'colsample_bytree': np.linspace(0.1, 0.9, 9),
}
# Same one-parameter-at-a-time grid search as for XGB
for param in back_params:
    temp_param = {param: back_params[param]}
    estimator = LGBMRegressor(**LGBM_params)
    optimized_LGBM = GridSearchCV(estimator, param_grid=temp_param, scoring='neg_mean_squared_error',
                                  cv=5, verbose=False, n_jobs=10)
    optimized_LGBM.fit(X_train_mean, Y_train)
    LGBM_params.update(optimized_LGBM.best_params_)
    print('Best value for the parameter: {0}'.format(optimized_LGBM.best_params_))
    print('Best model score: {0}'.format(-optimized_LGBM.best_score_))
print(LGBM_params)
Tuning log
Best value for the parameter: {'n_estimators': 625}
Best model score: 0.11614724424510146
Best value for the parameter: {'num_leaves': 20}
Best model score: 0.11266928071626249
Best value for the parameter: {'max_depth': 5}
Best model score: 0.11318898749529319
Best value for the parameter: {'learning_rate': 0.1}
Best model score: 0.11318898749529319
Best value for the parameter: {'min_child_weight': 1}
Best model score: 0.11318898749529319
Best value for the parameter: {'subsample': 0.1}
Best model score: 0.11318898749529319
Best value for the parameter: {'colsample_bytree': 0.3}
Best model score: 0.11188844667277635
Resulting parameters
lgbm_params_without_mean = {'num_leaves': 15, 'max_depth': 10, 'learning_rate': 0.03, 'n_estimators': 775, 'min_child_weight': 1, 'subsample': 0.1, 'colsample_bytree': 0.3, 'nthread': 7, 'objective': 'regression'}
lgbm_params_with_mean = {'num_leaves': 20, 'max_depth': 5, 'learning_rate': 0.1, 'n_estimators': 625, 'min_child_weight': 1, 'subsample': 0.1, 'colsample_bytree': 0.3, 'nthread': 7, 'objective': 'regression'}
- Training and evaluating the model
def train_LGBM(params, X_train, Y_train, X_test, Y_test):
    LGBM_model = LGBMRegressor(**params)
    LGBM_model.fit(X_train, Y_train)
    Y_pred = LGBM_model.predict(X_test)
    return mean_squared_error(Y_test, Y_pred)

print("Baseline LGBM: {}".format(
    train_LGBM(lgbm_params_without_mean, X_train, Y_train, X_test, Y_test)))
print("LGBM with normalized columns: {}".format(
    train_LGBM(lgbm_params_with_mean, X_train_mean, Y_train, X_test_mean, Y_test)))
print("LGBM with normalization + PCA: {}".format(
    train_LGBM(lgbm_params_without_mean, X_train_mean_pca, Y_train, X_test_mean_pca, Y_test)))
Baseline LGBM: 0.09394115199163632
LGBM with normalized columns: 0.09412533355830843
LGBM with normalization + PCA: 0.12151567895565014
CatBoost
- Grid search tuning
from catboost import CatBoostRegressor

cat_params = {'n_estimators': 82,
              'depth': 5,
              'learning_rate': 0.1,
              'l2_leaf_reg': 3,
              'loss_function': 'RMSE',
              'logging_level': 'Silent'}
back_params = {
    'n_estimators': [i for i in range(400, 900, 25)],
    'depth': [i for i in range(1, 10, 1)],
    'learning_rate': np.linspace(0.01, 0.2, 20),
    'l2_leaf_reg': [i for i in range(1, 5, 1)],
}
for param in back_params:
    temp_param = {param: back_params[param]}
    estimator = CatBoostRegressor(**cat_params)
    optimized_CAT = GridSearchCV(estimator, param_grid=temp_param, scoring='neg_mean_squared_error',
                                 cv=5, verbose=False, n_jobs=10)
    optimized_CAT.fit(X_train_mean, Y_train)
    cat_params.update(optimized_CAT.best_params_)
    print('Best value for the parameter: {0}'.format(optimized_CAT.best_params_))
    print('Best model score: {0}'.format(-optimized_CAT.best_score_))
print(cat_params)
- Tuning log
Best value for the parameter: {'n_estimators': 875}
Best model score: 0.11683246282153908
Best value for the parameter: {'depth': 4}
Best model score: 0.11546548200806038
Best value for the parameter: {'learning_rate': 0.08}
Best model score: 0.11471602626521735
Best value for the parameter: {'l2_leaf_reg': 3}
Best model score: 0.11471602626521735
- Resulting parameters
cat_params_without_mean = {'n_estimators': 875, 'depth': 4, 'learning_rate': 0.1, 'l2_leaf_reg': 3, 'loss_function': 'RMSE', 'logging_level': 'Silent'}
cat_params_with_mean = {'n_estimators': 875, 'depth': 4, 'learning_rate': 0.08, 'l2_leaf_reg': 3, 'loss_function': 'RMSE', 'logging_level': 'Silent'}
- Training the model
def train_cat(params, X_train, Y_train, X_test, Y_test):
    cat_model = CatBoostRegressor(**params)
    cat_model.fit(X_train, Y_train)
    Y_pred = cat_model.predict(X_test)
    return mean_squared_error(Y_test, Y_pred)
- Results
print("Baseline CatBoost: {}".format(
    train_cat(cat_params_without_mean, X_train, Y_train, X_test, Y_test)))
print("CatBoost with normalized columns: {}".format(
    train_cat(cat_params_with_mean, X_train_mean, Y_train, X_test_mean, Y_test)))
print("CatBoost with normalization + PCA: {}".format(
    train_cat(cat_params_with_mean, X_train_mean_pca, Y_train, X_test_mean_pca, Y_test)))
So far, appending the normalized columns gives a slight improvement in accuracy.
In scikit-learn, GradientBoostingClassifier is the GBDT class for classification and GradientBoostingRegressor is the one for regression. Their parameters are essentially the same, although a few options differ, such as the choices available for the loss function.
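A minimal illustration of the shared interface; the loss option names here are those of the older scikit-learn releases that match the parameter dictionaries used below (e.g. 'presort'):

from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Classification variant: loss is a classification objective ('deviance', 'exponential')
clf = GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100)
# Regression variant: loss is a regression objective ('ls', 'lad', 'huber', 'quantile')
reg = GradientBoostingRegressor(loss='huber', learning_rate=0.1, n_estimators=100)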
GradientBoostingRegressor
- Parameter tuning and output
from sklearn.ensemble import GradientBoostingRegressor

gbr_params = {'learning_rate': 0.03, 'loss': 'huber', 'max_depth': 3,
              'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2,
              'n_estimators': 100, 'random_state': 0, 'subsample': 0.8}
back_params = {
    'max_depth': [i for i in range(5, 15, 1)],
    'n_estimators': [i for i in range(75, 500, 25)],
    'learning_rate': np.linspace(0.01, 0.1, 10),
    'subsample': np.linspace(0.01, 0.1, 10),
    'min_samples_leaf': [i for i in range(1, 15, 1)],
    'min_samples_split': [i for i in range(2, 42, 2)]
}
for param in back_params:
    temp_param = {param: back_params[param]}
    estimator = GradientBoostingRegressor(**gbr_params)
    optimized_gbr = GridSearchCV(estimator, param_grid=temp_param,
                                 scoring='neg_mean_squared_error',
                                 cv=5, verbose=False, n_jobs=30)
    optimized_gbr.fit(X_train_mean, Y_train)
    gbr_params.update(optimized_gbr.best_params_)
    print('Best value for the parameter: {0}'.format(optimized_gbr.best_params_))
    print('Best model score: {0}'.format(-optimized_gbr.best_score_))
print(gbr_params)
Best value for the parameter: {'max_depth': 6}
Best model score: 0.13130050103839636
Best value for the parameter: {'n_estimators': 475}
Best model score: 0.11793377395668483
Best value for the parameter: {'learning_rate': 0.020000000000000004}
Best model score: 0.1178641436354182
Best value for the parameter: {'subsample': 0.09000000000000001}
Best model score: 0.11920232050809215
Best value for the parameter: {'min_samples_leaf': 5}
Best model score: 0.11534924728998576
Best value for the parameter: {'min_samples_split': 2}
Best model score: 0.11534924728998576
- Best parameters: one set is the one just computed, the other comes from a writeup shared on Tianchi.
gbr_params_with_mean = {'learning_rate': 0.02, 'loss': 'huber', 'max_depth': 6, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 5, 'min_samples_split': 2, 'n_estimators': 475, 'random_state': 0, 'subsample': 0.09}
gbr_params_from_tianchi = {'alpha': 0.9, 'criterion': 'friedman_mse', 'init': None,
                           'learning_rate': 0.03, 'loss': 'huber', 'max_depth': 14,
                           'max_features': 'sqrt', 'max_leaf_nodes': None,
                           'min_impurity_decrease': 0.0, 'min_impurity_split': None,
                           'min_samples_leaf': 10, 'min_samples_split': 40,
                           'min_weight_fraction_leaf': 0.0, 'n_estimators': 300,
                           'presort': 'auto', 'random_state': 10, 'subsample': 0.8,
                           'verbose': 0, 'warm_start': False}
The set shared on Tianchi gets a very low local MSE of only about 0.089, but its online score is much worse, roughly 0.16.
- Training and evaluating the model
def train_gbr(params, X_train, Y_train, X_test, Y_test):
    gbr_model = GradientBoostingRegressor(**params)
    gbr_model.fit(X_train, Y_train)
    Y_pred = gbr_model.predict(X_test)
    return mean_squared_error(Y_test, Y_pred)

print("Tianchi GBR: {}".format(
    train_gbr(gbr_params_from_tianchi, X_train, Y_train, X_test, Y_test)))
print("Tuned GBR: {}".format(
    train_gbr(gbr_params_with_mean, X_train_mean, Y_train, X_test_mean, Y_test)))
print("Tuned GBR on raw features: {}".format(
    train_gbr(gbr_params_with_mean, X_train, Y_train, X_test, Y_test)))
- Output
Tianchi GBR: 0.08918637637655073
Tuned GBR: 0.0936784270923323
Tuned GBR on raw features: 0.09358718807049533
Ensembling
Ensembling sounds almost like black magic; for an introduction, see the KAGGLE ENSEMBLING GUIDE.
# Brute-force search for blend weights over the four models' validation predictions.
# Y_pred_mean_xgb / _lgbm / _cat / _gbr are presumably the predictions of the four
# tuned models on the held-out split.
best = [0, 0, 0, 0, 10]   # [w_xgb, w_lgbm, w_cat, w_gbr, best MSE so far]
for i in np.linspace(0.1, 10, 100):
    for j in np.linspace(0.1, 10, 100):
        for z in np.linspace(0.1, 10, 100):
            for k in np.linspace(0.1, 10, 100):
                Y_predict_mix = (Y_pred_mean_xgb*i + Y_pred_mean_lgbm*j
                                 + Y_pred_mean_cat*z + Y_pred_mean_gbr*k
                                 ) / (i+j+z+k)
                temp_mse = mean_squared_error(Y_predict_mix, Y_test)
                if best[4] > temp_mse:   # the MSE is stored at index 4
                    best = [i, j, z, k, temp_mse]
print(best)
Y_predict_mix = (Y_pred_mean_xgb*best[0]
                 + Y_pred_mean_lgbm*best[1]
                 + Y_pred_mean_cat*best[2]
                 + Y_pred_mean_gbr*best[3]) / (best[0]+best[1]+best[2]+best[3])
print(mean_squared_error(Y_predict_mix, Y_test))
This part still needs to be double-checked later.
- Weighted predictions on the test set
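The exhaustive four-level loop above evaluates roughly 100^4 = 10^8 weight combinations, which is very slow. As a hedged alternative sketch (not what was actually run in this post), the same blend weights could be found with a constrained optimizer; the variable `preds` below is introduced only for illustration:

from scipy.optimize import minimize

# Hypothetical alternative: optimize the blend weights directly instead of
# brute-forcing a grid. preds holds the four validation-set predictions.
preds = [Y_pred_mean_xgb, Y_pred_mean_lgbm, Y_pred_mean_cat, Y_pred_mean_gbr]

def blend_mse(w):
    mix = sum(wi * p for wi, p in zip(w, preds)) / sum(w)
    return mean_squared_error(Y_test, mix)

res = minimize(blend_mse, x0=[1.0] * 4, bounds=[(0.01, 10)] * 4, method='L-BFGS-B')
print(res.x, res.fun)   # optimized weights and the corresponding MSE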
# Predict on the test features with the four fitted models and blend them
# with the weights found above
test_pred_xgb = xgb_model.predict(test_mean)
test_pred_lgbm = LGBM_model.predict(test_mean)
test_pred_cat = cat_model.predict(test_mean)
test_pred_gbr = gbr_model.predict(test_mean)
test_pred_mix = (test_pred_xgb*best[0] + test_pred_lgbm*best[1]
                 + test_pred_cat*best[2] + test_pred_gbr*best[3]
                 ) / (best[0]+best[1]+best[2]+best[3])
- Saving the predictions
with open("result_121018_LGBM_CAT_XGB_BGR.txt", "w") as f1:
    temp = "\n".join(str(v) for v in test_pred_mix.tolist())
    f1.write(temp)
Summary
The local MSE is down to about 0.088, but after submitting, the leaderboard score is 0.1378, which is actually worse than XGB on its own.
A reasonable guess is that the test data does not come from the same distribution as the training data.
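One way to sanity-check that guess (not something done in this post) is to compare each feature's train and test distributions, for example with a two-sample Kolmogorov-Smirnov test; the 0.01 threshold below is an arbitrary illustrative choice:

from scipy.stats import ks_2samp

# Hypothetical check: features whose train/test distributions differ strongly
# (very small p-value) are candidates for the suspected train/test mismatch.
for col in range(source_data.shape[1]):
    stat, p_value = ks_2samp(source_data[:, col], test_data[:, col])
    if p_value < 0.01:
        print("feature {}: KS statistic {:.3f}, p-value {:.2e}".format(col, stat, p_value))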