Tianchi · [Beginner Competition] Industrial Steam Volume Prediction: Modeling Algorithms [Part 2]

Earlier attempts with linear models, a DNN, and a CNN gave mediocre results.

Data processing

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

train_data_path = "data/zhengqi_train.txt"
test_data_path = "data/zhengqi_test.txt"

source = pd.read_table(train_data_path, sep='\t').values
test_data = pd.read_table(test_data_path, sep='\t').values

# Split features and target (the target is the last column)
source_data = source[:, 0:-1]
source_target = source[:, -1]

# Min-max normalization, then append the normalized columns to the raw ones
source_mean = (source_data - np.min(source_data, axis=0)) / (np.max(source_data, axis=0) - np.min(source_data, axis=0))
source_mean = np.column_stack((source_data, source_mean))

# Note: the test set is scaled with its own min/max, not the training set's
test_mean = (test_data - np.min(test_data, axis=0)) / (np.max(test_data, axis=0) - np.min(test_data, axis=0))
test_mean = np.column_stack((test_data, test_mean))

# PCA keeping enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
pca.fit(source_mean)
source_mean_pca = pca.transform(source_mean)
test_mean_pca = pca.transform(test_mean)

We are not yet sure whether the normalization and PCA steps are really needed.
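
One detail that may matter when judging the normalization step: the code above scales the test set with its own minimum and maximum, so the same raw value can map to different scaled values in train and test. A minimal sketch of reusing the training statistics instead is shown here. The helper names `minmax_fit` and `minmax_apply` are hypothetical and the arrays are toy data, not the competition files.

```python
import numpy as np

def minmax_fit(X):
    """Return per-column (min, range) computed from X; zero ranges become 1."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return col_min, col_range

def minmax_apply(X, col_min, col_range):
    """Scale X with previously fitted statistics."""
    return (X - col_min) / col_range

# Toy data (assumption): three training rows, one test row
train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]])
test = np.array([[1.0, 40.0]])

col_min, col_range = minmax_fit(train)
train_scaled = minmax_apply(train, col_min, col_range)
test_scaled = minmax_apply(test, col_min, col_range)
print(test_scaled)  # test values may fall outside [0, 1]
```

With this approach a test value outside the training range simply lands outside [0, 1], which keeps the mapping identical for both sets.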

from sklearn.model_selection import train_test_split

# The same random_state gives identical row splits for all three feature sets
X_train, X_test, Y_train, Y_test = train_test_split(source_data, source_target, test_size=0.2, random_state=40)
X_train_mean, X_test_mean, Y_train, Y_test = train_test_split(source_mean, source_target, test_size=0.2, random_state=40)
X_train_mean_pca, X_test_mean_pca, Y_train, Y_test = train_test_split(source_mean_pca, source_target, test_size=0.2, random_state=40)

XGB

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Starting point for the search
xgb_params = {'learning_rate': 0.1, 'n_estimators': 500, 
              'max_depth': 5, 'min_child_weight': 1, 
              'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8, 
              'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
# Candidate values, searched one parameter at a time
back_params = {
    'n_estimators': [i for i in range(500,700,10)],
    'learning_rate': np.linspace(0.01,0.2,20),
    'max_depth': [i for i in range(3,11)], 
    'min_child_weight': [i for i in range(1,7)],
    'gamma': np.linspace(0,1,11),
    'subsample': np.linspace(0.1,0.9,9), 
    'colsample_bytree': np.linspace(0.1,0.9,9),
    'reg_alpha': np.linspace(0.1,3,30), 
    'reg_lambda': np.linspace(0.1,3,30),
}
for param in back_params:
    temp_param = {param:back_params[param]}
    estimator = xgb.XGBRegressor(**xgb_params)
    optimized_XGB = GridSearchCV(estimator, param_grid = temp_param, scoring='neg_mean_squared_error',
                           cv=5, verbose=False, n_jobs=10)
    # optimized_XGB.fit(train, train_target)
    # Before adding the normalized columns: 0.10844018858214664
    # After adding the normalized columns (Y_train is shared by all feature sets)
    optimized_XGB.fit(X_train_mean, Y_train)
    
    # Keep the best value found before tuning the next parameter
    xgb_params.update(optimized_XGB.best_params_)
    print('Best parameter value: {0}'.format(optimized_XGB.best_params_))
    print('Best model score: {0}'.format(-optimized_XGB.best_score_))
print(xgb_params)
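
The loop above tunes one parameter at a time and carries the winner forward. A stripped-down sketch of that pattern, without any model, may make its main limitation easier to see: when parameters interact, a single pass can stop short of the joint optimum. The function `coordinate_search` and the loss `toy_loss` are hypothetical illustrations, and unlike `GridSearchCV` this sketch also keeps the current value as a candidate.

```python
def coordinate_search(score_fn, params, grid):
    """Tune one parameter at a time, keeping the best value before moving on.

    score_fn(params) returns a loss to minimize; lower is better.
    """
    params = dict(params)
    for name, candidates in grid.items():
        best_value, best_loss = params[name], score_fn(params)
        for value in candidates:
            loss = score_fn({**params, name: value})
            if loss < best_loss:
                best_value, best_loss = value, loss
        params[name] = best_value
    return params, score_fn(params)

# Toy loss with an interaction between a and b (hypothetical, for illustration)
def toy_loss(p):
    return (p["a"] - 3) ** 2 + (p["b"] - p["a"]) ** 2

start = {"a": 0, "b": 0}
grid = {"a": range(0, 6), "b": range(0, 6)}
tuned, loss = coordinate_search(toy_loss, start, grid)
print(tuned, loss)                    # {'a': 1, 'b': 1} 4
print(toy_loss({"a": 3, "b": 3}))     # 0, the joint optimum the pass missed
```

Repeating the pass, or searching a few interacting parameters jointly, are two common ways to narrow that gap at the cost of more fits.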

Tuning progress

Best parameter value: {'n_estimators': 690}
Best model score: 0.1248743341047175
Best parameter value: {'learning_rate': 0.02}
Best model score: 0.11986767956401062
Best parameter value: {'max_depth': 4}
Best model score: 0.11970729128001024
Best parameter value: {'min_child_weight': 4}
Best model score: 0.1177363914351903
Best parameter value: {'gamma': 0.0}
Best model score: 0.1177363914351903
Best parameter value: {'subsample': 0.4}
Best model score: 0.1162276478005415
Best parameter value: {'colsample_bytree': 0.6}
Best model score: 0.1155322491989233
Best parameter value: {'reg_alpha': 0.5}
Best model score: 0.1149495825651894
Best parameter value: {'reg_lambda': 0.1}
Best model score: 0.11459993910716135
from sklearn.metrics import mean_squared_error

# Tuned on the raw features / on the features with normalized columns added
xgb_params_without_mean = {'learning_rate': 0.02, 'n_estimators': 660, 'max_depth': 7, 'min_child_weight': 1, 'seed': 0, 'subsample': 0.1, 'colsample_bytree': 0.2, 'gamma': 0.0, 'reg_alpha': 0.2, 'reg_lambda': 0.3}
xgb_params_with_mean = {'learning_rate': 0.02, 'n_estimators': 690, 'max_depth': 4, 'min_child_weight': 4, 'seed': 0, 'subsample': 0.4, 'colsample_bytree': 0.6, 'gamma': 0.0, 'reg_alpha': 0.5, 'reg_lambda': 0.1}

def train_xgb(params,X_train, Y_train,X_test,Y_test):
    # Fit on the training split and return the hold-out MSE
    xgb_model = xgb.XGBRegressor(**params)
    xgb_model.fit(X_train, Y_train)
    Y_pred = xgb_model.predict(X_test)
    return mean_squared_error(Y_test,Y_pred)

print("Baseline XGB: {}".format(
    train_xgb(xgb_params_without_mean,X_train, Y_train,X_test,Y_test)))
print("XGB with normalized columns: {}".format(
    train_xgb(xgb_params_with_mean,X_train_mean, Y_train,X_test_mean,Y_test)))
print("XGB with normalization + PCA: {}".format(
    train_xgb(xgb_params_with_mean,X_train_mean_pca, Y_train, X_test_mean_pca,Y_test)))

Result

Baseline XGB: 0.09825279843671066
XGB with normalized columns: 0.09208487929988615
XGB with normalization + PCA: 0.10937159141681482

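The PCA parameters were not tuned. PCA was meant to guard against overfitting, but in these tests it made the results worse (0.1094 versus 0.0921 for XGB).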
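
For context on what `n_components=0.95` selects, here is a rough NumPy sketch of the rule: keep the smallest number of components whose cumulative explained-variance ratio reaches the target. The helper `components_for_variance` is a hypothetical name and the data is synthetic. The example also appends min-max copies of the columns, as the data processing step does, to show that such copies add no new directions after centering.

```python
import numpy as np

def components_for_variance(X, target=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`, plus the cumulative ratios."""
    Xc = X - X.mean(axis=0)  # PCA operates on centered data
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    variance = singular_values ** 2
    ratio = np.cumsum(variance) / variance.sum()
    return int(np.searchsorted(ratio, target) + 1), ratio

# Synthetic data (assumption): two independent raw columns
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
scaled = (base - base.min(axis=0)) / (base.max(axis=0) - base.min(axis=0))

# Raw columns plus their min-max copies: four columns, two directions
X = np.column_stack([base, scaled])
k, ratio = components_for_variance(X)
print(k, ratio.round(4))
```

Because each normalized column is an affine function of its raw column, the four-column matrix still has only two independent directions, so two components capture essentially all of the variance.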

lightGBM

from lightgbm import LGBMRegressor

LGBM_params = {'num_leaves':50,'max_depth':13,'learning_rate':0.1, 
    'n_estimators':400, 'min_child_weight':1, 'subsample':0.8,
    'colsample_bytree':0.8, 'nthread':7,'objective':'regression'}
back_params = {
    'n_estimators': [i for i in range(400,900,25)],
    'num_leaves': [i for i in range(10,45,5)],
    'max_depth': [i for i in range(3,11)],
    'learning_rate': np.linspace(0.01,0.2,20),
    'min_child_weight': [i for i in range(1,7)],
    # Note: LightGBM applies subsample only when subsample_freq > 0,
    # so with the default of 0 this search cannot change the score
    'subsample': np.linspace(0.1,0.9,9),
    'colsample_bytree': np.linspace(0.1,0.9,9),
}
for param in back_params:
    temp_param = {param:back_params[param]}
    estimator = LGBMRegressor(**LGBM_params)
    optimized_LGBM = GridSearchCV(estimator, param_grid = temp_param, scoring='neg_mean_squared_error',
                           cv=5, verbose=False, n_jobs=10)
    # Y_train is shared by all feature sets
    optimized_LGBM.fit(X_train_mean, Y_train)
    
    LGBM_params.update(optimized_LGBM.best_params_)
    print('Best parameter value: {0}'.format(optimized_LGBM.best_params_))
    print('Best model score: {0}'.format(-optimized_LGBM.best_score_))
print(LGBM_params)

Tuning progress

Best parameter value: {'n_estimators': 625}
Best model score: 0.11614724424510146
Best parameter value: {'num_leaves': 20}
Best model score: 0.11266928071626249
Best parameter value: {'max_depth': 5}
Best model score: 0.11318898749529319
Best parameter value: {'learning_rate': 0.1}
Best model score: 0.11318898749529319
Best parameter value: {'min_child_weight': 1}
Best model score: 0.11318898749529319
Best parameter value: {'subsample': 0.1}
Best model score: 0.11318898749529319
Best parameter value: {'colsample_bytree': 0.3}
Best model score: 0.11188844667277635
lgbm_params_without_mean = {'num_leaves': 15, 'max_depth': 10, 'learning_rate': 0.03, 'n_estimators': 775, 'min_child_weight': 1, 'subsample': 0.1, 'colsample_bytree': 0.3, 'nthread': 7, 'objective': 'regression'}
lgbm_params_with_mean = {'num_leaves': 20, 'max_depth': 5, 'learning_rate': 0.1, 'n_estimators': 625, 'min_child_weight': 1, 'subsample': 0.1, 'colsample_bytree': 0.3, 'nthread': 7, 'objective': 'regression'}

def train_LGBM(params,X_train, Y_train,X_test,Y_test):
    # Fit on the training split and return the hold-out MSE
    LGBM_model = LGBMRegressor(**params)
    LGBM_model.fit(X_train, Y_train)
    Y_pred = LGBM_model.predict(X_test)
    return mean_squared_error(Y_test,Y_pred)

print("Baseline LGBM: {}".format(
    train_LGBM(lgbm_params_without_mean,X_train, Y_train,X_test,Y_test)))
print("LGBM with normalized columns: {}".format(
    train_LGBM(lgbm_params_with_mean,X_train_mean, Y_train,X_test_mean,Y_test)))
# Note: the PCA run uses the without_mean parameter set, unlike the XGB run
print("LGBM with normalization + PCA: {}".format(
    train_LGBM(lgbm_params_without_mean,X_train_mean_pca, Y_train, X_test_mean_pca,Y_test)))

Result

Baseline LGBM: 0.09394115199163632
LGBM with normalized columns: 0.09412533355830843
LGBM with normalization + PCA: 0.12151567895565014

catboost

from catboost import CatBoostRegressor

cat_params = {'n_estimators': 82,
             'depth': 5,
             'learning_rate': 0.1,
             'l2_leaf_reg': 3,
             'loss_function': 'RMSE',
             'logging_level': 'Silent'}

back_params = {
    'n_estimators': [i for i in range(400,900,25)],
    'depth': [i for i in range(1,10,1)],
    'learning_rate':np.linspace(0.01,0.2,20),
    'l2_leaf_reg': [i for i in range(1,5,1)],
}
for param in back_params:
    temp_param = {param:back_params[param]}
    estimator = CatBoostRegressor(**cat_params)
    optimized_CAT = GridSearchCV(estimator, param_grid = temp_param, scoring='neg_mean_squared_error',
                           cv=5, verbose=False, n_jobs=10)
    # Y_train is shared by all feature sets
    optimized_CAT.fit(X_train_mean, Y_train)
    
    cat_params.update(optimized_CAT.best_params_)
    print('Best parameter value: {0}'.format(optimized_CAT.best_params_))
    print('Best model score: {0}'.format(-optimized_CAT.best_score_))
print(cat_params)

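Tuning progress

Best parameter value: {'n_estimators': 875}
Best model score: 0.11683246282153908
Best parameter value: {'depth': 4}
Best model score: 0.11546548200806038
Best parameter value: {'learning_rate': 0.08}
Best model score: 0.11471602626521735
Best parameter value: {'l2_leaf_reg': 3}
Best model score: 0.11471602626521735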
cat_params_without_mean = {'n_estimators': 875, 'depth': 4, 'learning_rate': 0.1, 'l2_leaf_reg': 3, 'loss_function': 'RMSE', 'logging_level': 'Silent'}
cat_params_with_mean = {'n_estimators': 875, 'depth': 4, 'learning_rate': 0.08, 'l2_leaf_reg': 3, 'loss_function': 'RMSE', 'logging_level': 'Silent'}

def train_cat(params,X_train, Y_train,X_test,Y_test):
    # Fit on the training split and return the hold-out MSE
    cat_model = CatBoostRegressor(**params)
    cat_model.fit(X_train,Y_train)
    Y_pred = cat_model.predict(X_test)
    return mean_squared_error(Y_test,Y_pred)

print("Baseline CatBoost: {}".format(
    train_cat(cat_params_without_mean,X_train, Y_train,X_test,Y_test)))
print("CatBoost with normalized columns: {}".format(
    train_cat(cat_params_with_mean,X_train_mean, Y_train,X_test_mean,Y_test)))
print("CatBoost with normalization + PCA: {}".format(
    train_cat(cat_params_with_mean,X_train_mean_pca, Y_train, X_test_mean_pca,Y_test)))

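So far, adding the normalized columns improves accuracy slightly for XGB (MSE 0.0983 to 0.0921), while LGBM is essentially unchanged (0.0939 versus 0.0941).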

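In scikit-learn, GradientBoostingClassifier is the GBDT class for classification and GradientBoostingRegressor is the one for regression. The two take almost the same parameters, although for some of them, such as the loss function `loss`, the available options differ.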
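
As one example of a regression-only loss option, the Huber loss (selected with `'loss': 'huber'` in the configuration that follows) is quadratic for small residuals and linear for large ones, which makes it less sensitive to outliers than squared error. The sketch below uses a fixed threshold `delta`, which is a simplifying assumption: scikit-learn derives its threshold from the `alpha` quantile of the residuals rather than taking a fixed value.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
huber = huber_loss(residuals)
squared = 0.5 * residuals ** 2
print(huber)    # [2.5   0.125 0.    0.125 2.5  ]
print(squared)  # [4.5   0.125 0.    0.125 4.5  ]
```

The two losses agree inside the threshold and diverge outside it, where the Huber penalty grows more slowly.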

GradientBoostingRegressor

from sklearn.ensemble import GradientBoostingRegressor

gbr_params = {'learning_rate':0.03, 'loss':'huber', 'max_depth':3,
              'min_impurity_decrease':0.0, 'min_samples_leaf':1, 'min_samples_split':2,
              'n_estimators':100, 'random_state':0, 'subsample':0.8}
back_params = {
    'max_depth': [i for i in range(5,15,1)],
    'n_estimators': [i for i in range(75,500,25)],
    'learning_rate':np.linspace(0.01,0.1,10),
    # Note: this range tops out at 0.1, far below the initial 0.8; a range
    # like the 0.1-0.9 used for XGB may have been intended
    'subsample': np.linspace(0.01,0.1,10),
    'min_samples_leaf': [i for i in range(1,15,1)],
    'min_samples_split': [i for i in range(2,42,2)]
}
for param in back_params:
    temp_param = {param:back_params[param]}
    estimator = GradientBoostingRegressor(**gbr_params)
    optimized_gbr = GridSearchCV(estimator, param_grid = temp_param, 
                                 scoring='neg_mean_squared_error',
                                 cv=5, verbose=False, n_jobs=30)
    optimized_gbr.fit(X_train_mean, Y_train)
    
    gbr_params.update(optimized_gbr.best_params_)
    print('Best parameter value: {0}'.format(optimized_gbr.best_params_))
    print('Best model score: {0}'.format(-optimized_gbr.best_score_))
print(gbr_params)

Output

Best parameter value: {'max_depth': 6}
Best model score: 0.13130050103839636
Best parameter value: {'n_estimators': 475}
Best model score: 0.11793377395668483
Best parameter value: {'learning_rate': 0.020000000000000004}
Best model score: 0.1178641436354182
Best parameter value: {'subsample': 0.09000000000000001}
Best model score: 0.11920232050809215
Best parameter value: {'min_samples_leaf': 5}
Best model score: 0.11534924728998576
Best parameter value: {'min_samples_split': 2}
Best model score: 0.11534924728998576
gbr_params_with_mean = {'learning_rate': 0.02, 'loss': 'huber', 'max_depth': 6, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 5, 'min_samples_split': 2, 'n_estimators': 475, 'random_state': 0, 'subsample': 0.09}
# The shared set also listed 'min_impurity_split': None and 'presort': 'auto'
# (both at their defaults); they are dropped here because recent
# scikit-learn releases no longer accept them
gbr_params_from_tianchi = {'alpha':0.9, 'criterion':'friedman_mse', 'init':None, 
              'learning_rate':0.03, 'loss':'huber', 'max_depth':14,
              'max_features':'sqrt', 'max_leaf_nodes':None,
              'min_impurity_decrease':0.0,
              'min_samples_leaf':10, 'min_samples_split':40,
              'min_weight_fraction_leaf':0.0, 'n_estimators':300,
              'random_state':10, 'subsample':0.8, 
              'verbose':0,'warm_start':False}

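One set was just tuned above; the other was shared on Tianchi.

The set shared on Tianchi has a low local MSE, only about 0.089, but its online score is poor, roughly 0.16.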

def train_gbr(params,X_train, Y_train,X_test,Y_test):
    # Fit on the training split and return the hold-out MSE
    gbr_model = GradientBoostingRegressor(**params)
    gbr_model.fit(X_train,Y_train)
    Y_pred = gbr_model.predict(X_test)
    return mean_squared_error(Y_test,Y_pred)

print("Tianchi GBR: {}".format(
    train_gbr(gbr_params_from_tianchi,X_train,Y_train,X_test,Y_test)))
print("Tuned GBR: {}".format(
    train_gbr(gbr_params_with_mean,X_train_mean,Y_train,X_test_mean,Y_test)))
# Note: as written, this call uses the tuned parameters on the raw features,
# which does not match its label
print("Tianchi GBR with normalization: {}".format(
    train_gbr(gbr_params_with_mean,X_train,Y_train,X_test,Y_test)))

Result

Tianchi GBR: 0.08918637637655073
Tuned GBR: 0.0936784270923323
Tianchi GBR with normalization: 0.09358718807049533

ensembling

Ensembling is a technique that sounds almost like black magic; for an overview, see the KAGGLE ENSEMBLING GUIDE.

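# The post does not show where the Y_pred_mean_* arrays come from; they are
# assumed here to be hold-out predictions of each model fit on X_train_mean
xgb_model = xgb.XGBRegressor(**xgb_params_with_mean).fit(X_train_mean, Y_train)
LGBM_model = LGBMRegressor(**lgbm_params_with_mean).fit(X_train_mean, Y_train)
cat_model = CatBoostRegressor(**cat_params_with_mean).fit(X_train_mean, Y_train)
gbr_model = GradientBoostingRegressor(**gbr_params_with_mean).fit(X_train_mean, Y_train)
Y_pred_mean_xgb = xgb_model.predict(X_test_mean)
Y_pred_mean_lgbm = LGBM_model.predict(X_test_mean)
Y_pred_mean_cat = cat_model.predict(X_test_mean)
Y_pred_mean_gbr = gbr_model.predict(X_test_mean)

# best = [weight_xgb, weight_lgbm, weight_cat, weight_gbr, lowest MSE so far]
best = [0,0,0,0,10]
for i in np.linspace(0.1,10,100):
    for j in np.linspace(0.1,10,100):
        for z in np.linspace(0.1,10,100):
            for k in np.linspace(0.1,10,100):
                Y_predict_mix = (Y_pred_mean_xgb*i + Y_pred_mean_lgbm*j
                                 + Y_pred_mean_cat*z + Y_pred_mean_gbr*k
                                ) / (i+j+z+k)
                temp_mse = mean_squared_error(Y_test,Y_predict_mix)
                # Compare against the stored MSE (index 4), not a weight
                if best[4] > temp_mse:
                    best = [i,j,z,k,temp_mse]
print(best)
Y_predict_mix = (Y_pred_mean_xgb*best[0] 
                             + Y_pred_mean_lgbm*best[1]
                             + Y_pred_mean_cat*best[2]
                             + Y_pred_mean_gbr*best[3])/(best[0]+best[1]+best[2]+best[3])
print(mean_squared_error(Y_test,Y_predict_mix))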
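
The nested loops above evaluate 100^4 = 10^8 weight combinations, yet the blend divides by the sum of the weights, so only their ratios matter. A smaller search over weights that already sum to 1 covers the same family of blends with far fewer evaluations. The sketch below is one way to do that; `best_simplex_weights` is a hypothetical helper and the predictions are synthetic stand-ins, not the models from this post.

```python
import itertools
import numpy as np

def best_simplex_weights(preds, y_true, steps=10):
    """Search weight vectors with entries k/steps that sum to 1.

    preds: array of shape (n_models, n_samples).
    Returns (weights, mse) for the lowest-MSE blend found.
    """
    preds = np.asarray(preds, dtype=float)
    n_models = preds.shape[0]
    best_w, best_mse = None, np.inf
    for combo in itertools.product(range(steps + 1), repeat=n_models - 1):
        if sum(combo) > steps:
            continue  # the last weight would be negative
        w = np.array(combo + (steps - sum(combo),), dtype=float) / steps
        mse = np.mean((w @ preds - y_true) ** 2)
        if mse < best_mse:
            best_w, best_mse = w, mse
    return best_w, best_mse

# Synthetic stand-ins for four models' hold-out predictions (assumption)
rng = np.random.default_rng(0)
y = rng.normal(size=500)
preds = np.stack([y + rng.normal(scale=s, size=500) for s in (0.3, 0.4, 0.5, 0.6)])
weights, mse = best_simplex_weights(preds, y)
print(weights, mse)
```

Whichever search is used, weights picked on the same hold-out split that reports the MSE will make that local MSE look better than it is; choosing them on a separate split or on out-of-fold predictions gives a fairer estimate.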

To be checked later:

test_pred_xgb = xgb_model.predict(test_mean)
test_pred_lgbm = LGBM_model.predict(test_mean)
test_pred_cat = cat_model.predict(test_mean)
test_pred_gbr = gbr_model.predict(test_mean)

# Keep the division inside the expression so it parses as one statement
test_pred_mix = (test_pred_xgb*best[0] + test_pred_lgbm*best[1]
                 + test_pred_cat*best[2] + test_pred_gbr*best[3]
                ) / (best[0]+best[1]+best[2]+best[3])
# Write one prediction per line for submission
with open("result_121018_LGBM_CAT_XGB_BGR.txt","w") as f1:
    temp = "\n".join(str(v) for v in test_pred_mix.tolist())
    f1.write(temp)

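Summary

The local MSE is now down to 0.088, but after submission the score is 0.1378, which is worse than XGB alone.

A reasonable guess is that the test data does not come from the same distribution as the training data.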
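
One way to probe that guess is to compare each feature's distribution in train and test. The sketch below computes a two-sample Kolmogorov-Smirnov statistic per column with NumPy and ranks the columns by it. The function names are hypothetical and the data is synthetic, so this only illustrates the check, not its outcome on the competition files.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of a and b (0 = identical, 1 = fully separated)."""
    a, b = np.sort(a), np.sort(b)
    points = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, points, side="right") / len(a)
    cdf_b = np.searchsorted(b, points, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def rank_shifted_features(train_X, test_X):
    """Return (column index, KS statistic) pairs, most shifted first."""
    stats = [(j, ks_statistic(train_X[:, j], test_X[:, j]))
             for j in range(train_X.shape[1])]
    return sorted(stats, key=lambda item: item[1], reverse=True)

# Synthetic example (assumption): column 1 is shifted in the "test" sample
rng = np.random.default_rng(0)
train_X = rng.normal(size=(1000, 3))
test_X = rng.normal(size=(1000, 3))
test_X[:, 1] += 2.0
ranked = rank_shifted_features(train_X, test_X)
print(ranked)
```

Applied to `source_data` and `test_data`, columns with a large statistic would be candidates for dropping or transforming before retraining.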
