Tianchi · [Beginner Contest] Industrial Steam Volume Prediction: Modeling Algorithms (Part 2)

We previously used linear models, DNNs, and CNNs for this prediction task, with only mediocre results.

Data Processing

  • Read the data

    import pandas as pd
    import numpy as np

    train_data_path = "data/zhengqi_train.txt"
    test_data_path = "data/zhengqi_test.txt"

    source = pd.read_table(train_data_path, sep='\t').values
    test_data = pd.read_table(test_data_path, sep='\t').values

    source_data = source[:, 0:-1]   # all columns except the last are features
    source_target = source[:, -1]   # the last column is the target
  • Normalization (min-max)

    # Min-max normalization; the scaled columns are stacked next to the
    # original features, doubling the feature count
    source_mean = (source_data - np.min(source_data, axis=0)) / (np.max(source_data, axis=0) - np.min(source_data, axis=0))
    source_mean = np.column_stack((source_data, source_mean))

    test_mean = (test_data - np.min(test_data, axis=0)) / (np.max(test_data, axis=0) - np.min(test_data, axis=0))
    test_mean = np.column_stack((test_data, test_mean))
  • PCA

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
pca.fit(source_mean)
source_mean_pca = pca.transform(source_mean)
test_mean_pca = pca.transform(test_mean)

We are not yet sure whether the normalization and PCA steps are actually needed.
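
To see what the 0.95 threshold actually keeps, it helps to inspect the fitted PCA object. A minimal check, reusing the pca object fitted above:

    # Number of retained components and the variance they explain (>= 0.95),
    # plus the resulting shapes of the transformed matrices
    print(pca.n_components_)
    print(pca.explained_variance_ratio_.sum())
    print(source_mean_pca.shape, test_mean_pca.shape)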

  • Train/test split
    from sklearn.model_selection import train_test_split

    # The shared random_state keeps the three splits aligned row-for-row
    X_train, X_test, Y_train, Y_test = train_test_split(source_data, source_target, test_size=0.2, random_state=40)
    X_train_mean, X_test_mean, Y_train, Y_test = train_test_split(source_mean, source_target, test_size=0.2, random_state=40)
    X_train_mean_pca, X_test_mean_pca, Y_train, Y_test = train_test_split(source_mean_pca, source_target, test_size=0.2, random_state=40)

XGB

  • Grid search tuning

    import xgboost as xgb
    from sklearn.model_selection import GridSearchCV

    xgb_params = {'learning_rate': 0.1, 'n_estimators': 500,
                  'max_depth': 5, 'min_child_weight': 1,
                  'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8,
                  'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
    back_params = {
        'n_estimators': [i for i in range(500, 700, 10)],
        'learning_rate': np.linspace(0.01, 0.2, 20),
        'max_depth': [i for i in range(3, 11)],
        'min_child_weight': [i for i in range(1, 7)],
        'gamma': np.linspace(0, 1, 11),
        'subsample': np.linspace(0.1, 0.9, 9),
        'colsample_bytree': np.linspace(0.1, 0.9, 9),
        'reg_alpha': np.linspace(0.1, 3, 30),
        'reg_lambda': np.linspace(0.1, 3, 30),
    }
    # Tune one parameter at a time, carrying each best value forward
    for param in back_params:
        temp_param = {param: back_params[param]}
        estimator = xgb.XGBRegressor(**xgb_params)
        optimized_XGB = GridSearchCV(estimator, param_grid=temp_param,
                                     scoring='neg_mean_squared_error',
                                     cv=5, verbose=False, n_jobs=10)
        # optimized_XGB.fit(train, train_target)
        # before adding the normalized columns: 0.10844018858214664
        # after adding the normalized columns:
        optimized_XGB.fit(X_train_mean, Y_train)

        xgb_params.update(optimized_XGB.best_params_)
        print('Best value: {0}'.format(optimized_XGB.best_params_))
        print('Best score: {0}'.format(-optimized_XGB.best_score_))
        print(xgb_params)

    Tuning log:

    Best value: {'n_estimators': 690}
    Best score: 0.1248743341047175
    Best value: {'learning_rate': 0.02}
    Best score: 0.11986767956401062
    Best value: {'max_depth': 4}
    Best score: 0.11970729128001024
    Best value: {'min_child_weight': 4}
    Best score: 0.1177363914351903
    Best value: {'gamma': 0.0}
    Best score: 0.1177363914351903
    Best value: {'subsample': 0.4}
    Best score: 0.1162276478005415
    Best value: {'colsample_bytree': 0.6}
    Best score: 0.1155322491989233
    Best value: {'reg_alpha': 0.5}
    Best score: 0.1149495825651894
    Best value: {'reg_lambda': 0.1}
    Best score: 0.11459993910716135
  • Resulting parameters

    xgb_params_without_mean = {'learning_rate': 0.02, 'n_estimators': 660, 'max_depth': 7, 'min_child_weight': 1, 'seed': 0, 'subsample': 0.1, 'colsample_bytree': 0.2, 'gamma': 0.0, 'reg_alpha': 0.2, 'reg_lambda': 0.3}
    xgb_params_with_mean = {'learning_rate': 0.02, 'n_estimators': 690, 'max_depth': 4, 'min_child_weight': 4, 'seed': 0, 'subsample': 0.4, 'colsample_bytree': 0.6, 'gamma': 0.0, 'reg_alpha': 0.5, 'reg_lambda': 0.1}
  • Train the model
    from sklearn.metrics import mean_squared_error

    def train_xgb(params, X_train, Y_train, X_test, Y_test):
        xgb_model = xgb.XGBRegressor(**params)
        xgb_model.fit(X_train, Y_train)
        Y_pred = xgb_model.predict(X_test)
        return mean_squared_error(Y_test, Y_pred)
  • Performance on the held-out split
    print("最原始的XGB:{}".format(
    train_xgb(xgb_params_without_mean,X_train, Y_train,X_test,Y_test)))
    print("增加标准化的XGB:{}".format(
    train_xgb(xgb_params_with_mean,X_train_mean, Y_train,X_test_mean,Y_test)))
    print("PCA标准化的XGB:{}".format(
    train_xgb(xgb_params_with_mean,X_train_mean_pca, Y_train, X_test_mean_pca,Y_test)))
    Output:
    Raw XGB: 0.09825279843671066
    XGB with normalized columns: 0.09208487929988615
    XGB with normalization + PCA: 0.10937159141681482
    We did not tune parameters separately for the PCA variant: PCA is mainly meant to guard against overfitting, and testing showed it actually hurts performance here.

LightGBM

  • Grid search tuning

    from lightgbm import LGBMRegressor

    LGBM_params = {'num_leaves': 50, 'max_depth': 13, 'learning_rate': 0.1,
                   'n_estimators': 400, 'min_child_weight': 1, 'subsample': 0.8,
                   'colsample_bytree': 0.8, 'nthread': 7, 'objective': 'regression'}
    back_params = {
        'n_estimators': [i for i in range(400, 900, 25)],
        'num_leaves': [i for i in range(10, 45, 5)],
        'max_depth': [i for i in range(3, 11)],
        'learning_rate': np.linspace(0.01, 0.2, 20),
        'min_child_weight': [i for i in range(1, 7)],
        'subsample': np.linspace(0.1, 0.9, 9),
        'colsample_bytree': np.linspace(0.1, 0.9, 9),
    }
    for param in back_params:
        temp_param = {param: back_params[param]}
        estimator = LGBMRegressor(**LGBM_params)
        optimized_LGBM = GridSearchCV(estimator, param_grid=temp_param,
                                      scoring='neg_mean_squared_error',
                                      cv=5, verbose=False, n_jobs=10)
        optimized_LGBM.fit(X_train_mean, Y_train)

        LGBM_params.update(optimized_LGBM.best_params_)
        print('Best value: {0}'.format(optimized_LGBM.best_params_))
        print('Best score: {0}'.format(-optimized_LGBM.best_score_))
        print(LGBM_params)

    Tuning log:

    Best value: {'n_estimators': 625}
    Best score: 0.11614724424510146
    Best value: {'num_leaves': 20}
    Best score: 0.11266928071626249
    Best value: {'max_depth': 5}
    Best score: 0.11318898749529319
    Best value: {'learning_rate': 0.1}
    Best score: 0.11318898749529319
    Best value: {'min_child_weight': 1}
    Best score: 0.11318898749529319
    Best value: {'subsample': 0.1}
    Best score: 0.11318898749529319
    Best value: {'colsample_bytree': 0.3}
    Best score: 0.11188844667277635
  • Resulting parameters

    lgbm_params_without_mean = {'num_leaves': 15, 'max_depth': 10, 'learning_rate': 0.03, 'n_estimators': 775, 'min_child_weight': 1, 'subsample': 0.1, 'colsample_bytree': 0.3, 'nthread': 7, 'objective': 'regression'}
    lgbm_params_with_mean = {'num_leaves': 20, 'max_depth': 5, 'learning_rate': 0.1, 'n_estimators': 625, 'min_child_weight': 1, 'subsample': 0.1, 'colsample_bytree': 0.3, 'nthread': 7, 'objective': 'regression'}
  • Train the model
    def train_LGBM(params, X_train, Y_train, X_test, Y_test):
        LGBM_model = LGBMRegressor(**params)
        LGBM_model.fit(X_train, Y_train)
        Y_pred = LGBM_model.predict(X_test)
        return mean_squared_error(Y_test, Y_pred)

    print("Raw LGBM: {}".format(
        train_LGBM(lgbm_params_without_mean, X_train, Y_train, X_test, Y_test)))
    print("LGBM with normalized columns: {}".format(
        train_LGBM(lgbm_params_with_mean, X_train_mean, Y_train, X_test_mean, Y_test)))
    print("LGBM with normalization + PCA: {}".format(
        train_LGBM(lgbm_params_without_mean, X_train_mean_pca, Y_train, X_test_mean_pca, Y_test)))
    Output:
    Raw LGBM: 0.09394115199163632
    LGBM with normalized columns: 0.09412533355830843
    LGBM with normalization + PCA: 0.12151567895565014

CatBoost

  • Grid search tuning
    from catboost import CatBoostRegressor

    cat_params = {'n_estimators': 82,
                  'depth': 5,
                  'learning_rate': 0.1,
                  'l2_leaf_reg': 3,
                  'loss_function': 'RMSE',
                  'logging_level': 'Silent'}

    back_params = {
        'n_estimators': [i for i in range(400, 900, 25)],
        'depth': [i for i in range(1, 10, 1)],
        'learning_rate': np.linspace(0.01, 0.2, 20),
        'l2_leaf_reg': [i for i in range(1, 5, 1)],
    }
    for param in back_params:
        temp_param = {param: back_params[param]}
        estimator = CatBoostRegressor(**cat_params)
        optimized_CAT = GridSearchCV(estimator, param_grid=temp_param,
                                     scoring='neg_mean_squared_error',
                                     cv=5, verbose=False, n_jobs=10)
        optimized_CAT.fit(X_train_mean, Y_train)

        cat_params.update(optimized_CAT.best_params_)
        print('Best value: {0}'.format(optimized_CAT.best_params_))
        print('Best score: {0}'.format(-optimized_CAT.best_score_))
        print(cat_params)
    Tuning log:
    Best value: {'n_estimators': 875}
    Best score: 0.11683246282153908
    Best value: {'depth': 4}
    Best score: 0.11546548200806038
    Best value: {'learning_rate': 0.08}
    Best score: 0.11471602626521735
    Best value: {'l2_leaf_reg': 3}
    Best score: 0.11471602626521735
  • Resulting parameters
    cat_params_without_mean = {'n_estimators': 875, 'depth': 4, 'learning_rate': 0.1, 'l2_leaf_reg': 3, 'loss_function': 'RMSE', 'logging_level': 'Silent'}
    cat_params_with_mean = {'n_estimators': 875, 'depth': 4, 'learning_rate': 0.08, 'l2_leaf_reg': 3, 'loss_function': 'RMSE', 'logging_level': 'Silent'}
  • Train the model
    def train_cat(params, X_train, Y_train, X_test, Y_test):
        cat_model = CatBoostRegressor(**params)
        cat_model.fit(X_train, Y_train)
        Y_pred = cat_model.predict(X_test)
        return mean_squared_error(Y_test, Y_pred)
  • Results
    print("最原始的CATboost:{}".format(
    train_cat(cat_params_without_mean,X_train, Y_train,X_test,Y_test)))
    print("增加标准化的CATboost:{}".format(
    train_cat(cat_params_with_mean,X_train_mean, Y_train,X_test_mean,Y_test)))
    print("PCA标准化的CATboost:{}".format(
    train_cat(cat_params_with_mean,X_train_mean_pca, Y_train, X_test_mean_pca,Y_test)))

So far, adding the normalized columns yields a slight gain in accuracy.

In scikit-learn, GradientBoostingClassifier is the GBDT class for classification and GradientBoostingRegressor the one for regression. The two take the same set of parameters, although some options differ, such as the admissible values of the loss parameter.
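
As a quick illustration of that shared interface, a minimal sketch (the exact loss names depend on the scikit-learn version; in the versions of that era the regressor accepted 'ls'/'lad'/'huber'/'quantile' while the classifier accepted 'deviance'/'exponential'):

    from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

    # The same boosting/tree parameters work for both classes;
    # only the loss options differ between regression and classification
    shared = {'learning_rate': 0.03, 'n_estimators': 100, 'max_depth': 3,
              'subsample': 0.8, 'random_state': 0}
    reg = GradientBoostingRegressor(loss='huber', **shared)
    clf = GradientBoostingClassifier(loss='deviance', **shared)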

GradientBoostingRegressor

  • Parameter tuning
    from sklearn.ensemble import GradientBoostingRegressor

    gbr_params = {'learning_rate': 0.03, 'loss': 'huber', 'max_depth': 3,
                  'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2,
                  'n_estimators': 100, 'random_state': 0, 'subsample': 0.8}
    back_params = {
        'max_depth': [i for i in range(5, 15, 1)],
        'n_estimators': [i for i in range(75, 500, 25)],
        'learning_rate': np.linspace(0.01, 0.1, 10),
        'subsample': np.linspace(0.01, 0.1, 10),
        'min_samples_leaf': [i for i in range(1, 15, 1)],
        'min_samples_split': [i for i in range(2, 42, 2)]
    }
    for param in back_params:
        temp_param = {param: back_params[param]}
        estimator = GradientBoostingRegressor(**gbr_params)
        optimized_gbr = GridSearchCV(estimator, param_grid=temp_param,
                                     scoring='neg_mean_squared_error',
                                     cv=5, verbose=False, n_jobs=30)
        optimized_gbr.fit(X_train_mean, Y_train)

        gbr_params.update(optimized_gbr.best_params_)
        print('Best value: {0}'.format(optimized_gbr.best_params_))
        print('Best score: {0}'.format(-optimized_gbr.best_score_))
        print(gbr_params)
    Output:
    Best value: {'max_depth': 6}
    Best score: 0.13130050103839636
    Best value: {'n_estimators': 475}
    Best score: 0.11793377395668483
    Best value: {'learning_rate': 0.020000000000000004}
    Best score: 0.1178641436354182
    Best value: {'subsample': 0.09000000000000001}
    Best score: 0.11920232050809215
    Best value: {'min_samples_leaf': 5}
    Best score: 0.11534924728998576
    Best value: {'min_samples_split': 2}
    Best score: 0.11534924728998576
  • Best parameters
    gbr_params_with_mean = {'learning_rate': 0.02, 'loss': 'huber', 'max_depth': 6, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 5, 'min_samples_split': 2, 'n_estimators': 475, 'random_state': 0, 'subsample': 0.09}
    gbr_params_from_tianchi = {'alpha':0.9, 'criterion':'friedman_mse', 'init':None,
    'learning_rate':0.03, 'loss':'huber', 'max_depth':14,
    'max_features':'sqrt', 'max_leaf_nodes':None,
    'min_impurity_decrease':0.0, 'min_impurity_split':None,
    'min_samples_leaf':10, 'min_samples_split':40,
    'min_weight_fraction_leaf':0.0, 'n_estimators':300,
    'presort':'auto', 'random_state':10, 'subsample':0.8,
    'verbose':0,'warm_start':False}
    One set is the one we just tuned; the other was shared on Tianchi.

The set shared on Tianchi gets a very low local MSE, only about 0.089, but performs much worse online, around 0.16.

  • Evaluate the models
    def train_gbr(params, X_train, Y_train, X_test, Y_test):
        gbr_model = GradientBoostingRegressor(**params)
        gbr_model.fit(X_train, Y_train)
        Y_pred = gbr_model.predict(X_test)
        return mean_squared_error(Y_test, Y_pred)

    print("Tianchi GBR: {}".format(
        train_gbr(gbr_params_from_tianchi, X_train, Y_train, X_test, Y_test)))
    print("Tuned GBR (normalized columns): {}".format(
        train_gbr(gbr_params_with_mean, X_train_mean, Y_train, X_test_mean, Y_test)))
    print("Tuned GBR (raw features): {}".format(
        train_gbr(gbr_params_with_mean, X_train, Y_train, X_test, Y_test)))
  • Output
    Tianchi GBR: 0.08918637637655073
    Tuned GBR (normalized columns): 0.0936784270923323
    Tuned GBR (raw features): 0.09358718807049533

Ensembling

Ensembling is one of those techniques that sounds almost like black magic; for an introduction, see the KAGGLE ENSEMBLING GUIDE. The brute-force search below looks for blending weights over the four models' held-out predictions.

# Y_pred_mean_xgb / _lgbm / _cat / _gbr are the four models' predictions
# on X_test_mean; search for blending weights by brute force
best = [0, 0, 0, 0, 10]  # [i, j, z, k, mse], seeded with a large mse
for i in np.linspace(0.1, 10, 100):
    for j in np.linspace(0.1, 10, 100):
        for z in np.linspace(0.1, 10, 100):
            for k in np.linspace(0.1, 10, 100):
                Y_predict_mix = (Y_pred_mean_xgb*i + Y_pred_mean_lgbm*j
                                 + Y_pred_mean_cat*z + Y_pred_mean_gbr*k
                                 ) / (i + j + z + k)
                temp_mse = mean_squared_error(Y_test, Y_predict_mix)
                if best[4] > temp_mse:  # the mse sits at index 4, not 3
                    best = [i, j, z, k, temp_mse]
print(best)
Y_predict_mix = (Y_pred_mean_xgb*best[0]
                 + Y_pred_mean_lgbm*best[1]
                 + Y_pred_mean_cat*best[2]
                 + Y_pred_mean_gbr*best[3]) / (best[0]+best[1]+best[2]+best[3])
print(mean_squared_error(Y_test, Y_predict_mix))
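
The four nested loops above evaluate 100^4 = 10^8 weight combinations, which takes a very long time. A faster alternative is to optimize the weights numerically; a sketch, assuming scipy is available and reusing the same four prediction arrays:

from scipy.optimize import minimize

preds = [Y_pred_mean_xgb, Y_pred_mean_lgbm, Y_pred_mean_cat, Y_pred_mean_gbr]

def mix_mse(w):
    # MSE of the weighted average of the four models' predictions
    blend = sum(wi * p for wi, p in zip(w, preds)) / sum(w)
    return mean_squared_error(Y_test, blend)

# start from equal weights; keep every weight strictly positive
res = minimize(mix_mse, x0=[1.0] * 4, bounds=[(1e-6, None)] * 4)
print(res.x / res.x.sum(), res.fun)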

This part still needs a closer check later.

  • Weighted predictions on the test set
    # xgb_model, LGBM_model, cat_model and gbr_model are assumed to have been
    # refit on the full normalized training set before predicting on test_mean
    test_pred_xgb = xgb_model.predict(test_mean)
    test_pred_lgbm = LGBM_model.predict(test_mean)
    test_pred_cat = cat_model.predict(test_mean)
    test_pred_gbr = gbr_model.predict(test_mean)

    test_pred_mix = ((test_pred_xgb*best[0] + test_pred_lgbm*best[1]
                      + test_pred_cat*best[2] + test_pred_gbr*best[3])
                     / (best[0] + best[1] + best[2] + best[3]))
  • Save the results
    with open("result_121018_LGBM_CAT_XGB_BGR.txt", "w") as f1:
        temp = "\n".join(str(v) for v in test_pred_mix.tolist())
        f1.write(temp)

Summary

Locally the MSE is down to just 0.088, but after submission the score was 0.1378, worse than XGB alone.

A reasonable guess is that the test data does not come from the same distribution as the training data.
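
One quick way to probe that guess is to compare per-feature distributions between train and test, for example with a two-sample Kolmogorov-Smirnov test. A minimal sketch, assuming scipy is available:

    from scipy.stats import ks_2samp

    # Indices of features whose train/test distributions differ markedly;
    # a tiny p-value suggests the feature is shifted between the two sets
    shifted = [i for i in range(source_data.shape[1])
               if ks_2samp(source_data[:, i], test_data[:, i]).pvalue < 0.01]
    print(shifted)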
