Tianchi · [Beginner Competition] Industrial Steam Volume Prediction Modeling Algorithm [Part 3]
Daily submission log
Unified data processing
    train_data_path = "data/zhengqi_train.txt"
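The processing code is cut off here, so the exact steps are not shown. Judging only from the variable names used later (`X_train_data_and_mean`, `X_train_mean`) and the later remark about keeping only the standardized data, one plausible reading is that the raw features are standardized and then stacked next to the raw columns. The sketch below illustrates that reading on toy data. The function name, the toy arrays, and the column layout are all assumptions, not the author's code.

```python
import numpy as np

def standardize_with_train_stats(train, other):
    """Z-score both arrays using the mean and std of `train` only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (train - mean) / std, (other - mean) / std

# Toy stand-in data (hypothetical); the real features come from zhengqi_train.txt
rng = np.random.default_rng(0)
raw_train = rng.normal(loc=3.0, scale=2.0, size=(100, 5))
raw_test = rng.normal(loc=3.0, scale=2.0, size=(20, 5))

train_std, test_std = standardize_with_train_stats(raw_train, raw_test)

# Assumed "data_and_mean" layout: raw columns followed by standardized columns
train_data_and_mean = np.hstack([raw_train, train_std])
test_data_and_mean = np.hstack([raw_test, test_std])
```

Using the training statistics for both arrays keeps the test data on the same scale the model was fitted on.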
Feature selection by importance
- Tuned parameters
    xgb_params_with_mean = {'learning_rate': 0.02, 'n_estimators': 690,
                            'max_depth': 4, 'min_child_weight': 4,
                            'seed': 0, 'subsample': 0.4, 'colsample_bytree': 0.6,
                            'gamma': 0.0, 'reg_alpha': 0.5, 'reg_lambda': 0.1}
Check the MSE on the training set and the test set
    import xgboost as xgb
    from sklearn.metrics import mean_squared_error

    xgb_model = xgb.XGBRegressor(**xgb_params_with_mean)
    xgb_model.fit(X_train_data_and_mean, Y_train)
    # Error on the training set
    Y_train_pred = xgb_model.predict(X_train_data_and_mean)
    print("Training MSE: {}".format(mean_squared_error(Y_train, Y_train_pred)))
    # Error on the held-out test set
    Y_test_pred = xgb_model.predict(X_test_data_and_mean)
    print("Test MSE: {}".format(mean_squared_error(Y_test, Y_test_pred)))
Output:
    Training MSE: 0.039446435562505315
    Test MSE: 0.09208487929988615
Inspect feature importance
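    # 'weight' counts how many times each feature is used in a split across all trees
    data_weight = xgb_model.get_booster().get_score(importance_type='weight')
    sort_weight = sorted(data_weight.items(), key=lambda k: k[1], reverse=True)
    important = []
    for i in sort_weight:
        if i[1] > 80:
            # Feature names look like "f12"; drop the "f" to get the column index
            important.append(int(i[0][1:]))
The weight threshold of 80 can be adjusted.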
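Since the threshold is a free choice, it can help to see how the selected column list changes with it. Here is a minimal sketch of the same filtering step as a reusable function. The name `select_by_weight` and the example scores are made up for illustration. The `f<index>` keys are the default feature names XGBoost assigns when the model is fitted on a plain array.

```python
def select_by_weight(scores, threshold=80):
    """Return column indices whose weight exceeds `threshold`, highest first.

    `scores` maps default XGBoost feature names ("f0", "f1", ...) to weights.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [int(name[1:]) for name, weight in ranked if weight > threshold]

# Hypothetical scores, shaped like get_score(importance_type='weight') output
example_scores = {"f0": 150, "f3": 42, "f7": 95, "f12": 81, "f5": 80}

for t in (40, 80, 100):
    print(t, select_by_weight(example_scores, threshold=t))
```

Note that the comparison is strict, so a feature with a weight of exactly 80 is dropped at the default threshold. Features never used in any split do not appear in the score dictionary at all.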
After selecting features by weight
    # Refit using only the selected columns
    xgb_model.fit(X_train_data_and_mean[:, important], Y_train)
    # Error on the training set
    Y_train_pred = xgb_model.predict(X_train_data_and_mean[:, important])
    print("Training MSE: {}".format(mean_squared_error(Y_train, Y_train_pred)))
    # Error on the held-out test set
    Y_test_pred = xgb_model.predict(X_test_data_and_mean[:, important])
    print("Test MSE: {}".format(mean_squared_error(Y_test, Y_test_pred)))
Output:
    Training MSE: 0.04120352010364791
    Test MSE: 0.09097951734421819
The test-set MSE is about 0.09.
Submission
    Y_pred_mean_xgb_important = xgb_model.predict(test_data_and_mean[:, important])
    # Write one prediction per line
    with open("result_121217_important_xgb.txt", "w") as f1:
        temp = "\n".join(str(v) for v in Y_pred_mean_xgb_important.tolist())
        f1.write(temp)
Surprisingly, the score came back as 0.3614 ...
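When the online score is far from the local MSE, one cheap thing to rule out is the submission file itself. The sketch below writes predictions in the same one-value-per-line format and reads them back to confirm the row count and values. It does not explain the 0.3614 score; it only checks the file. The helper names, the file name, and the sample predictions are hypothetical.

```python
import os
import tempfile

def write_predictions(path, predictions):
    """Write one prediction per line, without a trailing newline."""
    with open(path, "w") as f:
        f.write("\n".join(str(v) for v in predictions))

def read_predictions(path):
    """Read the file back as floats, skipping blank lines."""
    with open(path) as f:
        return [float(line) for line in f.read().splitlines() if line.strip()]

# Hypothetical predictions standing in for the model output
preds = [0.123, -0.456, 1.5]
path = os.path.join(tempfile.mkdtemp(), "result_check.txt")
write_predictions(path, preds)
loaded = read_predictions(path)

# The row count should match the number of rows in the test set
print(len(loaded), len(preds))
```

In practice `preds` would be `Y_pred_mean_xgb_important.tolist()`, and the expected length would be the number of rows in the competition test file.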
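Tree depth of 2
The main problem right now is pronounced overfitting: the training-set MSE is far lower than the test-set MSE.
Keep only the standardized data and drop the raw data.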
    xgb_params = {'learning_rate': 0.01, 'n_estimators': 700,
                  'max_depth': 2, 'min_child_weight': 1,
                  'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8,
                  'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
    xgb_model = xgb.XGBRegressor(**xgb_params)
    xgb_model.fit(X_train_mean, Y_train)
    # Error on the training set
    Y_train_pred = xgb_model.predict(X_train_mean)
    print("Training MSE: {}".format(mean_squared_error(Y_train, Y_train_pred)))
    # Error on the held-out test set
    Y_test_pred = xgb_model.predict(X_test_mean)
    print("Test MSE: {}".format(mean_squared_error(Y_test, Y_test_pred)))
Output:
    Training MSE: 0.10528870568531111
    Test MSE: 0.1022618059668572
The tree depth was reduced in the hope of improving generalization.
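To put a number on the overfitting gap, the test-to-train MSE ratio can be computed for each run reported so far. This is just arithmetic on the figures printed above. The run labels and the `gap_ratio` helper are my own naming.

```python
# (training MSE, test MSE) copied from the runs reported above
runs = {
    "depth 4, all features": (0.039446435562505315, 0.09208487929988615),
    "depth 4, weight > 80": (0.04120352010364791, 0.09097951734421819),
    "depth 2, standardized only": (0.10528870568531111, 0.1022618059668572),
}

def gap_ratio(train_mse, test_mse):
    """Ratio well above 1 means the model fits the training set much better."""
    return test_mse / train_mse

ratios = {name: gap_ratio(*mse) for name, mse in runs.items()}
for name, r in ratios.items():
    print(f"{name}: test/train = {r:.2f}")
```

The two depth-4 runs have a ratio above 2, while the depth-2 run is slightly below 1. So the gap has closed, although the local test MSE itself is a little higher than before.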
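- Submit a first version in the morning to see how it does
    # Note: test_data needs the same standardization as X_train_mean
    Y_pred_xgb = xgb_model.predict(test_data)
    with open("result_121309_xgb.txt", "w") as f1:
        temp = "\n".join(str(v) for v in Y_pred_xgb.tolist())
        f1.write(temp)
Score: 0.1907 ...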
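Default parameters with standardized data
Fairly busy, so this is just a quick version.
    xgb_model = xgb.XGBRegressor()
Output:
    Training MSE: 0.07184510530700054
    Test MSE: 0.09946644428163588
Result: 0.1905. It feels like things are getting worse and worse ...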