Tianchi · [Newcomer Competition] Industrial Steam Volume Prediction: Modeling Algorithms [Part 3]

Daily submission log

Shared data preprocessing

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

train_data_path = "data/zhengqi_train.txt"
test_data_path = "data/zhengqi_test.txt"

# Tab-separated files; the last column of the training file is the target
source = pd.read_table(train_data_path, sep='\t').values
test_data = pd.read_table(test_data_path, sep='\t').values

source_data = source[:, 0:-1]
source_target = source[:, -1]

# Min-max normalization to [0, 1] (the "_mean" suffix is kept from the original names)
source_mean = (source_data - np.min(source_data, axis=0)) / (np.max(source_data, axis=0) - np.min(source_data, axis=0))
source_data_and_mean = np.column_stack((source_data, source_mean))

# Note: the test set is scaled with its own min/max, not the training set's
test_mean = (test_data - np.min(test_data, axis=0)) / (np.max(test_data, axis=0) - np.min(test_data, axis=0))
test_data_and_mean = np.column_stack((test_data, test_mean))

# PCA
pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
pca.fit(source_data_and_mean)
source_mean_pca = pca.transform(source_data_and_mean)
test_mean_pca = pca.transform(test_data_and_mean)

# Split the data. Every split uses the same target and random_state,
# so Y_train and Y_test are identical across the four splits.
# Raw data
X_train, X_test, Y_train, Y_test = \
    train_test_split(source_data, source_target, test_size=0.2, random_state=40)

# Normalized data only
X_train_mean, X_test_mean, Y_train, Y_test = \
    train_test_split(source_mean, source_target, test_size=0.2, random_state=40)

# Raw data + normalized data
X_train_data_and_mean, X_test_data_and_mean, Y_train, Y_test = \
    train_test_split(source_data_and_mean, source_target, test_size=0.2, random_state=40)

# Raw data + normalized data, after PCA
X_train_mean_pca, X_test_mean_pca, Y_train, Y_test = \
    train_test_split(source_mean_pca, source_target, test_size=0.2, random_state=40)
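One detail of the preprocessing above is worth noting: the test set is scaled with its own minimum and maximum rather than the training set's, so the same raw value can map to different scaled values in the two sets. This may be one reason local and leaderboard scores diverge, though that is not verified here. A minimal sketch of the alternative, reusing the training statistics; the helper name `fit_min_max` and the toy arrays are made up for illustration:

```python
import numpy as np

def fit_min_max(train):
    """Return a function that scales data with the min/max of `train` (hypothetical helper)."""
    lo = np.min(train, axis=0)
    span = np.max(train, axis=0) - lo
    span = np.where(span == 0, 1.0, span)  # avoid division by zero on constant columns
    return lambda x: (x - lo) / span

# Toy data, not from the competition
train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]])
test = np.array([[1.0, 15.0], [5.0, 30.0]])

scale = fit_min_max(train)
train_scaled = scale(train)
test_scaled = scale(test)  # can fall outside [0, 1] if test exceeds the training range
```

scikit-learn's `MinMaxScaler` follows the same pattern: fit on the training data, then transform both sets.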

Feature selection by importance

  • Tuned parameters
    xgb_params_with_mean = {'learning_rate': 0.02, 'n_estimators': 690, 'max_depth': 4, 'min_child_weight': 4, 'seed': 0, 'subsample': 0.4, 'colsample_bytree': 0.6, 'gamma': 0.0, 'reg_alpha': 0.5, 'reg_lambda': 0.1}
  • Check the MSE on the training split and the held-out test split

    import xgboost as xgb
    from sklearn.metrics import mean_squared_error

    xgb_model = xgb.XGBRegressor(**xgb_params_with_mean)
    xgb_model.fit(X_train_data_and_mean, Y_train)
    # MSE on the training split
    Y_train_pred = xgb_model.predict(X_train_data_and_mean)
    print("Training MSE: {}".format(mean_squared_error(Y_train, Y_train_pred)))
    # MSE on the held-out test split
    Y_test_pred = xgb_model.predict(X_test_data_and_mean)
    print("Test MSE: {}".format(mean_squared_error(Y_test, Y_test_pred)))

    Output:

    Training MSE: 0.039446435562505315
    Test MSE: 0.09208487929988615
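  • Inspect feature importance

    # 'weight' counts how many times each feature is used in a split across all trees
    data_weight = xgb_model.get_booster().get_score(importance_type='weight')
    sort_weight = sorted(data_weight.items(), key=lambda k: k[1], reverse=True)
    important = []
    for i in sort_weight:
        if i[1] > 80:
            # Features are named 'f0', 'f1', ... when fit on a NumPy array;
            # strip the leading 'f' to recover the column index
            important.append(int(i[0][1:]))

    The weight threshold of 80 can be adjusted.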
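    To make the filtering step concrete, here is a small sketch using a made-up score dictionary shaped like what `get_score` returns for a model fit on a NumPy array (keys `'f0'`, `'f1'`, ...). The scores and the helper name `select_features` are illustrative assumptions, not values from the real model:

```python
def select_features(scores, threshold):
    """Return column indices whose score exceeds `threshold`, highest score first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [int(name[1:]) for name, score in ranked if score > threshold]

# Made-up split counts, for illustration only
example_scores = {"f0": 150, "f3": 95, "f12": 80, "f40": 12}

important_80 = select_features(example_scores, 80)   # strict '>' excludes f12
important_10 = select_features(example_scores, 10)   # a lower threshold keeps more columns
```

    Features that are never used in a split do not appear in the dictionary at all, so they are dropped regardless of the threshold.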

  • After selecting by weight

    # Refit using only the selected important features
    xgb_model.fit(X_train_data_and_mean[:, important], Y_train)
    # MSE on the training split
    Y_train_pred = xgb_model.predict(X_train_data_and_mean[:, important])
    print("Training MSE: {}".format(mean_squared_error(Y_train, Y_train_pred)))
    # MSE on the held-out test split
    Y_test_pred = xgb_model.predict(X_test_data_and_mean[:, important])
    print("Test MSE: {}".format(mean_squared_error(Y_test, Y_test_pred)))

    Output:

    Training MSE: 0.04120352010364791
    Test MSE: 0.09097951734421819

    The test MSE is about 0.09.

  • Submit

    Y_pred_mean_xgb_important = xgb_model.predict(test_data_and_mean[:, important])
    # Write one prediction per line
    with open("result_121217_important_xgb.txt", "w") as f1:
        temp = "\n".join(str(v) for v in Y_pred_mean_xgb_important.tolist())
        f1.write(temp)

    Surprisingly, the score was 0.3614 ...

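Tree depth of 2

The main problem right now is obvious overfitting: the training MSE is far lower than the test MSE.

Keep only the normalized data and drop the raw data.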

xgb_params = {'learning_rate': 0.01, 'n_estimators': 700,
              'max_depth': 2, 'min_child_weight': 1,
              'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8,
              'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}

xgb_model = xgb.XGBRegressor(**xgb_params)
xgb_model.fit(X_train_mean, Y_train)
# MSE on the training split
Y_train_pred = xgb_model.predict(X_train_mean)
print("Training MSE: {}".format(mean_squared_error(Y_train, Y_train_pred)))
# MSE on the held-out test split
Y_test_pred = xgb_model.predict(X_test_mean)
print("Test MSE: {}".format(mean_squared_error(Y_test, Y_test_pred)))

Output:

Training MSE: 0.10528870568531111
Test MSE: 0.1022618059668572

The tree depth is reduced in the hope of improving generalization.
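The effect of the shallower trees can be read off the numbers already reported. A quick calculation of the test-to-train MSE ratio for the two runs, with values copied from the outputs above; the helper name `mse_ratio` is just for illustration, and note that the two runs also differ in their input features and other parameters, not only in depth:

```python
def mse_ratio(train_mse, test_mse):
    """Ratio of held-out MSE to training MSE; values well above 1 suggest overfitting."""
    return test_mse / train_mse

# max_depth=4, raw + normalized features
ratio_depth4 = mse_ratio(0.039446435562505315, 0.09208487929988615)
# max_depth=2, normalized features only
ratio_depth2 = mse_ratio(0.10528870568531111, 0.1022618059668572)

print("depth 4: {:.2f}, depth 2: {:.2f}".format(ratio_depth4, ratio_depth2))
```

The ratio drops from roughly 2.3 to just under 1, so the gap closes, but the held-out MSE itself rises from about 0.092 to about 0.102. That suggests the shallower model removed the overfitting without gaining accuracy on this split.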

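  • Submit a first version in the morning to see how it does
    # Note: the model was fit on the normalized features (X_train_mean), so test_mean
    # would be the matching input; test_data is what this submission actually used
    Y_pred_xgb = xgb_model.predict(test_data)
    with open("result_121309_xgb.txt", "w") as f1:
        temp = "\n".join(str(v) for v in Y_pred_xgb.tolist())
        f1.write(temp)

The score was 0.1907 ...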

Default parameters with normalized data

Busy today, so this is a quick throwaway version.

# All hyperparameters left at their defaults
xgb_model = xgb.XGBRegressor()
xgb_model.fit(X_train_mean, Y_train)
# MSE on the training split
Y_train_pred = xgb_model.predict(X_train_mean)
print("Training MSE: {}".format(mean_squared_error(Y_train, Y_train_pred)))
# MSE on the held-out test split
Y_test_pred = xgb_model.predict(X_test_mean)
print("Test MSE: {}".format(mean_squared_error(Y_test, Y_test_pred)))

Output:

Training MSE: 0.07184510530700054
Test MSE: 0.09946644428163588

The score was 0.1905. It feels like things keep getting worse ...

https://iii.run/archives/6ae58ea94251.html

Author: mmmwhy

Published: 2018-12-11

Updated: 2022-10-30
