Tianchi · [Beginner Competition] Industrial Steam Volume Prediction: Modeling Algorithms (Part 3)


Daily Submission Log

Unified Data Preprocessing

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

train_data_path = "data/zhengqi_train.txt"
test_data_path = "data/zhengqi_test.txt"

source = pd.read_table(train_data_path, sep='\t').values
test_data = pd.read_table(test_data_path, sep='\t').values

source_data = source[:, 0:-1]
source_target = source[:, -1]

# Min-max normalization (despite the "_mean" suffix, this is min-max scaling).
# Note that the test set is rescaled with its own min/max here, not the training set's.
source_mean = (source_data - np.min(source_data, axis=0)) / (np.max(source_data, axis=0) - np.min(source_data, axis=0))
source_data_and_mean = np.column_stack((source_data, source_mean))

test_mean = (test_data - np.min(test_data, axis=0)) / (np.max(test_data, axis=0) - np.min(test_data, axis=0))
test_data_and_mean = np.column_stack((test_data, test_mean))

# PCA: keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
pca.fit(source_data_and_mean)
source_mean_pca = pca.transform(source_data_and_mean)
test_mean_pca = pca.transform(test_data_and_mean)

# Train/test splits. The same random_state=40 is used everywhere,
# so Y_train / Y_test are identical across the four splits.
# Raw features
X_train, X_test, Y_train, Y_test = \
    train_test_split(source_data, source_target, test_size=0.2, random_state=40)

# Normalized features only
X_train_mean, X_test_mean, Y_train, Y_test = \
    train_test_split(source_mean, source_target, test_size=0.2, random_state=40)

# Raw + normalized features
X_train_data_and_mean, X_test_data_and_mean, Y_train, Y_test = \
    train_test_split(source_data_and_mean, source_target, test_size=0.2, random_state=40)

# PCA of raw + normalized features
X_train_mean_pca, X_test_mean_pca, Y_train, Y_test = \
    train_test_split(source_mean_pca, source_target, test_size=0.2, random_state=40)
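
One thing worth noting about the normalization above: the test set is rescaled with its own min/max rather than the statistics learned from the training data. A minimal alternative sketch (my addition, not part of the original pipeline), assuming sklearn's MinMaxScaler, fits the scaler on the training features only and also reports how many components the 0.95 variance threshold actually keeps:

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training features only, then reuse the same
# min/max statistics to transform the test features.
scaler = MinMaxScaler()
source_mean_alt = scaler.fit_transform(source_data)
test_mean_alt = scaler.transform(test_data)

# Refit PCA on raw + rescaled training features with the same 0.95
# variance threshold and inspect how many components survive.
pca_check = PCA(n_components=0.95)
pca_check.fit(np.column_stack((source_data, source_mean_alt)))
print("components kept:", pca_check.n_components_)
print("explained variance: {:.4f}".format(pca_check.explained_variance_ratio_.sum()))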

Filtering Features by Importance

xgb_params_with_mean = {'learning_rate': 0.02, 'n_estimators': 690,
                        'max_depth': 4, 'min_child_weight': 4,
                        'seed': 0, 'subsample': 0.4, 'colsample_bytree': 0.6,
                        'gamma': 0.0, 'reg_alpha': 0.5, 'reg_lambda': 0.1}
xgb_model = xgb.XGBRegressor(**xgb_params_with_mean)
xgb_model.fit(X_train_data_and_mean, Y_train)
## Training accuracy
Y_train_pred = xgb_model.predict(X_train_data_and_mean)
print("Training set MSE: {}".format(mean_squared_error(Y_train, Y_train_pred)))
## Test accuracy
Y_test_pred = xgb_model.predict(X_test_data_and_mean)
print("Test set MSE: {}".format(mean_squared_error(Y_test, Y_test_pred)))

Output:

Training set MSE: 0.039446435562505315
Test set MSE: 0.09208487929988615

Next, rank the features by XGBoost's "weight" importance (how many times a feature is used to split across all trees) and keep the columns above a threshold:

data_weight = xgb_model.get_booster().get_score(importance_type='weight')
sort_weight = sorted(data_weight.items(), key=lambda k: k[1], reverse=True)
important = []
for i in sort_weight:
    if i[1] > 80:
        # Feature names come back as 'f0', 'f1', ...; strip the leading 'f'
        # to recover the column index.
        important.append(int(i[0][1:]))

The weight threshold of 80 is tunable.
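
One way to pick it, sketched below (the eval_threshold helper and the candidate values are my own, not from the original post): refit the same model on the columns that survive each threshold and compare the test MSE.

# Hypothetical helper: evaluate one weight threshold on the existing split.
def eval_threshold(threshold):
    cols = [int(name[1:]) for name, w in sort_weight if w > threshold]
    model = xgb.XGBRegressor(**xgb_params_with_mean)
    model.fit(X_train_data_and_mean[:, cols], Y_train)
    pred = model.predict(X_test_data_and_mean[:, cols])
    return len(cols), mean_squared_error(Y_test, pred)

for threshold in (40, 60, 80, 100, 120):
    n_cols, mse = eval_threshold(threshold)
    print("threshold={}: {} features, test MSE={:.4f}".format(threshold, n_cols, mse))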

# Refit on the selected columns only
xgb_model.fit(X_train_data_and_mean[:, important], Y_train)
## Training accuracy
Y_train_pred = xgb_model.predict(X_train_data_and_mean[:, important])
print("Training set MSE: {}".format(mean_squared_error(Y_train, Y_train_pred)))
## Test accuracy
Y_test_pred = xgb_model.predict(X_test_data_and_mean[:, important])
print("Test set MSE: {}".format(mean_squared_error(Y_test, Y_test_pred)))

Output:

Training set MSE: 0.04120352010364791
Test set MSE: 0.09097951734421819

The local test MSE is about 0.09, so generate a submission with this model:

# Predict on the full test set and write one value per line as the submission file.
Y_pred_mean_xgb_important = xgb_model.predict(test_data_and_mean[:, important])
with open("result_121217_important_xgb.txt", "w") as f1:
    temp = "\n".join(str(v) for v in Y_pred_mean_xgb_important.tolist())
    f1.write(temp)

The leaderboard score turned out to be 0.3614...

Tree Depth of 2

The main problem right now is clear overfitting: the training MSE is far smaller than the test MSE.
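
A single 80/20 split can also give an optimistic picture. A minimal cross-validation sketch (my addition, using sklearn's cross_val_score) gives a steadier estimate of the generalization error:

from sklearn.model_selection import cross_val_score

# 5-fold CV on the normalized features; sklearn reports negative MSE,
# so flip the sign to get MSE per fold.
cv_model = xgb.XGBRegressor(**xgb_params_with_mean)
cv_mse = -cross_val_score(cv_model, source_mean, source_target,
                          scoring='neg_mean_squared_error', cv=5)
print("CV MSE per fold:", cv_mse)
print("mean CV MSE: {:.4f}".format(cv_mse.mean()))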

Keep only the normalized features and drop the raw ones.

xgb_params = {'learning_rate': 0.01, 'n_estimators': 700, 
              'max_depth': 2, 'min_child_weight': 1, 
              'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8, 
              'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}

xgb_model = xgb.XGBRegressor(**xgb_params)
xgb_model.fit(X_train_mean, Y_train)
## Training accuracy
Y_train_pred = xgb_model.predict(X_train_mean)
print("Training set MSE: {}".format(mean_squared_error(Y_train, Y_train_pred)))
## Test accuracy
Y_test_pred = xgb_model.predict(X_test_mean)
print("Test set MSE: {}".format(mean_squared_error(Y_test, Y_test_pred)))

Output:

Training set MSE: 0.10528870568531111
Test set MSE: 0.1022618059668572

The tree depth was reduced in the hope of improving generalization.
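
Instead of hand-picking the depth, a small grid search (a sketch I added, using sklearn's GridSearchCV with the same negative-MSE scoring) could choose max_depth and learning_rate together on the normalized features:

from sklearn.model_selection import GridSearchCV

# Cross-validated search over a few depths and learning rates.
param_grid = {'max_depth': [2, 3, 4], 'learning_rate': [0.01, 0.02, 0.05]}
search = GridSearchCV(
    xgb.XGBRegressor(n_estimators=700, subsample=0.8, colsample_bytree=0.8),
    param_grid, scoring='neg_mean_squared_error', cv=5)
search.fit(source_mean, source_target)
print("best params:", search.best_params_)
print("best CV MSE: {:.4f}".format(-search.best_score_))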

# Predict on the normalized test features, since the model was trained on X_train_mean.
Y_pred_xgb = xgb_model.predict(test_mean)
with open("result_121309_xgb.txt", "w") as f1:
    temp = "\n".join(str(v) for v in Y_pred_xgb.tolist())
    f1.write(temp)

Leaderboard score: 0.1907...

Default Parameters on Normalized Data

I was fairly busy, so this is just a quick version.

xgb_model = xgb.XGBRegressor()
xgb_model.fit(X_train_mean, Y_train)
## Training accuracy
Y_train_pred = xgb_model.predict(X_train_mean)
print("Training set MSE: {}".format(mean_squared_error(Y_train, Y_train_pred)))
## Test accuracy
Y_test_pred = xgb_model.predict(X_test_mean)
print("Test set MSE: {}".format(mean_squared_error(Y_test, Y_test_pred)))

Training set MSE: 0.07184510530700054
Test set MSE: 0.09946644428163588

Leaderboard score: 0.1905. It feels like this is going nowhere...
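
The training MSE keeps coming in well below the local test MSE, and both sit well below the leaderboard score. One more low-effort thing to try (my addition, not from the post; it assumes xgboost 1.6+, where early_stopping_rounds is a constructor argument) is early stopping on the held-out split, so the number of trees is chosen automatically instead of fixed up front:

# Train with a large tree budget and stop once the held-out RMSE has not
# improved for 50 rounds; best_iteration reports where training stopped.
es_model = xgb.XGBRegressor(n_estimators=2000, learning_rate=0.05,
                            max_depth=2, early_stopping_rounds=50)
es_model.fit(X_train_mean, Y_train,
             eval_set=[(X_test_mean, Y_test)], verbose=False)
print("best iteration:", es_model.best_iteration)
print("test MSE: {:.4f}".format(
    mean_squared_error(Y_test, es_model.predict(X_test_mean))))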
