TensorFlow feature engineering: feature_column and its parsing methods

Feature engineering with feature_column

Introduction

As a competent TFboy, you not only need to be fluent with the low-level APIs,

but also need to know how to use high-level APIs such as Estimator.

The high-level APIs come with the advantages of good encapsulation, automatic checkpoint saving, automatic model saving, and an automatically configured Adagrad optimizer.

They also let you process the input features directly with feature_column.
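For context, here is a minimal sketch of wiring feature columns into an Estimator. The column names, sample data, and hidden_units below are made up for illustration and are not from the original post:

import tensorflow as tf

# hypothetical columns for illustration
age = tf.feature_column.numeric_column('age')
color = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list('color', ['R', 'G', 'B']))

# DNNClassifier handles checkpoints, model export and the (Adagrad) optimizer by itself
est = tf.estimator.DNNClassifier(feature_columns=[age, color], hidden_units=[8])

def input_fn():
    features = {'age': [10.0, 40.0, 65.0], 'color': ['R', 'G', 'B']}
    labels = [0, 1, 1]
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(3)

est.train(input_fn, steps=10)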

Categorical data

Ways to handle categorical data include:

  • one_hot
  • categorical_column_with_hash_bucket (hash bucketing)
  • crossed_column (crossed feature bucketing)

one_hot

feature_column.indicator_column

import tensorflow as tf
from tensorflow import feature_column
from tensorflow.python.feature_column.feature_column import _LazyBuilder

color_data = {'color': ['R', 'B', 'G', 'A', 'A']}  # 5 sample rows
builder = _LazyBuilder(color_data)
color_column = feature_column.categorical_column_with_vocabulary_list(
    'color', ['R', 'G', 'B'], dtype=tf.string, default_value=-1
)
color_column_tensor = color_column._get_sparse_tensors(builder)

# Convert the sparse ids into a dense one-hot (strictly speaking, multi-hot) tensor
color_column_identy = feature_column.indicator_column(color_column)
color_dense_tensor = feature_column.input_layer(color_data, [color_column_identy])

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    print(session.run([color_column_tensor.id_tensor]))
    print('use input_layer' + '_' * 40)
    print(session.run([color_dense_tensor]))

Output:
[SparseTensorValue(indices=array([[0, 0],
[1, 0],
[2, 0],
[3, 0],
[4, 0]]), values=array([ 0, 2, 1, -1, -1]), dense_shape=array([5, 1]))]
use input_layer________________________________________
[array([[1., 0., 0.],
[0., 0., 1.],
[0., 1., 0.],
[0., 0., 0.],
[0., 0., 0.]], dtype=float32)]

categorical_column_with_identity

Some features, such as IDs, look numeric but are actually categorical.

color_data = {'color': [1, 2, 3, 5, 1]}
builder = _LazyBuilder(color_data)
color_column = feature_column.categorical_column_with_identity(
    key='color', num_buckets=4, default_value=0)
color_column_tensor = color_column._get_sparse_tensors(builder)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    print(sess.run(color_column_tensor.id_tensor))

Output below. num_buckets is the number of categories (valid ids are in [0, num_buckets)); default_value is used for any value outside that range, which is why 5 gets mapped to 0.

SparseTensorValue(indices=array([[0, 0],
[1, 0],
[2, 0],
[3, 0],
[4, 0]]), values=array([1, 2, 3, 0, 1]), dense_shape=array([5, 1]))
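As with the vocabulary column earlier, the identity column can be wrapped in indicator_column and fed to input_layer to get the dense one-hot form. A small sketch reusing the data above:

color_column_identy = feature_column.indicator_column(color_column)
color_dense_tensor = feature_column.input_layer(color_data, [color_column_identy])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    print(sess.run(color_dense_tensor))  # shape [5, 4], a single 1 per row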

categorical_column_with_hash_bucket

When there are a very large number of categories but most of them are rarely (or never) used, hash bucketing is the more convenient option.

dpt = tf.feature_column.categorical_column_with_hash_bucket(
    'dpt', hash_bucket_size=400)

The bucket that each value maps to can be reproduced with string_to_hash_bucket_fast, applied to the raw 'dpt' strings rather than to the column object:

sess.run(tf.string_to_hash_bucket_fast(dpt_values, 400))  # dpt_values: a list/tensor of the raw 'dpt' strings (hypothetical name)

https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_hash_bucket
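A self-contained sketch (the department strings below are made up) that shows which bucket each raw value lands in, and that the same ids come out of string_to_hash_bucket_fast:

dpt_data = {'dpt': ['sales', 'engineering', 'hr', 'sales']}  # hypothetical values
dpt_column = tf.feature_column.categorical_column_with_hash_bucket('dpt', hash_bucket_size=400)
dpt_one_hot = tf.feature_column.indicator_column(dpt_column)
dpt_dense = tf.feature_column.input_layer(dpt_data, [dpt_one_hot])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    print(sess.run(dpt_dense).argmax(axis=1))                             # bucket id per row
    print(sess.run(tf.string_to_hash_bucket_fast(dpt_data['dpt'], 400)))  # same ids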

crossed_column

tf.feature_column.crossed_column(['dpt', 'arr'], hash_bucket_size=100000)

The resulting mapping can be inspected via input_layer:

def get_hash(num1, num2, bucket_size=100000):
    # arr / dpt are the feature key names (strings); cross_arr / cross_dpt hold the raw string values
    arr_temp = {arr: tf.constant(cross_arr[num1:num2], tf.string)}
    dpt_temp = {dpt: tf.constant(cross_dpt[num1:num2], tf.string)}
    arr_x_dpt = dict(arr_temp, **dpt_temp)
    # crossed column: hash each (arr, dpt) pair into bucket_size buckets
    # (bucket_size stands in for self.crossed_column_bucket_size[i] from the original class context)
    crossed_sn_raw = tf.feature_column.crossed_column([arr, dpt], hash_bucket_size=bucket_size)
    crossed_sn = tf.feature_column.indicator_column(crossed_sn_raw)
    layer_sn = tf.feature_column.input_layer(arr_x_dpt, crossed_sn)

    with tf.Session() as session:
        init = tf.global_variables_initializer()
        session.run(init)
        # argmax over each one-hot row recovers the bucket id of the (arr, dpt) cross
        res = session.run(layer_sn).argmax(axis=1)
    return res
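A hypothetical driver for the function above, assuming arr / dpt are the feature key names and cross_arr / cross_dpt hold the raw string values (all of these are stand-ins, not from the original code):

arr, dpt = 'arr', 'dpt'                 # feature key names (assumed)
cross_arr = ['SH', 'BJ', 'SZ', 'SH']    # hypothetical sample values
cross_dpt = ['A1', 'A1', 'B2', 'B2']
print(get_hash(0, 4, bucket_size=100))  # bucket id of each (arr, dpt) cross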

Continuous data

For continuous data the options are:

Use it directly:

tf.feature_column.numeric_column("age")

Or bucketize it by CDF (quantiles):

age = tf.feature_column.bucketized_column(
    age, boundaries=[1.0, 10.0, 50.0, 75.0])
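A small runnable sketch (the sample ages are made up) showing the resulting one-hot buckets:

age_data = {'age': [3.0, 25.0, 60.0, 80.0]}  # hypothetical samples
age_numeric = tf.feature_column.numeric_column('age')
age_bucket = tf.feature_column.bucketized_column(
    age_numeric, boundaries=[1.0, 10.0, 50.0, 75.0])
age_dense = tf.feature_column.input_layer(age_data, [age_bucket])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # 5 buckets: (-inf, 1), [1, 10), [10, 50), [50, 75), [75, +inf)
    print(sess.run(age_dense))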

In general, the boundaries are determined from quantiles of the feature values:

import numpy as np

# slice_num: desired number of buckets; f_values: the raw feature values
percentiles = np.linspace(100 / slice_num, 100 - 100 / slice_num, slice_num - 1)
thresholds = np.percentile(np.array(f_values), percentiles, interpolation='lower')

which yields slice_num - 1 thresholds that split the feature into (approximately) equal-frequency buckets.
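For example, here is a hedged sketch (slice_num and f_values below are illustrative) that feeds the computed quantiles straight into bucketized_column:

import numpy as np
import tensorflow as tf

f_values = [1, 2, 2, 3, 5, 8, 13, 21, 34, 55]  # hypothetical raw feature values
slice_num = 4                                   # desired number of buckets (assumed)
percentiles = np.linspace(100 / slice_num, 100 - 100 / slice_num, slice_num - 1)
thresholds = np.percentile(np.array(f_values), percentiles, interpolation='lower')

value_column = tf.feature_column.numeric_column('value')
value_bucket = tf.feature_column.bucketized_column(
    value_column,
    boundaries=sorted(set(thresholds.tolist())))  # dedupe: boundaries must be strictly increasing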

Reference

tensorflow 特征工程:feature_column 及对应解析方法

https://iii.run/archives/a023a02dcb7c.html

Author: mmmwhy | Published: 2018-11-05 | Updated: 2022-10-08
