TensorFlow feature engineering: feature_column and how to inspect what it produces
in Tensorflow with 0 comment

Feature engineering with feature_column

Introduction

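A competent "TFboy" needs to be fluent not only in the low-level API but also in high-level APIs such as estimator.

The high-level APIs are well encapsulated, save checkpoints and the model automatically, and set up an optimizer for you (canned estimators such as DNNClassifier default to Adagrad).

They can also use feature_column directly to process the input features.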

Categorical data

Categorical data can be handled in the following ways:

one_hot

feature_column.indicator_column

import tensorflow as tf
from tensorflow import feature_column
# _LazyBuilder is a private TF 1.x helper; this import path may differ between versions
from tensorflow.python.feature_column.feature_column import _LazyBuilder

color_data = {'color': ["R", "B", "G", "A", "A"]}  # 5 samples
builder = _LazyBuilder(color_data)
color_column = feature_column.categorical_column_with_vocabulary_list(
    'color', ['R', 'G', 'B'], dtype=tf.string, default_value=-1
)
color_column_tensor = color_column._get_sparse_tensors(builder)

# Convert the sparse column to a dense one: one-hot here, multi-hot when a sample has several values
color_column_identy = feature_column.indicator_column(color_column)
color_dense_tensor = feature_column.input_layer(color_data, [color_column_identy])

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())  # the vocabulary lookup table must be initialized
    print(session.run([color_column_tensor.id_tensor]))
    print('use input_layer' + '_' * 40)
    print(session.run([color_dense_tensor]))

Output

[SparseTensorValue(indices=array([[0, 0],
       [1, 0],
       [2, 0],
       [3, 0],
       [4, 0]]), values=array([ 0,  2,  1, -1, -1]), dense_shape=array([5, 1]))]
use input_layer________________________________________
[array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 0.],
       [0., 0., 0.]], dtype=float32)]
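
To make the output above easier to read, here is a minimal plain-Python sketch of the two steps involved: looking each string up in the vocabulary, then expanding the resulting id into a one-hot row. It is not TensorFlow code, and the helper names lookup_ids and one_hot are made up for illustration; it only mirrors the logic, assuming out-of-vocabulary values map to default_value and then to an all-zero row.

```python
# Plain-Python sketch of vocabulary lookup followed by one-hot encoding.
# Hypothetical helpers for illustration only; not part of TensorFlow.

def lookup_ids(values, vocabulary, default_value=-1):
    """Map each string to its index in the vocabulary, or default_value if absent."""
    index = {word: i for i, word in enumerate(vocabulary)}
    return [index.get(v, default_value) for v in values]


def one_hot(ids, depth):
    """Expand each id into a row of length depth; ids outside [0, depth) give all zeros."""
    return [[1.0 if i == col else 0.0 for col in range(depth)] for i in ids]


colors = ["R", "B", "G", "A", "A"]
vocabulary = ["R", "G", "B"]

ids = lookup_ids(colors, vocabulary)
dense = one_hot(ids, len(vocabulary))

print(ids)
for row in dense:
    print(row)
```

The two "A" samples are not in the vocabulary, so they receive the id -1 and end up as rows of zeros, which is the same pattern as in the TensorFlow output.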

categorical_column_with_identity

Some data, such as IDs, look like numbers but are really categorical.

color_data = {'color': [1, 2, 3, 5,1]}
builder = _LazyBuilder(color_data)
color_column = feature_column.categorical_column_with_identity(
    key='color', num_buckets=4, default_value=0)
color_column_tensor = color_column._get_sparse_tensors(builder)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    print(sess.run(color_column_tensor.id_tensor))

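Output. num_buckets is the number of categories, so valid IDs lie in [0, num_buckets); default_value is the ID assigned to any input outside that range.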

SparseTensorValue(indices=array([[0, 0],
       [1, 0],
       [2, 0],
       [3, 0],
       [4, 0]]), values=array([1, 2, 3, 0, 1]), dense_shape=array([5, 1]))
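
The rule behind this output can be sketched in a few lines of plain Python. This is an illustration of the logic only, not the TensorFlow implementation; identity_ids is a hypothetical helper, and raising an error when no default_value is given is an assumption modelled on the documented behaviour.

```python
# Plain-Python sketch of the identity-column rule (hypothetical helper, not TensorFlow).

def identity_ids(values, num_buckets, default_value=None):
    """Keep ids in [0, num_buckets); replace the rest with default_value."""
    result = []
    for v in values:
        if 0 <= v < num_buckets:
            result.append(v)
        elif default_value is not None:
            result.append(default_value)
        else:
            # Assumption: without a default, an out-of-range id is treated as an error.
            raise ValueError("id %d is outside [0, %d)" % (v, num_buckets))
    return result


ids = identity_ids([1, 2, 3, 5, 1], num_buckets=4, default_value=0)
print(ids)
```

With num_buckets=4 the id 5 is out of range, so it is replaced by the default 0, as in the SparseTensorValue above.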

categorical_column_with_hash_bucket

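When the number of categories is very large but most of them are rarely used, or never used at all, hashing the values into buckets is more convenient.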

dpt =  tf.feature_column.categorical_column_with_hash_bucket(
      'dpt', hash_bucket_size=400)

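The mapping can be inspected with string_to_hash_bucket_fast

# dpt_values stands for the raw string values of the 'dpt' feature (a list or string tensor), not the column object
sess.run(tf.string_to_hash_bucket_fast(dpt_values, 400))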

https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_hash_bucket
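
As a rough picture of what hashing into buckets means, the sketch below maps strings to bucket indices in plain Python. It deliberately swaps in MD5 from the standard library as the hash function, so the bucket numbers will not match TensorFlow's; hash_bucket and the sample values are hypothetical.

```python
import hashlib

# Plain-Python sketch of hash bucketing. MD5 is a stand-in hash chosen for
# illustration; TensorFlow uses a different fingerprint function, so the
# indices printed here will differ from string_to_hash_bucket_fast.

def hash_bucket(value, hash_bucket_size):
    """Hash a string deterministically and reduce it modulo the bucket count."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % hash_bucket_size


dpt_values = ["PEK", "SHA", "CAN", "PEK"]  # hypothetical sample values
buckets = [hash_bucket(v, 400) for v in dpt_values]
print(buckets)
```

No vocabulary is stored: the same string always lands in the same bucket, and two different strings may collide. A larger hash_bucket_size lowers the collision rate at the cost of a wider feature.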

crossed_column

tf.feature_column.crossed_column(["dpt","arr"],hash_bucket_size = 100000)

The mapping can be inspected with input_layer

def get_hash(num1, num2, hash_bucket_size):
    # cross_arr and cross_dpt are the raw string lists of the two features, defined elsewhere
    arr_x_dpt = {
        'arr': tf.constant(cross_arr[num1:num2], dtype=tf.string),
        'dpt': tf.constant(cross_dpt[num1:num2], dtype=tf.string),
    }
    # crossed column
    crossed_sn_raw = tf.feature_column.crossed_column(['arr', 'dpt'], hash_bucket_size=hash_bucket_size)
    crossed_sn = tf.feature_column.indicator_column(crossed_sn_raw)
    layer_sn = tf.feature_column.input_layer(arr_x_dpt, [crossed_sn])

    with tf.Session() as session:
        init = tf.global_variables_initializer()
        session.run(init)
        # the position of the 1 in each one-hot row is the hash bucket of that sample
        res = session.run(layer_sn).argmax(axis=1)
    return res
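
Conceptually, a crossed column treats the combination of values as a single categorical value and hashes it. The plain-Python sketch below shows that idea only; cross_key and cross_bucket are hypothetical helpers, MD5 is a stand-in hash, and the way the values are joined is an assumption, so the bucket indices will not match what TensorFlow produces.

```python
import hashlib

# Plain-Python sketch of a feature cross: combine the values, then hash.
# The join rule and the MD5 hash are assumptions for illustration, not
# TensorFlow's actual scheme.

def cross_key(values):
    """Join the feature values into one composite key (order matters)."""
    return "\x1f".join(values)


def cross_bucket(values, hash_bucket_size):
    """Hash the composite key into one of hash_bucket_size buckets."""
    digest = hashlib.md5(cross_key(values).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % hash_bucket_size


pairs = [("PEK", "SHA"), ("SHA", "PEK"), ("PEK", "SHA")]  # hypothetical (dpt, arr) pairs
buckets = [cross_bucket(p, 100000) for p in pairs]
print(buckets)
```

Because the pair is hashed as a whole, ("PEK", "SHA") and ("SHA", "PEK") are different composite keys, while identical pairs always fall into the same bucket.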

Continuous data

Continuous data can be handled as follows:

Use it directly

tf.feature_column.numeric_column("age")

Bucketize by CDF (quantiles)

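# bucketized_column takes a numeric column as its source, not a feature name
age = tf.feature_column.numeric_column("age")
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[1.0, 10.0, 50.0, 75.0])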
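
To see which bucket a given value falls into, the boundaries above can be tried out in plain Python. This is a small sketch with the standard bisect module, assuming each bucket includes its left boundary and excludes its right one; bucket_index is a hypothetical helper, not a TensorFlow function.

```python
import bisect

# Plain-Python sketch of bucketization (hypothetical helper, not TensorFlow).
# Four boundaries give five buckets:
#   (-inf, 1), [1, 10), [10, 50), [50, 75), [75, +inf)
boundaries = [1.0, 10.0, 50.0, 75.0]


def bucket_index(value, boundaries):
    """Return the index of the bucket that contains value."""
    return bisect.bisect_right(boundaries, value)


ages = [0.5, 1.0, 9.9, 10.0, 49.0, 75.0, 80.0]
buckets = [bucket_index(a, boundaries) for a in ages]
print(buckets)
```

N boundaries always produce N + 1 buckets, so the one-hot width of the bucketized feature is len(boundaries) + 1.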

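In general, the boundaries are determined from percentiles of the feature values:

import numpy as np

# f_values: the observed values of the feature; slice_num: the number of buckets
thresholds = []
percentiles = np.linspace(100/slice_num, 100-100/slice_num, slice_num-1)
# newer NumPy (1.22+) spells this argument method='lower'
thresholds_raw = np.percentile(np.array(f_values), percentiles, interpolation='lower')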
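
The percentile step can be followed by hand with a small plain-Python sketch. It assumes the "lower" rule means taking the sorted value at the floor of the fractional position q/100 * (n - 1); lower_percentile and quantile_boundaries are hypothetical helpers, and dropping duplicate thresholds at the end is an added step, not something taken from the snippet above.

```python
import math

# Plain-Python sketch of equal-frequency boundaries (hypothetical helpers).

def lower_percentile(sorted_values, q):
    """q-th percentile, rounding the fractional position down to an existing value."""
    position = q / 100.0 * (len(sorted_values) - 1)
    return sorted_values[int(math.floor(position))]


def quantile_boundaries(values, slice_num):
    """Return up to slice_num - 1 boundaries that split values into equal-frequency buckets."""
    ordered = sorted(values)
    percentiles = [100.0 * k / slice_num for k in range(1, slice_num)]
    thresholds = [lower_percentile(ordered, q) for q in percentiles]
    # Assumption: repeated thresholds are dropped, since they would create empty buckets.
    return sorted(set(thresholds))


uniform = quantile_boundaries(list(range(1, 101)), slice_num=4)
skewed = quantile_boundaries([1, 1, 1, 1, 1, 1, 2, 3], slice_num=4)
print(uniform)
print(skewed)
```

On evenly spread data the three boundaries land at the quartiles. On heavily skewed data several percentiles coincide, which is why the number of usable boundaries can be smaller than slice_num - 1.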
