TensorFlow Getting Started Tutorial: Datasets Quick Start (tf.data)

Reference article: Datasets Quick Start

Datasets Quick Start

This guide introduces the tf.data module and shows how to:

  • Read in-memory data from numpy arrays.
  • Read lines from a csv file.

Basic Input

Taking slices from an array is the simplest way to get started with tf.data.

Consider the following train_input_fn, from the Premade Estimators guide:

import tensorflow as tf

def train_input_fn(features, labels, batch_size):
    """An input function for training."""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)

    # Return the dataset.
    return dataset

Let's look at this function more closely.

Arguments

This function takes three arguments. Arguments expecting an "array" can accept nearly anything that can be converted to an array with numpy.array. One exception is tuple, which, as we will see, has special meaning for Datasets; a short sketch follows the list below.

  • features: a {'feature_name': array} dictionary (or DataFrame) containing the raw input features.
  • labels: an array containing the label for each example.
  • batch_size: an integer indicating the desired batch size.
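
To see why tuples are special: from_tensor_slices does not convert a tuple into a single array; it slices each component separately. A minimal sketch with made-up toy data:

import numpy as np
import tensorflow as tf

xs = np.array([1, 2, 3])
ys = np.array([10, 20, 30])

# The tuple is treated as structure: each component is sliced separately,
# producing a Dataset of (x, y) pairs rather than slices of one stacked array.
pairs = tf.data.Dataset.from_tensor_slices((xs, ys))
print(pairs)  # <TensorSliceDataset shapes: ((), ()), types: (tf.int64, tf.int64)>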

In premade_estimator.py, we retrieve the iris data using the iris_data.load_data() function.
You can run it and unpack the results as follows:

import iris_data

# Fetch the data
train, test = iris_data.load_data()
features, labels = train

Then we pass the data to the input function with a line like the following:

batch_size=100
iris_data.train_input_fn(features, labels, batch_size)

Let's walk through the train_input_fn() function.

Slices

The function starts by using tf.data.Dataset.from_tensor_slices to create a Dataset representing slices of the array, taken across the first dimension. For example, the array containing the MNIST training data (used in the TF Layers tutorial, Building a Convolutional Neural Network) has a shape of (60000, 28, 28); passing it to from_tensor_slices returns a Dataset containing 60000 slices, each one a 28x28 image.

The code that returns this Dataset is as follows:

train, test = tf.keras.datasets.mnist.load_data()
mnist_x, mnist_y = train

mnist_ds = tf.data.Dataset.from_tensor_slices(mnist_x)
print(mnist_ds)

This prints the following line, showing the shapes and types of the data in the dataset. Note that a Dataset does not know how many items it contains:

<TensorSliceDataset shapes: (28,28), types: tf.uint8>

The Dataset above represents a simple collection of arrays, but datasets are much more powerful than this. A Dataset can transparently handle any nested combination of dictionaries or tuples (or namedtuples).

For example, after converting the iris features to a standard Python dictionary, you can convert the dictionary of arrays into a Dataset of dictionaries, as follows:

dataset = tf.data.Dataset.from_tensor_slices(dict(features))
print(dataset)
<TensorSliceDataset

  shapes: {
    SepalLength: (), PetalWidth: (),
    PetalLength: (), SepalWidth: ()},

  types: {
      SepalLength: tf.float64, PetalWidth: tf.float64,
      PetalLength: tf.float64, SepalWidth: tf.float64}
>

Here we see that when a Dataset contains structured elements, the shapes and types of the Dataset take on the same structure. This dataset contains dictionaries of scalars, all of type tf.float64.

The first line of the iris train_input_fn uses the same functionality, but adds another level of structure: it creates a dataset containing (features_dict, label) pairs.

The following code shows that the label is a scalar of type int64:

# Convert the inputs to a Dataset.
dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
print(dataset)
<TensorSliceDataset
    shapes: (
        {
          SepalLength: (), PetalWidth: (),
          PetalLength: (), SepalWidth: ()},
        ()),

    types: (
        {
          SepalLength: tf.float64, PetalWidth: tf.float64,
          PetalLength: tf.float64, SepalWidth: tf.float64},
        tf.int64)>
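
Namedtuples are handled the same way. A minimal sketch; the Example type here is made up for illustration:

import collections
import numpy as np
import tensorflow as tf

# A hypothetical record type with two fields.
Example = collections.namedtuple('Example', ['x', 'y'])

data = Example(x=np.array([1.0, 2.0, 3.0]), y=np.array([0, 1, 0]))
ds = tf.data.Dataset.from_tensor_slices(data)
print(ds)  # the shapes and types keep the Example(x=..., y=...) structure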

Manipulation

Currently the Dataset would iterate over the data once, in a fixed order, and only produce a single element at a time. It needs further processing before it can be used for training. Fortunately, the tf.data.Dataset class provides methods to better prepare the data for training. The next line of train_input_fn takes advantage of several of these methods:

# Shuffle, repeat, and batch the examples.
dataset = dataset.shuffle(1000).repeat().batch(batch_size)

The tf.data.Dataset.shuffle method uses a fixed-size buffer to shuffle the items as they pass through. In this case the buffer_size of 1000 is greater than the number of examples in the Dataset, ensuring that the data is completely shuffled (the iris data set only contains 150 examples).

The tf.data.Dataset.repeat method restarts the Dataset when it reaches the end. To limit the number of epochs, set its count argument.

The tf.data.Dataset.batch method collects a number of examples and stacks them to create batches, which adds a dimension to their shape; the new dimension is added as the first dimension. The following code uses batch on the MNIST Dataset from earlier, resulting in a Dataset containing 3D arrays representing stacks of (28,28) images:

print(mnist_ds.batch(100))
<BatchDataset
  shapes: (?, 28, 28),
  types: tf.uint8>

Note that the dataset has an unknown batch size (the ? in the shape above) because the last batch will have fewer elements.
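
If you need a bounded number of epochs or a fully-defined batch size, repeat accepts a count argument, and batch accepts a drop_remainder flag in newer TensorFlow releases (around 1.10 and later). A hedged sketch:

# Make 10 passes over the data instead of repeating indefinitely, and drop
# the smaller final batch so that every batch has a static, known size.
dataset = dataset.shuffle(1000).repeat(10).batch(batch_size, drop_remainder=True)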

In train_input_fn, after batching, the Dataset contains 1D vectors of elements where previously they were scalars. The first parts of those vectors look like this:

print(dataset)
<BatchDataset
    shapes: (
        {
          SepalLength: (?,), PetalWidth: (?,),
          PetalLength: (?,), SepalWidth: (?,)},
        (?,)),

    types: (
        {
          SepalLength: tf.float64, PetalWidth: tf.float64,
          PetalLength: tf.float64, SepalWidth: tf.float64},
        tf.int64)>

Return

At this point the Dataset contains (features_dict, labels) pairs. This is the format expected by the train and evaluate methods, so the input_fn returns the dataset.

The labels can, and should, be omitted when using the predict method.
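
For example, a prediction input function might look roughly like the following sketch, written in the style of train_input_fn but not part of iris_data:

def predict_input_fn(features, batch_size):
    """An input function for prediction: features only, no labels."""
    # Convert the features to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices(dict(features))

    # No shuffle or repeat: predict makes a single, ordered pass.
    return dataset.batch(batch_size)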

Reading a CSV File

The most common real-world use case for the tf.data module is to stream data from files on disk. The tf.data module includes a variety of file readers. Let's see how parsing the iris dataset from the csv file looks when implemented with a Dataset.

The following call to the iris_data.maybe_download function downloads the data if necessary, and returns the pathnames of the resulting files:

import iris_data
train_path, test_path = iris_data.maybe_download()

The iris_data.csv_input_fn function contains an alternative implementation that parses the csv files using a Dataset.

Let's look at how to build an Estimator-compatible input function that reads from the local files.

Build the Dataset

We start by building a tf.data.TextLineDataset object to read the file one line at a time. Then we call the tf.data.Dataset.skip method to skip over the first line of the file, which contains the header:

ds = tf.data.TextLineDataset(train_path).skip(1)
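
To peek at what the Dataset produces at this point, you can iterate over a few elements. This sketch assumes eager execution, which must be enabled explicitly at program startup in TensorFlow 1.x:

import tensorflow as tf

tf.enable_eager_execution()  # not needed in TensorFlow 2.x

ds = tf.data.TextLineDataset(train_path).skip(1)
for line in ds.take(3):
    print(line)  # each element is a scalar tf.string tensor holding one csv row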

Build a csv line parser

We will start by building a function to parse a single line.

The following _parse_line function uses tf.decode_csv, and some simple Python code, to parse a single line into its features and the label. Since Estimators require that features be represented as a dictionary, we rely on Python's built-in dict and zip functions to build that dictionary; the feature names are its keys. We then call the dictionary's pop method to remove the label field from the features dictionary:

# Metadata describing the text columns
COLUMNS = ['SepalLength', 'SepalWidth',
           'PetalLength', 'PetalWidth',
           'label']
FIELD_DEFAULTS = [[0.0], [0.0], [0.0], [0.0], [0]]
def _parse_line(line):
    # Decode the line into its fields
    fields = tf.decode_csv(line, FIELD_DEFAULTS)

    # Pack the result into a dictionary
    features = dict(zip(COLUMNS, fields))

    # Separate the label from the features
    label = features.pop('label')

    return features, label
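
You can sanity-check the parser on a single hand-written line (the sample values below are made up; this again assumes eager execution):

# A made-up csv row: four float features followed by an integer label.
sample_line = tf.constant("5.1,3.3,1.7,0.5,0")
features, label = _parse_line(sample_line)
print(features)  # {'SepalLength': 5.1, 'SepalWidth': 3.3, 'PetalLength': 1.7, 'PetalWidth': 0.5}
print(label)     # 0 (an int32 scalar, from the [0] field default)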

Parse the lines

Datasets have many methods for manipulating the data while it is being piped to a model. The most heavily-used method is tf.data.Dataset.map, which applies a transformation to each element of the Dataset. The map method takes a map_func argument that describes how each item in the Dataset should be transformed.

[Figure: the map method applies map_func to transform each item in the Dataset.]

So, to parse the lines as they are streamed out of the csv file, we pass our _parse_line function to the map method:

ds = ds.map(_parse_line)
print(ds)
<MapDataset
shapes: (
    {SepalLength: (), PetalWidth: (), ...},
    ()),
types: (
    {SepalLength: tf.float32, PetalWidth: tf.float32, ...},
    tf.int32)>

Now instead of simple scalar strings, the dataset contains (features, label) pairs.

The remainder of the iris_data.csv_input_fn function is identical to the iris_data.train_input_fn function covered in the Basic Input section above.
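
Putting the pieces together, csv_input_fn looks roughly like the following sketch, reconstructed from the steps above rather than copied from the file:

def csv_input_fn(csv_path, batch_size):
    # Create a dataset containing the text lines, skipping the header row.
    dataset = tf.data.TextLineDataset(csv_path).skip(1)

    # Parse each line into a (features, label) pair.
    dataset = dataset.map(_parse_line)

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)

    return dataset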

Try It Out

This function can be used as a replacement for iris_data.train_input_fn. It can be used to feed an estimator as follows:

train_path, test_path = iris_data.maybe_download()

# All the inputs are numeric
feature_columns = [
    tf.feature_column.numeric_column(name)
    for name in iris_data.CSV_COLUMN_NAMES[:-1]]

# Build the estimator
est = tf.estimator.LinearClassifier(feature_columns,
                                    n_classes=3)
# Train the estimator
batch_size = 100
est.train(
    steps=1000,
    input_fn=lambda : iris_data.csv_input_fn(train_path, batch_size))

Estimators expect an input_fn to take no arguments. To work around this restriction, we use a lambda to capture the arguments and provide the expected interface.
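
An equivalent way to bind the arguments is functools.partial from the Python standard library:

import functools

est.train(
    steps=1000,
    input_fn=functools.partial(iris_data.csv_input_fn,
                               train_path, batch_size))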

Summary

The tf.data module provides a collection of classes and functions for easily reading data from a variety of sources. In addition, tf.data has simple yet powerful methods for applying a wide variety of standard and custom transformations.

Now you have the basic idea of how to efficiently load data into an Estimator.

