Reference article: Datasets: a quick overview
Datasets: a quick overview
This overview introduces tf.data through two simple examples:
- Reading in-memory data from NumPy arrays.
- Reading a CSV file line by line.
Basic Input
Taking slices from an array is the simplest way to get started with tf.data. The following train_input_fn shows how:
import tensorflow as tf

def train_input_fn(features, labels, batch_size):
    """An input function for training."""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)
    # Return the dataset.
    return dataset
Let's look at this function more closely.
Arguments
This function takes three arguments. An argument that expects an "array" accepts almost anything that numpy.array can convert to an array. There is one exception: a tuple, which, as we will see, has a special meaning for Datasets.
- features: a {'feature_name': array} dictionary (or DataFrame) containing the raw input features.
- labels: an array containing the label for each example.
- batch_size: an integer giving the desired batch size.
In premade_estimator.py, we retrieve the iris data with the iris_data.load_data() function.
You can run this function and unpack the results as follows:
import iris_data
# Fetch the data.
train, test = iris_data.load_data()
features, labels = train
The data is then passed to the input function with a line like the following:
batch_size=100
iris_data.train_input_fn(features, labels, batch_size)
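Let's walk through the train_input_fn() function.
Slices
The function starts by calling tf.data.Dataset.from_tensor_slices, which creates a Dataset representing slices of the array along its first dimension. For example, the MNIST training images (see "TF Layers Tutorial: Building a convolution neural network") form an array of shape (60000, 28, 28). Passing it to from_tensor_slices yields a Dataset of 60000 slices, each a 28x28 image.
The code that returns this Dataset is as follows: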
train, test = tf.keras.datasets.mnist.load_data()
mnist_x, mnist_y = train
mnist_ds = tf.data.Dataset.from_tensor_slices(mnist_x)
print(mnist_ds)
<TensorSliceDataset shapes: (28,28), types: tf.uint8>
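To make the slicing concrete, here is a minimal plain-Python sketch (no TensorFlow involved) of what "slicing along the first dimension" means. The helper name slice_first_dim and the tiny 3x2x2 nested list are illustrative assumptions, not part of tf.data.

```python
def slice_first_dim(array):
    """Yield one element per entry along the first dimension (hypothetical helper)."""
    for item in array:
        yield item


# A stand-in for image data with shape (3, 2, 2): three 2x2 "images".
images = [
    [[0, 1], [2, 3]],
    [[4, 5], [6, 7]],
    [[8, 9], [10, 11]],
]

slices = list(slice_first_dim(images))
print(len(slices))  # the number of slices equals the size of the first dimension
print(slices[0])    # each slice is a single 2x2 "image"
```

In the same way, an array of shape (60000, 28, 28) becomes 60000 elements of shape (28, 28).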
The Dataset above represents a simple collection of arrays, but datasets can be much more complex than that. A Dataset can transparently handle any nested combination of dictionaries or tuples (or namedtuple).
For example, after converting the iris features to a standard Python dictionary, you can convert the dictionary of arrays to a Dataset of dictionaries as follows:
dataset = tf.data.Dataset.from_tensor_slices(dict(features))
print(dataset)
<TensorSliceDataset
shapes: {
SepalLength: (), PetalWidth: (),
PetalLength: (), SepalWidth: ()},
types: {
SepalLength: tf.float64, PetalWidth: tf.float64,
PetalLength: tf.float64, SepalWidth: tf.float64}
>
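When a Dataset contains structured elements, its shapes and types take on the same structure. The first line of the iris train_input_fn uses the same functionality but adds one more level of structure: it creates a dataset of (features_dict, label) pairs.
The following code shows that the label is a scalar of type int64:
# Convert the inputs to a Dataset.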
dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
print(dataset)
<TensorSliceDataset
shapes: (
{
SepalLength: (), PetalWidth: (),
PetalLength: (), SepalWidth: ()},
()),
types: (
{
SepalLength: tf.float64, PetalWidth: tf.float64,
PetalLength: tf.float64, SepalWidth: tf.float64},
tf.int64)>
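The nested case can be sketched in plain Python as well. The helper below is a hypothetical stand-in, not the tf.data implementation: it treats dicts and tuples as structure and everything else as a column to be indexed, which mirrors the special role of tuples mentioned earlier. The feature values are made up for illustration.

```python
def slice_structure(structure):
    """Slice a nested structure of equal-length columns into per-example elements."""

    def length(s):
        if isinstance(s, dict):
            return length(next(iter(s.values())))
        if isinstance(s, tuple):
            return length(s[0])
        return len(s)

    def pick(s, i):
        if isinstance(s, dict):
            return {key: pick(value, i) for key, value in s.items()}
        if isinstance(s, tuple):
            return tuple(pick(value, i) for value in s)
        return s[i]  # a plain column: take the i-th entry

    return [pick(structure, i) for i in range(length(structure))]


# Made-up values, two examples with two features each.
features = {"SepalLength": [6.4, 5.0], "PetalWidth": [2.2, 0.3]}
labels = [2, 0]

elements = slice_structure((features, labels))
print(elements[0])  # one (features_dict, label) pair of scalars
```

Each element keeps the structure of the input: a dictionary of scalar features paired with a scalar label.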
Manipulation
At this point the Dataset would iterate over the data once, in a fixed order, and produce only one element at a time. It needs further processing before it can be used for training. Fortunately, the tf.data.Dataset class provides methods to better prepare the data for training. The next line of train_input_fn uses several of them:
# Shuffle, repeat, and batch the examples.
dataset = dataset.shuffle(1000).repeat().batch(batch_size)
The shuffle method uses a fixed-size buffer to shuffle the elements as they pass through; a buffer of 1000 is larger than the iris training set, so the data is shuffled completely. The repeat method restarts the Dataset when it reaches the end. The batch method collects a number of examples and stacks them, which adds a new first dimension to their shape. For example, batching the MNIST Dataset from earlier gives a Dataset of 3D arrays, each a stack of (28, 28) images:
print(mnist_ds.batch(100))
<BatchDataset
shapes: (?, 28, 28),
types: tf.uint8>
Note that the batch dimension of the dataset is unknown (?) because the last batch may contain fewer elements.
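The behavior of the three methods can be imitated with plain Python generators. This is only a sketch under simplifying assumptions: the function names shuffle, repeat, and batch are local helpers written for this illustration, and the buffer logic is a rough approximation of a fixed-size shuffle buffer, not TensorFlow's code.

```python
import itertools
import random


def shuffle(items, buffer_size, seed=None):
    """Approximate a fixed-size shuffle buffer: emit a random buffered item."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:  # drain whatever is left once the input is exhausted
        yield buffer.pop(rng.randrange(len(buffer)))


def repeat(make_items):
    """Restart the sequence forever; make_items builds a fresh pass each time."""
    while True:
        yield from make_items()


def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size."""
    iterator = iter(items)
    while True:
        chunk = list(itertools.islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


examples = list(range(10))

# Without repeat, the last batch is smaller: 10 examples in batches of 4.
one_epoch = list(batch(shuffle(examples, buffer_size=1000, seed=0), 4))
batch_sizes = [len(b) for b in one_epoch]  # sizes 4, 4, 2

# With repeat before batch, batches run across epoch boundaries.
endless = batch(repeat(lambda: shuffle(examples, buffer_size=1000)), 4)
first_three = list(itertools.islice(endless, 3))
```

Because the buffer (1000) is larger than the input (10 items), every item sits in the buffer before any is emitted, so the shuffle is complete. A buffer of size 1 would leave the order unchanged.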
In train_input_fn, after batching, the Dataset contains 1D vectors of elements where each element was previously a scalar:
print(dataset)
<BatchDataset
shapes: (
{
SepalLength: (?,), PetalWidth: (?,),
PetalLength: (?,), SepalWidth: (?,)},
(?,)),
types: (
{
SepalLength: tf.float64, PetalWidth: tf.float64,
PetalLength: tf.float64, SepalWidth: tf.float64},
tf.int64)>
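As a rough illustration of how scalars turn into vectors, the following plain-Python sketch stacks a few (features_dict, label) pairs into one batched pair. The helper stack_batch and the sample values are assumptions made for this example only.

```python
def stack_batch(elements):
    """Stack a list of (features_dict, label) pairs into one batched pair."""
    feature_names = elements[0][0].keys()
    features = {
        name: [feature_dict[name] for feature_dict, _ in elements]
        for name in feature_names
    }
    labels = [label for _, label in elements]
    return features, labels


# Three made-up examples, each a dictionary of scalars plus a scalar label.
pairs = [
    ({"SepalLength": 6.4, "PetalWidth": 2.2}, 2),
    ({"SepalLength": 5.0, "PetalWidth": 0.3}, 0),
    ({"SepalLength": 4.9, "PetalWidth": 0.1}, 0),
]

batched_features, batched_labels = stack_batch(pairs)
print(batched_features["SepalLength"])  # one vector per feature
print(batched_labels)                   # one vector of labels
```

The structure (a dictionary paired with labels) is unchanged; only the leaves grow from scalars into vectors whose length is the batch size.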
Return
At this point the Dataset contains (features_dict, labels) pairs. This is the format expected by the train and evaluate methods, so the input_fn returns the dataset.
When using the predict method, the labels can and should be omitted.
Read CSV files
The following call to the iris_data.maybe_download function downloads the data if necessary and returns the paths of the resulting files:
import iris_data
train_path, test_path = iris_data.maybe_download()
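The iris_data.csv_input_fn function contains an alternative implementation that parses the csv files using a Dataset.
Let's look at how to build an Estimator-compatible input function that reads from local files.
Build the Dataset
We start by building a TextLineDataset to read the file one line at a time, then call skip to skip the first line of the file, which contains a header rather than an example: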
ds = tf.data.TextLineDataset(train_path).skip(1)
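For intuition, reading a text file line by line and skipping the first line can be sketched with the standard library alone. The helper text_lines and the contents of the temporary file are made up for this illustration; they are not the real iris file or the TextLineDataset implementation.

```python
import itertools
import os
import tempfile


def text_lines(path, skip=0):
    """Yield the lines of a text file one at a time, skipping the first `skip`."""
    with open(path) as f:
        for line in itertools.islice(f, skip, None):
            yield line.rstrip("\n")


# Write a small made-up CSV file: one header line, then two examples.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("SepalLength,SepalWidth,PetalLength,PetalWidth,label\n")
    tmp.write("6.4,2.8,5.6,2.2,2\n")
    tmp.write("5.0,2.3,3.3,1.0,1\n")
    sample_path = tmp.name

lines = list(text_lines(sample_path, skip=1))
os.remove(sample_path)
print(lines)  # the header is gone; each element is one raw line of text
```

Each element is still an unparsed string at this stage, which is why a line parser is needed next.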
Build a csv line parser
We start by building a function that parses a single line.
# Metadata describing the text columns.
COLUMNS = ['SepalLength', 'SepalWidth',
           'PetalLength', 'PetalWidth',
           'label']
FIELD_DEFAULTS = [[0.0], [0.0], [0.0], [0.0], [0]]
def _parse_line(line):
    # Decode the line into its fields.
    fields = tf.decode_csv(line, FIELD_DEFAULTS)
    # Pack the result into a dictionary.
    features = dict(zip(COLUMNS, fields))
    # Separate the label from the features.
    label = features.pop('label')
    return features, label
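The role of FIELD_DEFAULTS (one default per column, which also fixes that column's type) can be imitated in plain Python. The sketch below uses the standard csv module and a hypothetical parse_line helper; it is an analogy for the parser above, not a replacement for tf.decode_csv.

```python
import csv

COLUMNS = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "label"]
# One default per column; its Python type decides how the field is converted.
FIELD_DEFAULTS = [0.0, 0.0, 0.0, 0.0, 0]


def parse_line(line):
    """Parse one CSV line into a (features_dict, label) pair."""
    raw_fields = next(csv.reader([line]))
    fields = [
        type(default)(value) if value != "" else default  # empty field: use default
        for value, default in zip(raw_fields, FIELD_DEFAULTS)
    ]
    features = dict(zip(COLUMNS, fields))
    label = features.pop("label")  # separate the label from the features
    return features, label


features, label = parse_line("6.4,2.8,5.6,2.2,2")
features_missing, _ = parse_line("6.4,,5.6,2.2,2")  # SepalWidth is empty
print(features, label)
print(features_missing["SepalWidth"])  # falls back to the default
```

The float defaults produce float features and the integer default produces an integer label, matching the float and int types that appear in the parsed dataset.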
Parse the lines
The tf.data.Dataset.map method takes a map_func argument that describes how each element of the Dataset should be transformed.
So, to parse the lines as they are read from the csv file, we pass our _parse_line function to the map method:
ds = ds.map(_parse_line)
print(ds)
<MapDataset
shapes: (
{SepalLength: (), PetalWidth: (), ...},
()),
types: (
{SepalLength: tf.float32, PetalWidth: tf.float32, ...},
tf.int32)>
Now the dataset contains (features, label) pairs instead of simple scalar strings.
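Python's built-in map offers a loose analogy for this step: it applies one function to every element and does so lazily, as elements are requested. The parser below is a simplified hypothetical stand-in, and the two lines are made-up data.

```python
def parse_line(line):
    """Hypothetical stand-in parser: four float features, then an int label."""
    *values, label = line.split(",")
    return [float(v) for v in values], int(label)


lines = ["6.4,2.8,5.6,2.2,2", "5.0,2.3,3.3,1.0,1"]

parsed = map(parse_line, lines)  # lazy: nothing has been parsed yet
first = next(parsed)             # parses only the first line
rest = list(parsed)              # parses the remaining lines
print(first)
```

The element type changes from a string to a (features, label) pair, while the number of elements stays the same.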
The remainder of the iris_data.csv_input_fn function is identical to iris_data.train_input_fn, which was covered in the Basic Input section.
Try it out
This function can be used as a replacement for iris_data.train_input_fn. It can feed data to an estimator as follows:
import tensorflow as tf
import iris_data

train_path, test_path = iris_data.maybe_download()
# All the inputs are numeric.
feature_columns = [
    tf.feature_column.numeric_column(name)
    for name in iris_data.CSV_COLUMN_NAMES[:-1]]
# Build the estimator.
est = tf.estimator.LinearClassifier(feature_columns,
                                    n_classes=3)
# Train the estimator.
batch_size = 100
est.train(
    steps=1000,
    input_fn=lambda: iris_data.csv_input_fn(train_path, batch_size))
Estimators expect an input_fn that takes no arguments. To work around this restriction, we use a lambda to capture the arguments and provide the expected interface.
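The argument-capturing trick is ordinary Python and can be tried without TensorFlow. In the sketch below, csv_input_fn and train are hypothetical stand-ins for a two-argument input function and a consumer that calls its input_fn with no arguments; functools.partial is shown as an equivalent alternative to the lambda.

```python
import functools


def csv_input_fn(path, batch_size):
    """Hypothetical stand-in for an input function that needs two arguments."""
    return {"path": path, "batch_size": batch_size}


def train(input_fn):
    """Hypothetical stand-in for a consumer that calls input_fn with no arguments."""
    return input_fn()


# Both wrappers capture the arguments now and expose a zero-argument callable.
via_lambda = train(lambda: csv_input_fn("train.csv", 100))
via_partial = train(functools.partial(csv_input_fn, "train.csv", 100))
print(via_lambda == via_partial)
```

One difference worth keeping in mind: a lambda looks up the captured variables when it is called, whereas functools.partial binds the values when it is created.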
Summary
The tf.data module provides a collection of classes and functions for easily reading data from a variety of sources. In addition, tf.data offers simple, powerful methods for applying a wide range of standard and custom transformations.
You now have a basic understanding of how to efficiently feed data into an Estimator. As a next step, consider the following documents: