Performance Guide


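This article describes several ways to optimize TensorFlow code. It is divided into the following parts:

General optimization techniques that apply across model types and hardware
Optimization techniques specific to GPUs
Optimization techniques specific to CPUs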

1. General optimization techniques

This part covers general techniques that apply to a variety of model types and hardware. It is split into the following sections:

Input pipeline optimizations
Data formats
Common fused ops
RNN performance
Building and installing from source

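1.1 Input pipeline optimizations

A typical model reads data from disk and preprocesses it before passing it to the model. For example, a model that processes JPEG images usually loads each image from disk, decodes it into a Tensor, randomly crops and pads it (possibly with a random flip and distortion), and then batches it. This flow is called the input pipeline. As GPUs and other accelerators make models run faster, data preprocessing can become the bottleneck.

Determining whether the input pipeline is the bottleneck can be complicated. One of the simplest methods is to reduce the model after the input pipeline to a single operation (a trivial model) and measure the examples processed per second. If the examples per second of the full model and the trivial model differ only slightly, the input pipeline is likely the bottleneck. Other ways to check include:

Run nvidia-smi -l 2 from the command line to check GPU utilization. If GPU utilization stays below 80%, the input pipeline may be the bottleneck.
Generate a timeline and look for large blocks of white space (waiting). An example of generating a timeline exists as part of the XLA JIT tutorial.
Check CPU usage. Note that it is possible to have an optimized input pipeline and still lack the CPU cycles to process it.
Estimate the throughput needed and verify that the disk in use can sustain it. Note that some cloud platforms provide network-attached disks as slow as 50 MB/sec, which is slower than spinning disks (150 MB/sec), SATA SSDs (500 MB/sec), and PCIe SSDs (2,000+ MB/sec).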
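
The checks above can be sketched in plain Python. This is a minimal illustration rather than anything from the original guide: examples_per_sec, likely_input_bound, required_mb_per_sec, storage_fast_enough, and the 10% tolerance are hypothetical names and values chosen for the example. The storage speeds are the ones quoted in the list above.

```python
import time

# Approximate read speeds quoted in the list above, in MB/sec.
STORAGE_MB_PER_SEC = {
    "cloud network disk": 50,
    "spinning disk": 150,
    "SATA SSD": 500,
    "PCIe SSD": 2000,
}


def examples_per_sec(next_batch, step, batch_size, num_steps=20):
    """Runs step(next_batch()) num_steps times and returns examples per second."""
    start = time.perf_counter()
    for _ in range(num_steps):
        step(next_batch())
    elapsed = time.perf_counter() - start
    return batch_size * num_steps / max(elapsed, 1e-9)


def likely_input_bound(full_rate, trivial_rate, tolerance=0.1):
    """True when the trivial model is at most `tolerance` faster than the full one.

    The 10% default is an assumption for illustration, not a documented threshold.
    """
    return (trivial_rate - full_rate) <= tolerance * trivial_rate


def required_mb_per_sec(rate, bytes_per_example):
    """Read throughput needed to sustain `rate` examples per second."""
    return rate * bytes_per_example / 1e6


def storage_fast_enough(required):
    """Maps each storage type to whether its quoted speed covers `required`."""
    return {name: speed >= required for name, speed in STORAGE_MB_PER_SEC.items()}


# Hypothetical numbers: 1,500 images/sec at an average of 110 KB per JPEG.
needed = required_mb_per_sec(1500, 110_000)
print(needed, storage_fast_enough(needed))
```

In practice, next_batch would pull one batch from the input pipeline, and step would run either the full model or the single-op stand-in, so the two measured rates can be passed to likely_input_bound.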

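1.1.1 Preprocessing on the CPU

Placing input pipeline operations on the CPU can significantly improve performance, because it frees the GPU to focus on training. To ensure preprocessing runs on the CPU, wrap the preprocessing operations as shown below: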

with tf.device('/cpu:0'):
  # function to get and process images or data.
  distorted_inputs = load_and_distort_images()

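When tf.estimator.Estimator is used, the input function is automatically placed on the CPU.

1.1.2 Using the `tf.data` API

The tf.data API is replacing queue_runner as the officially recommended API for building input pipelines. The ResNet example for the CIFAR-10 dataset (arXiv:1512.03385) demonstrates the use of the tf.data API together with tf.estimator.Estimator.

The tf.data API uses C++ multi-threading, whereas the Python-based queue_runner is limited by Python's multi-threading performance, so tf.data performs better. A detailed performance guide for the tf.data API is available in the Input Pipeline Performance Guide.

Feeding data with feed_dict offers a high level of flexibility, but feed_dict does not scale well. When only a single GPU is used, the performance difference between the tf.data API and feed_dict may be negligible. The official recommendation is to avoid feed_dict for anything but small datasets, and in particular to avoid it with large inputs.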

# feed_dict often results in suboptimal performance when using large inputs.
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

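1.1.3 Fused decode and crop

If the inputs are JPEG images that also need cropping, use the fused op tf.image.decode_and_crop_jpeg to speed up preprocessing. tf.image.decode_and_crop_jpeg decodes only the part of the image inside the crop window. This significantly speeds up preprocessing when the crop window is much smaller than the full image. For the ImageNet dataset, this approach can speed up the input pipeline by up to 30%.

Example usage: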

def _image_preprocess_fn(image_buffer):
    # image_buffer 1-D string Tensor representing the raw JPEG image buffer.

    # Extract image shape from raw JPEG image buffer.
    image_shape = tf.image.extract_jpeg_shape(image_buffer)

    # Get a crop window with distorted bounding box.
    sample_distorted_bounding_box = tf.image.sample_distorted_bounding_box(
      image_shape, ...)
    bbox_begin, bbox_size, distort_bbox = sample_distorted_bounding_box

    # Decode and crop image.
    offset_y, offset_x, _ = tf.unstack(bbox_begin)
    target_height, target_width, _ = tf.unstack(bbox_size)
    crop_window = tf.stack([offset_y, offset_x, target_height, target_width])
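    # Pass the raw JPEG buffer (the function argument), not a decoded image.
    cropped_image = tf.image.decode_and_crop_jpeg(image_buffer, crop_window)
    return cropped_image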

tf.image.decode_and_crop_jpeg is available on all platforms. Note that there is no speedup on Windows, because the Windows build uses libjpeg whereas other platforms use libjpeg-turbo.

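1.1.4 Using large files

Reading a large number of small files significantly hurts I/O performance. For a given hardware setup, one way to achieve maximum I/O throughput is to preprocess the input data into larger TFRecord files (more than 100 MB each). For smaller datasets (200 MB to 1 GB), the best approach is often to load the entire dataset into memory. A conversion example is linked at this point in the original post.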
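
As a rough sketch of the grouping step, the function below plans which small files go into which large shard so that each shard reaches the 100 MB mark mentioned above. plan_shards and the file names are hypothetical. Actually writing the shards (for example with a TFRecord writer) is deliberately left out, since it depends on the TensorFlow version in use.

```python
def plan_shards(file_sizes, target_bytes=100 * 1024 * 1024):
    """Groups (name, size_in_bytes) pairs, in order, into shards.

    A shard is closed as soon as its total size reaches target_bytes, so every
    shard except possibly the last one is at least target_bytes large.
    """
    shards, current, current_size = [], [], 0
    for name, size in file_sizes:
        current.append(name)
        current_size += size
        if current_size >= target_bytes:
            shards.append(current)
            current, current_size = [], 0
    if current:
        # Leftover files that do not add up to a full shard.
        shards.append(current)
    return shards


# Hypothetical input: ten files of 30 MB each.
files = [("img_%d.jpg" % i, 30 * 1024 * 1024) for i in range(10)]
shards = plan_shards(files)
print([len(shard) for shard in shards])
```

A real pipeline would often also shuffle the file list before sharding. That step is omitted here to keep the example deterministic.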

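1.2 Data formats

Data format refers to the structure of a Tensor. The discussion below is specifically about 4D Tensors that represent images. In TensorFlow, the dimensions of a 4D Tensor are usually referred to by the following letters:

N: the number of images in a batch
H: the number of pixels in the vertical (height) dimension
W: the number of pixels in the horizontal (width) dimension
C: the number of channels, for example 1 for grayscale images and 3 for RGB images

TensorFlow has two commonly used data formats:

NCHW, or channels_first
NHWC, or channels_last

NHWC is the TensorFlow default, and NCHW is the format that NVIDIA GPUs and cuDNN work with fastest.

The best practice is to build models that work with both data formats. This simplifies training on GPUs and then running inference on CPUs. If TensorFlow is compiled with the Intel MKL optimizations, many ops, especially those related to CNNs, are optimized and support NCHW. Without MKL, many ops cannot be used on the CPU with NCHW.

NHWC runs slightly faster on the CPU. In the long term, tools that convert between the two formats automatically are expected to make the switch transparent, combining GPU efficiency in training with CPU speed in inference.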
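
To make the relationship between the two layouts concrete, here is a small sketch that computes the axis permutation between them. layout_perm and convert_shape are hypothetical helpers written for this example. The resulting list is the kind of perm argument that a transpose op such as tf.transpose expects.

```python
def layout_perm(src, dst):
    """Returns perm such that output axis i is taken from input axis perm[i]."""
    if sorted(src) != sorted(dst):
        raise ValueError("layouts must use the same axis letters")
    return [src.index(axis) for axis in dst]


def convert_shape(shape, src, dst):
    """Reorders a shape tuple from layout src to layout dst."""
    return tuple(shape[i] for i in layout_perm(src, dst))


# A batch of 32 RGB images of 224x224 pixels, stored as NHWC.
nhwc_shape = (32, 224, 224, 3)
print(layout_perm("NHWC", "NCHW"))
print(convert_shape(nhwc_shape, "NHWC", "NCHW"))
```

Keeping the layout as a parameter, rather than hard-coding axis positions, is one way to build a model that works with both formats.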

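1.3 Common fused ops

Fused ops combine multiple operations into a single operation to improve performance. TensorFlow has many fused ops, and XLA creates fused ops where possible to improve performance automatically. Below are some fused ops that can greatly improve performance and may be overlooked.

1.3.1 Fused batch norm

Batch norm is a computationally expensive op. Using fused batch norm can yield a 12%-30% speedup.

There are two commonly used batch norms, and both support fusing.

tf.layers.batch_normalization has supported fusing since TensorFlow 1.3.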

bn = tf.layers.batch_normalization(
    input_layer, fused=True, data_format='NCHW')

tf.contrib.layers.batch_norm has supported fusing since TensorFlow 1.0.

bn = tf.contrib.layers.batch_norm(input_layer, fused=True, data_format='NCHW')

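1.4 RNN performance

tf.nn.rnn_cell.BasicLSTMCell should be used only as a last resort.

When using one of the standard RNN cells, you can choose between tf.nn.static_rnn and tf.nn.dynamic_rnn. This choice should not normally affect runtime performance, but tf.nn.static_rnn enlarges the computation graph, which leads to longer compile times. Another advantage of tf.nn.dynamic_rnn is that it can swap memory from the GPU to the CPU, which makes it possible to train on very long sequences. Depending on the model and hardware configuration, this can come at a performance cost. It is also possible to run multiple iterations of tf.nn.dynamic_rnn and the underlying tf.while_loop in parallel, although this is rarely useful for RNN models because they are inherently sequential.

On NVIDIA GPUs, prefer tf.contrib.cudnn_rnn unless you need layer normalization, which it does not support. It is often at least an order of magnitude faster than tf.contrib.rnn.BasicLSTMCell and tf.contrib.rnn.LSTMBlockCell, and it uses 3-4x less memory than tf.contrib.rnn.BasicLSTMCell.

If you need to run the RNN one step at a time, as can happen in reinforcement learning, use tf.contrib.rnn.LSTMBlockCell with your own environment interaction loop inside a tf.while_loop. Running one step of the RNN at a time and returning to Python is possible, but it is slow.

On CPUs, on mobile devices, and when tf.contrib.cudnn_rnn is not available on your GPU, the fastest and most memory-efficient option is tf.contrib.rnn.LSTMBlockFusedCell.

For the less common RNN cell types (such as tf.contrib.rnn.NASCell, tf.contrib.rnn.PhasedLSTMCell, tf.contrib.rnn.UGRNNCell, tf.contrib.rnn.GLSTMCell, tf.contrib.rnn.Conv1DLSTMCell, tf.contrib.rnn.Conv2DLSTMCell, and tf.contrib.rnn.LayerNormBasicLSTMCell), be aware that they are implemented in the graph like tf.contrib.rnn.BasicLSTMCell and therefore suffer from similarly poor performance and high memory usage. Consider whether those trade-offs are worth it before using these cells. For example, layer normalization can speed up convergence, but cuDNN, which does not support it, can be 20x faster.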

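1.5 Building and installing from source

The prebuilt TensorFlow binaries do not enable every available optimization. If you use CPUs for training or inference, it is recommended to compile TensorFlow yourself with all of the optimizations the hardware supports. Speedups for training and inference on CPU are documented below in Comparing compiler optimizations.

To install the most optimized version of TensorFlow, build and install from source. Cross-compile when necessary, that is, when the build machine differs from the target hardware. The following command is an example of using bazel to compile for a specific platform: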

# This command optimizes for Intel’s Broadwell processor
bazel build -c opt --copt=-march="broadwell" --config=cuda //tensorflow/tools/pip_package:build_pip_package

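2. Optimization techniques for GPUs

This part covers optimization techniques specific to GPUs. Obtaining optimal performance on multiple GPUs is a challenge. A common approach is data parallelism. Scaling a model through data parallelism involves making multiple copies of the model, called "towers", and placing one tower on each GPU. Each tower operates on a different mini-batch and updates the parameters based on that mini-batch, so the parameters must be shared among all towers. How each tower obtains the updated variables and how the gradients are applied affect the performance, scaling, and convergence of the model. The rest of this section gives an overview of variable placement and the towering of a model on multiple GPUs. High-performance models involve more complex methods for sharing and updating variables between towers.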
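
The tower idea can be illustrated without any GPU. The sketch below assumes synchronous training in which the gradients from all towers are averaged before one shared parameter is updated. That is one common choice, not something the text above prescribes. The toy model (fitting y = w * x) and the names tower_gradient and data_parallel_step are hypothetical.

```python
def tower_gradient(w, shard):
    """Gradient of the mean squared error of y ~ w * x over one mini-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, batch, num_towers, learning_rate):
    """One synchronous update: split the batch, average tower gradients, update w."""
    if len(batch) % num_towers != 0:
        raise ValueError("batch size must be divisible by num_towers")
    shard_size = len(batch) // num_towers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_towers)]
    # Every tower reads the same shared parameter w but sees different data.
    grads = [tower_gradient(w, shard) for shard in shards]
    mean_grad = sum(grads) / num_towers
    return w - learning_rate * mean_grad


# Toy data drawn from y = 2 * x, so training should move w toward 2.
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, batch, num_towers=2, learning_rate=0.01)
print(round(w, 6))
```

With equally sized shards, the averaged tower gradient equals the gradient over the whole batch, which is why the towers can share a single set of parameters.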

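The best approach to handling parameter updates depends on the model, the hardware, and how the hardware is configured. For example, two systems can both be built with NVIDIA Tesla P100s, with one connecting them over PCIe and the other over NVLink. In that scenario, the optimal solution for each system may differ. For real-world cases, see the benchmark page, which lists the settings that were optimal on a variety of platforms. Below is a summary of the benchmark results across platforms and configurations:

Tesla K80: If the GPUs are on the same PCIe bus and can use NVIDIA GPUDirect to communicate with each other, distributing the variables equally across the GPUs is the best approach. If the GPUs cannot use GPUDirect, placing the variables on the CPU is the best option.
Titan X (Maxwell and Pascal), M40, P100, and similar: For models such as ResNet and Inception V3, placing variables on the CPU is the optimal setting, but for models with a large number of parameters, such as AlexNet and VGG, using GPUs with NCCL is better.

A common way to manage where variables are placed is to create a method that determines the device for each op and to apply it through with tf.device():. Consider a scenario where a model is trained on 2 GPUs and the variables are placed on the CPU. A loop creates the towers and places them on the two GPUs. A custom device placement method would be created that watches for Ops of type Variable, VariableV2, and VarHandleOp and indicates that they are to be placed on the CPU. All other Ops would be placed on the target GPU. The building of the graph would proceed as follows:

On the first loop, a tower of the model is created for gpu:0. During the placement of the ops, the custom device placement method indicates that variables are to be placed on cpu:0 and all other ops on gpu:0.
On the second loop, reuse is set to True to indicate that variables are to be reused, and then the tower is created on gpu:1. During the placement of the ops associated with the tower, the variables that were placed on cpu:0 are reused, and all other ops that are created are placed on gpu:1.

The final result is that all of the variables are placed on the CPU, and each GPU holds a copy of all of the computational ops associated with the model.

The code snippet below illustrates two different approaches to variable placement: one places variables on the CPU, and the other distributes variables evenly across the GPUs.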

import operator

import tensorflow as tf


class GpuParamServerDeviceSetter(object):
  """Used with tf.device() to place variables on the least loaded GPU.

    A common use for this class is to pass a list of GPU devices, e.g. ['gpu:0',
    'gpu:1','gpu:2'], as ps_devices.  When each variable is placed, it will be
    placed on the least loaded gpu. All other Ops, which will be the computation
    Ops, will be placed on the worker_device.
  """

  def __init__(self, worker_device, ps_devices):
    """Initializer for GpuParamServerDeviceSetter.
    Args:
      worker_device: the device to use for computation Ops.
      ps_devices: a list of devices to use for Variable Ops. Each variable is
      assigned to the least loaded device.
    """
    self.ps_devices = ps_devices
    self.worker_device = worker_device
    self.ps_sizes = [0] * len(self.ps_devices)

  def __call__(self, op):
    if op.device:
      return op.device
    if op.type not in ['Variable', 'VariableV2', 'VarHandleOp']:
      return self.worker_device

    # Gets the least loaded ps_device
    device_index, _ = min(enumerate(self.ps_sizes), key=operator.itemgetter(1))
    device_name = self.ps_devices[device_index]
    var_size = op.outputs[0].get_shape().num_elements()
    self.ps_sizes[device_index] += var_size

    return device_name

def _create_device_setter(is_cpu_ps, worker, num_gpus):
  """Create device setter object."""
  if is_cpu_ps:
    # tf.train.replica_device_setter supports placing variables on the CPU, all
    # on one GPU, or on ps_servers defined in a cluster_spec.
    return tf.train.replica_device_setter(
        worker_device=worker, ps_device='/cpu:0', ps_tasks=1)
  else:
    gpus = ['/gpu:%d' % i for i in range(num_gpus)]
    return GpuParamServerDeviceSetter(worker, gpus)

# The method below is a modified snippet from the full example.
def _resnet_model_fn():
    # When set to False, variables are placed on the least loaded GPU. If set
    # to True, the variables will be placed on the CPU.
    is_cpu_ps = False

    # Loops over the number of GPUs and creates a copy ("tower") of the model on
    # each GPU.
    for i in range(num_gpus):
      worker = '/gpu:%d' % i
      # Creates a device setter used to determine where Ops are to be placed.
      device_setter = _create_device_setter(is_cpu_ps, worker, FLAGS.num_gpus)
      # Creates variables on the first loop.  On subsequent loops reuse is set
      # to True, which results in the "towers" sharing variables.
      with tf.variable_scope('resnet', reuse=bool(i != 0)):
        with tf.name_scope('tower_%d' % i) as name_scope:
          # tf.device calls the device_setter for each Op that is created.
          # device_setter returns the device the Op is to be placed on.
          with tf.device(device_setter):
            # Creates the "tower".
            _tower_fn(is_training, weight_decay, tower_features[i],
                      tower_labels[i], tower_losses, tower_gradvars,
                      tower_preds, False)
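
The greedy "least loaded device" rule used by GpuParamServerDeviceSetter above can be tried out without building a graph. The sketch below reproduces only that selection logic in plain Python. assign_least_loaded, the variable names, and the sizes are hypothetical values for illustration.

```python
import operator


def assign_least_loaded(var_sizes, devices):
    """Assigns each (name, num_elements) pair to the currently least loaded device.

    Mirrors the selection rule in GpuParamServerDeviceSetter.__call__: pick the
    device with the smallest running total, then add the variable's size to it.
    Ties go to the device that appears first in the list.
    """
    loads = [0] * len(devices)
    placement = {}
    for name, size in var_sizes:
        index, _ = min(enumerate(loads), key=operator.itemgetter(1))
        placement[name] = devices[index]
        loads[index] += size
    return placement, loads


# Hypothetical variables, in creation order, with their element counts.
variables = [("conv1", 9408), ("fc", 2048000), ("bn1", 64), ("conv2", 36864)]
placement, loads = assign_least_loaded(variables, ["/gpu:0", "/gpu:1"])
print(placement)
print(loads)
```

Because the rule is greedy and order dependent, one very large variable created early can leave the devices unevenly loaded, as the final load totals in this example show.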

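In the near future, the multi-GPU placement code above will serve only as an illustration, because easy-to-use high-level methods will support a wide range of popular approaches. This example will continue to be updated as the API expands and evolves, and eventually the high-level API will cover multi-GPU scenarios.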

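3. Optimization techniques for CPUs

CPUs achieve optimal performance only when TensorFlow is built and installed from source for the CPU of the target platform.

Beyond using the latest instruction sets, Intel has added support for Intel® MKL-DNN to TensorFlow.

Listed below are the two settings used to optimize CPU performance by adjusting the thread pools:
- intra_op_parallelism_threads: nodes that can use multiple threads to parallelize their execution schedule their individual pieces into this pool.
- inter_op_parallelism_threads: all ready nodes are scheduled in this pool.

These settings are set through tf.ConfigProto, which is then passed to tf.Session, as shown in the code below. If either setting is unset or set to 0, it defaults to the number of logical CPU cores. Testing has shown that the default is effective for processors with anywhere from 4 to 70+ logical cores. A common alternative optimization is to set the number of threads in both pools to the number of physical cores rather than logical cores.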

  config = tf.ConfigProto()
  config.intra_op_parallelism_threads = 44
  config.inter_op_parallelism_threads = 44
  tf.Session(config=config)
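
The "physical cores rather than logical cores" alternative can be sketched as follows. The Python standard library reports only logical cores, so the number of hardware threads per core is passed in as an assumption here. thread_pool_sizes is a hypothetical helper, and the returned values would still need to be copied onto a tf.ConfigProto as in the snippet above.

```python
import os


def thread_pool_sizes(logical_cores=None, threads_per_core=1):
    """Suggests thread pool sizes equal to the estimated physical core count.

    threads_per_core is an assumption supplied by the caller (for example 2 on
    a machine with hyper-threading enabled); it is not detected automatically.
    """
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1
    physical_cores = max(1, logical_cores // threads_per_core)
    return {
        "intra_op_parallelism_threads": physical_cores,
        "inter_op_parallelism_threads": physical_cores,
    }


# Hypothetical machine: 88 logical cores with 2 hardware threads per core.
print(thread_pool_sizes(logical_cores=88, threads_per_core=2))
```

Whether this beats the defaults depends on the workload, so it is worth benchmarking both settings.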
