【TensorFlow】Quantization

Search keywords: quantize tensorflow

1. Question 1: How does TensorFlow do quantization and dequantization?

Details
According to the blog post https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/ *(key reference!)*, TensorFlow quantizes values before they go into a layer. After being processed by the layer, the values are dequantized. TensorFlow quantizes values by rescaling them into the range 0 to 255, so it needs to keep the “min” and “max” in order to dequantize the values afterwards.
I would like to ask: 1. How are the “min” and “max” in the outputs of a “quantization” op determined? I mean, if we simply find the minimum and maximum values and map them to 0 and 255, we will get overflow or underflow when doing convolution. 2. How are the “min” and “max” in the outputs of a “convolution” op determined? Both weights and activations are quantized, so there are two sets of “min” and “max”. How does a convolution op combine them to form a single set of “min” and “max”?


Answer
TensorFlow uses, among other libraries, gemmlowp for low-precision matrix multiplications. Although 8-bit values are used as inputs, intermediate results are 32-bit values. These 32-bit values are converted back to 8 bits before the results are returned.

From https://github.com/google/gemmlowp/blob/master/doc/low-precision.md :

To avoid overflow, we internally accumulate results on more than 8 bits, and at the end we keep only some significant 8 bits.
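
As a rough NumPy sketch of that scheme (an illustration only, not gemmlowp's actual kernels; the offsets and scale below are made-up parameters), the 8-bit inputs are multiplied into a 32-bit accumulator and only requantized back to 8 bits at the end:

import numpy as np

def quantized_matmul_sketch(lhs_u8, rhs_u8, lhs_offset, rhs_offset,
                            result_offset, result_scale):
  # Accumulate in int32 so the sums of many 8-bit products cannot overflow.
  acc = np.matmul(lhs_u8.astype(np.int32) - lhs_offset,
                  rhs_u8.astype(np.int32) - rhs_offset)
  # Requantize: rescale the 32-bit accumulators down to the output's
  # 8-bit range, then clamp to [0, 255].
  result = np.round(acc * result_scale) + result_offset
  return np.clip(result, 0, 255).astype(np.uint8)

lhs = np.random.randint(0, 256, (4, 8), dtype=np.uint8)
rhs = np.random.randint(0, 256, (8, 3), dtype=np.uint8)
print(quantized_matmul_sketch(lhs, rhs, lhs_offset=128, rhs_offset=128,
                              result_offset=128, result_scale=0.01))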


2. How to Quantize Neural Networks with TensorFlow (official blog guide)

2.1 Quantizing an existing model and testing it

Code: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/quantize/python/quantize_graph.py

curl http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz -o /tmp/inceptionv3.tgz
tar xzf /tmp/inceptionv3.tgz -C /tmp/
bazel build tensorflow/contrib/quantization/tools:quantize_graph
bazel-bin/tensorflow/contrib/quantization/tools/quantize_graph \
--input=/tmp/classify_image_graph_def.pb \
--output_node_names="softmax" --output=/tmp/quantized_graph.pb \
--mode=eightbit

This will produce a new model that runs the same operations as the original, but with eight bit calculations internally, and all weights quantized as well. If you look at the file size, you’ll see it’s about a quarter of the original (23MB versus 91MB). You can still run this model using exactly the same inputs and outputs though, and you should get equivalent results. Here’s an example:

bazel build tensorflow/examples/label_image:label_image
bazel-bin/tensorflow/examples/label_image/label_image \
--input_graph=/tmp/quantized_graph.pb \
--input_width=299 \
--input_height=299 \
--mean_value=128 \
--std_value=128 \
--input_layer_name="Mul:0" \
--output_layer_name="softmax:0"

2.2 What representation do quantized tensors use?

We approach converting arrays of floating-point numbers into eight-bit representations as a compression problem. We know that the weight and activation tensors in trained neural network models tend to have values distributed across comparatively small ranges (for example, roughly -15 to +15 for weights, or -500 to 1000 for activations on an image model, though the exact numbers will vary). We also know from experiment that neural networks tend to be very robust in the face of noise, so the noise-like error introduced by quantizing down to a small set of values will not seriously hurt the precision of the overall results. We also want to pick a representation that is easy to perform calculations on, especially the large matrix multiplications that form the bulk of the work needed to run a model.
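
As a concrete illustration of that mapping (a hedged example, not taken from the TensorFlow source), with eight bits the encoding is essentially quantized = round(255 * (value - min) / (max - min)), and decoding is dequantized = min + quantized * (max - min) / 255. For a weight range of min = -15 and max = +15, the value 3.0 encodes to round(255 * 18 / 30) = 153, and decoding 153 gives -15 + 153 * 30 / 255 = 3.0 again.
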
How Does the Quantization Process Work?

2.3 Eight-bit arithmetic

using gemmlowp

2.3.1 Quantization implementation code (important!)

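A minimal Python sketch of the min/max scheme described above (an illustration only, not the actual TensorFlow kernels):

import numpy as np

def quantize(values, min_val, max_val):
  # Map floats in [min_val, max_val] linearly onto uint8 codes 0..255.
  scale = (max_val - min_val) / 255.0
  codes = np.round((values - min_val) / scale)
  return np.clip(codes, 0, 255).astype(np.uint8)

def dequantize(codes, min_val, max_val):
  # Recover approximate floats from the codes plus the stored min/max.
  scale = (max_val - min_val) / 255.0
  return min_val + codes.astype(np.float32) * scale

weights = np.array([-14.8, -3.2, 0.0, 7.5, 15.0], dtype=np.float32)
q = quantize(weights, -15.0, 15.0)
print(q, dequantize(q, -15.0, 15.0))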

2.3.2 How it works: The low-precision paradigm in gemmlowp, and how it's implemented (gemmlowp)

“Low-precision” means that the input and output matrix entries are integers on at most 8 bits. The scalar type is uint8_t.
gemmlowp is flexible enough to support multiple low-precision paradigms, i.e. multiple ways that a meaning is attached to 8-bit values so that a computation can rely on an 8-bit GEMM provided by gemmlowp.


Building a quantization paradigm from first principles

  1. Quantization as an affine map.
  2. Domain-specific constraint: the real value 0 must be exactly representable.
  3. The final form of the quantization equation
  4. Quantizing a matrix multiplication
  5. Implementation of quantized matrix multiplication
  6. How this is implemented in gemmlowp
  7. How this differs from the older legacy gemmlowp quantization paradigm
  8. Example code illustrating the new quantization paradigm
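
A small Python sketch of how steps 1-3 above play out (assuming the textbook affine form real = scale * (quantized - zero_point) and uint8; this is not gemmlowp's actual code):

import numpy as np

def choose_quantization_params(rmin, rmax, qmin=0, qmax=255):
  # The affine map must cover [rmin, rmax] and represent real 0 exactly,
  # so widen the range to contain 0 before deriving scale and zero_point.
  rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
  scale = (rmax - rmin) / float(qmax - qmin)
  # zero_point is the quantized value that maps back exactly to real 0.
  zero_point = int(round(qmin - rmin / scale))
  return scale, min(max(zero_point, qmin), qmax)

def quantize(real_values, scale, zero_point):
  # quantized = zero_point + real / scale, clamped to the uint8 range.
  q = np.round(real_values / scale) + zero_point
  return np.clip(q, 0, 255).astype(np.uint8)

scale, zp = choose_quantization_params(-1.0, 2.0)
print(scale, zp, quantize(np.array([-1.0, 0.0, 2.0]), scale, zp))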

3. Google MobileNet quantization implementation

This section shows how fake-quantization ops are inserted during the training process.
3.1 mobilenet_v1_train.py

def build_model():
  """Builds graph for model to train with rewrites for quantization.
  Returns:
    g: Graph with fake quantization ops and batch norm folding suitable for
    training quantized weights.
    train_tensor: Train op for execution during training.
  """
  g = tf.Graph()
  with g.as_default(), tf.device(
      tf.train.replica_device_setter(FLAGS.ps_tasks)):
    inputs, labels = imagenet_input(is_training=True)
    with slim.arg_scope(mobilenet_v1.mobilenet_v1_arg_scope(is_training=True)):
      logits, _ = mobilenet_v1.mobilenet_v1(
          inputs,
          is_training=True,
          depth_multiplier=FLAGS.depth_multiplier,
          num_classes=FLAGS.num_classes)

    tf.losses.softmax_cross_entropy(labels, logits)

    # Call rewriter to produce graph with fake quant ops and folded batch norms
    # quant_delay delays start of quantization till quant_delay steps, allowing
    # for better model accuracy.
    if FLAGS.quantize:
      tf.contrib.quantize.create_training_graph(quant_delay=get_quant_delay())

3.2 quantize_graph.py

# Create the training graph with fake-quantization ops
def create_training_graph(input_graph=None, quant_delay=0):
  """Rewrites a training input_graph in place for simulated quantization.
  Variables added by the rewrite get added to the global variables collection.
  The graph has fake quantization ops inserted to simulate the error
  introduced by quantization. Since the graph is transformed in place,
  the expected behavior of previously held references to nodes and tensors may
  change.
  The default value of quant_delay is suitable for finetuning an already trained
  floating point model (recommended).
  If one wants to train a quantized model from scratch, quant_delay should be
  set to the number of steps it takes the floating point model to converge.
  Quantization will be activated at this point and effectively finetune the
  model. If quant_delay is not provided when training from scratch, training can
  often fail.
  Args:
    input_graph: The tf.Graph to be transformed.
    quant_delay: Number of steps after which weights and activations are
      quantized during training.
  Raises:
    ValueError: If elements contains an element that isn't a tf.Tensor or
      tf.Operation.
  """
  # TODO(raghuramank) Need to have freeze_bn_delay be a function of batch size
  # Currently the values below are hardcoded for mobilenetV1 on imagenet
  # Please use the experimental API if you need to tune these values.
  freeze_bn_delay = None

  _create_graph(
      input_graph=input_graph,
      is_training=True,
      quant_delay=quant_delay,
      freeze_bn_delay=freeze_bn_delay)

# Continue into _create_graph
def _create_graph(input_graph=None,
                  is_training=True,
                  weight_bits=8,
                  activation_bits=8,
                  quant_delay=None,
                  freeze_bn_delay=None,
                  scope=None):
  """Rewrites an input_graph in place for simulated quantization.
  The graph has fake quantization ops inserted to simulate the error
  introduced by quantization. Since the graph is transformed in place,
  the expected behavior of previously held references to nodes and tensors may
  change.
  Args:
    input_graph: The tf.Graph to be transformed, if None then defaults to the
      default graph.
    is_training: Whether quantizing training or eval graph.
    weight_bits: Number of bits to use for quantizing weights.
    activation_bits: Number of bits to use for quantizing activations.
    quant_delay: Number of steps after which weights and activations are
      quantized during training.
    freeze_bn_delay: Number of steps after which moving mean and variance are
      frozen and used instead of batch statistics during training.
      freeze_bn_delay should be greater than quant_delay and should correspond
      to the number of steps when training has almost converged
    scope: The scope to be transformed. If it's not None, only the ops which
      are in this scope will be transformed.
  Raises:
    ValueError: If elements contains an element that isn't a tf.Tensor or
      tf.Operation.
  """

  if input_graph is None:
    input_graph = ops.get_default_graph()
  with input_graph.as_default():
    fold_batch_norms.FoldBatchNorms(
        input_graph,
        freeze_batch_norm_delay=freeze_bn_delay,
        is_training=is_training)
    quantize.Quantize(
        input_graph,
        is_training,
        quant_delay=quant_delay,
        weight_bits=weight_bits,
        activation_bits=activation_bits,
        scope=scope)

3.3 quantize.py

def Quantize(graph,
             is_training,
             weight_bits=8,
             activation_bits=8,
             ema_decay=0.999,
             quant_delay=None,
             vars_collection=ops.GraphKeys.GLOBAL_VARIABLES,
             scope=None):
  """Updates graph with quantization operations.
  Currently we quantize the following tensors:
  * Conv/MatMul: Quantize the weights if it matches.
  * Activation: Quantize the output if it matches.
  * Bypass/Post-activation Bypass: Quantize both input and output
    if it matches.
  Args:
    graph: Graph to modify.
    is_training: Whether quantizing training graph or eval graph.
    weight_bits: Number of bits to use for quantizing weights.
    activation_bits: Number of bits to use for quantizing activations.
    ema_decay: (Optional) Float, EMA decay parameter.  EMA is used to update
      quantization intervals for quantizing activations (see here about EMA:
      https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average).
    quant_delay: (Optional, default None) Int, count of global steps for which
      to delay quantization.  This helps weights stabilize at the start of
      training.
    vars_collection: (Optional) Collection where to store the variables for
      quantization interval ends.
    scope: The scope to be transformed. If it's not None, only the ops which
      are in this scope will be transformed.
  Raises:
    ValueError: When quantization fails.
  """
  ...
  for layer_match in _FindLayersToQuantize(graph):
    ...
    _InsertQuantOp(
        add_context,
        'act_quant',
        layer_match.activation_op,
        consumer_ops,
        is_training,
        moving_avg=True,
        ema_decay=ema_decay,
        quant_delay=quant_delay,
        vars_collection=vars_collection,
        bits=activation_bits,
        init_min=0.0,
        producer_scope=scope)

# Next: _FindLayersToQuantize
def _FindLayersToQuantize(graph):
  """Matches layers in graph to quantize.
  The following patterns get matched. Nodes surrounded by [] will be
  optionally matched:
          weight|folded_weight
                /
         conv|fc
            |
    [post_conv_correction]
            |
     biasadd|folded_bias
            |
         [bypass]
            |
        activation
            |
   [post_activation_bypass]
  Match replacements:
    If weight|folded_weight is found, FakeQuant is added afterwards.
    If bypass is found, FakeQuant is added before and after.
    If activation is found, FakeQuant is added afterwards.
    If post_activation_bypass is found, FakeQuant is added afterwards.
  Args:
    graph: Graph to perform match on.
  Returns:
    list of _LayerMatches.
  """
# Next: _InsertQuantOp
def _InsertQuantOp(context,
                   name,
                   producer,
                   consumers,
                   is_training,
                   moving_avg=True,
                   init_min=-6.0,
                   init_max=6.0,
                   bits=8,
                   ema_decay=0.999,
                   quant_delay=None,
                   vars_collection=ops.GraphKeys.GLOBAL_VARIABLES,
                   narrow_range=False,
                   producer_scope=None,
                   consumer_scope=None):
  """Inserts a quant op between a producer op and (multiple) consumer ops.
  Args:
    context: Context where producer and consumer operations are nested.
    name: Name for the new quantization op within the context.
    producer: Producer operation of the pairs where quantization will be
      inserted.
    consumers: Consumer operations of the pairs.
    is_training: Whether quantizing training graph or eval graph.
    moving_avg: Specifies whether to use exponential moving average or just
      the last value seen.
    init_min: Starting minimum value for the new quantization op.
    init_max: Starting maximum value for the new quantization op.
    bits: Number of bits to use for quantization, must be between 2 and 8.
    ema_decay: (Optional) Float, EMA decay parameter.  EMA is used to update
      quantization intervals for quantizing activations (see here about EMA:
      https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average).
    quant_delay: (Optional, default None) Int, count of global steps for which
      to delay quantization.  This helps weights stabilize at the start of
      training.
    vars_collection: (Optional) Collection where to store the variables for
      quantization interval ends.
    narrow_range: Whether to use the narrow quantization range
      [1; 2^bits - 1] or wide range [0; 2^bits - 1].
    producer_scope: The restriction of producer scope. If not None, the new op
      will be inserted only when the producer is in this scope.
    consumer_scope: The restriction of producer scope. If not None, the new op
      will be inserted only when all the consumers are in this scope.
  Raises:
    ValueError: When producer operation is not directly connected to the
      consumer operation.
  """
  # Continuing inside _InsertQuantOp:
  ### Build the fake-quant op: moving-average min/max or last value seen
  if moving_avg:
    quant = (
        quant_ops.MovingAvgQuantize(
            inputs,
            init_min=init_min,
            init_max=init_max,
            ema_decay=ema_decay,
            is_training=is_training,
            num_bits=bits,
            narrow_range=narrow_range,
            vars_collection=vars_collection,
            name_prefix=name_prefix))
  else:
    quant = (
        quant_ops.LastValueQuantize(
            inputs,
            init_min=init_min,
            init_max=init_max,
            is_training=is_training,
            num_bits=bits,
            narrow_range=narrow_range,
            vars_collection=vars_collection,
            name_prefix=name_prefix))
  ### Optionally delay quantization until quant_delay steps have passed
  if quant_delay and quant_delay > 0:
    activate_quant = math_ops.greater_equal(
        common.CreateOrGetQuantizationStep(),
        quant_delay,
        name=name_prefix + '/activate_quant')
    quant = control_flow_ops.cond(
        activate_quant,
        lambda: quant,
        lambda: inputs,
        name=name_prefix + '/delayed_quant')
  ### Reroute consumer ops to read the quantized tensor
  if consumers:
    tensors_modified_count = graph_editor.reroute_ts(
        [quant], [inputs], can_modify=consumers)
    # Some operations can have multiple output tensors going to the same
    # consumer. Since consumers is a set, we need to ensure that
    # tensors_modified_count is greater than or equal to the length of the set
    # of consumers.
    if tensors_modified_count < len(consumers):
      raise ValueError('No inputs quantized for ops: [%s]' % ', '.join(
          [consumer.name for consumer in consumers]))
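
For intuition about what the inserted op computes: the forward pass behaves like TensorFlow's fake-quant primitives, quantizing to an 8-bit grid over [min, max] and immediately dequantizing, so the tensor stays float but carries the quantization error. A minimal sketch using the fixed-range variant with the same init_min/init_max defaults as above (assuming a TF 1.x session runtime):

import tensorflow as tf  # TF 1.x

x = tf.constant([-8.0, -1.3, 0.0, 2.7, 10.0])
# Values are snapped to the 8-bit grid spanning [-6, 6] and clipped to it.
fq = tf.fake_quant_with_min_max_args(x, min=-6.0, max=6.0, num_bits=8)
with tf.Session() as sess:
  print(sess.run(fq))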

4. Other Sources

1. tensorflow/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc

2. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/quantization_utils.h#L32

3. tensorflow/contrib/quantization/tools:quantize_graph (this tool converts an existing model into a quantized graph)

4. tf.quantize

Defined in tensorflow/python/ops/array_ops.py.

tf.quantize(
    input,
    min_range,
    max_range,
    T,
    mode='MIN_COMBINED',
    round_mode='HALF_AWAY_FROM_ZERO',
    name=None
)
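
A minimal usage sketch (assuming a TF 1.x session runtime); besides the quantized tensor, the op returns the actual min/max it used, which are needed later to dequantize:

import tensorflow as tf  # TF 1.x

x = tf.constant([-1.0, 0.0, 0.25, 1.0])
q, out_min, out_max = tf.quantize(x, min_range=-1.0, max_range=1.0, T=tf.quint8)
with tf.Session() as sess:
  print(sess.run([q, out_min, out_max]))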

5. Fixed Point Quantization

5.1 Quantization training with TensorFlow
TensorFlow can train models with quantization in the loop. Because training requires small gradient adjustments, floating point values are still used. To keep models as floating point while adding the quantization error in the training loop, fake quantization nodes simulate the effect of quantization in the forward and backward passes.

Since it’s difficult to add these fake quantization operations to all the required locations in the model, there’s a function available that rewrites the training graph. To create a fake quantized training graph:

# Build forward pass of model.
loss = tf.losses.get_total_loss()

# Call the training rewrite which rewrites the graph in-place with
# FakeQuantization nodes and folds batchnorm for training. It is
# often needed to fine tune a floating point model for quantization
# with this training tool. When training from scratch, quant_delay
# can be used to activate quantization after training to converge
# with the float graph, effectively fine-tuning the model.
tf.contrib.quantize.create_training_graph(quant_delay=2000000)

# Call backward pass optimizer as usual.
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
optimizer.minimize(loss)

The rewritten eval graph is non-trivially different from the training graph since the quantization ops affect the batch normalization step. Because of this, we’ve added a separate rewrite for the eval graph:

# Build eval model
logits = tf.nn.softmax_cross_entropy_with_logits(...)

# Call the eval rewrite which rewrites the graph in-place with
# FakeQuantization nodes and fold batchnorm for eval.
tf.contrib.quantize.create_eval_graph()

# Save the checkpoint and eval graph proto to disk for freezing
# and providing to TFLite.
with open(eval_graph_file, 'w') as f:
  f.write(str(g.as_graph_def()))
saver = tf.train.Saver()
saver.save(sess, checkpoint_name)

Methods to rewrite the training and eval graphs are an active area of research and experimentation. Although rewrites and quantized training might not work or improve performance for all models, we are working to generalize these techniques.

5.2 Generating fully quantized models
The previously demonstrated after-rewrite eval graph only simulates quantization. To generate real fixed-point computations from a trained quantization model, convert it to a fixed-point kernel. TensorFlow Lite supports this conversion from the graph resulting from create_eval_graph.

First, create a frozen graph that will be the input for the TensorFlow Lite toolchain:

bazel build tensorflow/python/tools:freeze_graph && \
  bazel-bin/tensorflow/python/tools/freeze_graph \
  --input_graph=eval_graph_def.pb \
  --input_checkpoint=checkpoint \
  --output_graph=frozen_eval_graph.pb --output_node_names=outputs

Provide this to the TensorFlow Lite Optimizing Converter (TOCO) to get a fully quantized TensorFlow Lite model:

bazel build tensorflow/contrib/lite/toco:toco && \
  ./bazel-bin/tensorflow/contrib/lite/toco/toco \
  --input_file=frozen_eval_graph.pb \
  --output_file=tflite_model.tflite \
  --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE \
  --inference_type=QUANTIZED_UINT8 \
  --input_shape="1,224,224,3" \
  --input_array=input \
  --output_array=outputs \
  --std_value=127.5 --mean_value=127.5

See the documentation for tf.contrib.quantize and TensorFlow Lite.

6. quantize.cc

const MinMax& GetOrComputeMinMax(Model* model, const string& array_name) {
  auto& array = model->GetArray(array_name);
  // Normally we should have a MinMax recorded on this Array,
  // so we just use it.
  if (array.minmax != nullptr) {
    return *array.minmax;
  }

  // We don't have a MinMax. That's bad news: we need
  // the graph to provide MinMax info for all arrays in order
  // for inference to reproduce faithfully the same quantization
  // error as the training process had.
  //
  // But we still want to support a fallback for constant arrays,
  // just using the plain min and max computed from array elements.
  // We should hopefully never rely on that in production, as that
  // will not give very good accuracy as that typically won't be
  // exactly what the training process used. But it will be useful
  // to allow easily trying out quantization even if the graph
  // lacks some minmax information.
  if (array.buffer != nullptr) {
    LOG(WARNING)
        << "Constant array " << array_name
        << " lacks MinMax information. To make up for that, we will now compute"
        << " the MinMax from actual array elements. That will result in"
        << " quantization parameters that probably do not match whichever "
           "arithmetic"
        << " was used during training, and thus will probably be a cause of "
           "poor"
        << " inference accuracy.";
    CHECK(array.buffer->type == ArrayDataType::kFloat);
    const auto& data = array.GetBuffer<ArrayDataType::kFloat>().data;
    // We always want [min, max] to contain 0.
    float min = 0.f;
    float max = 0.f;
    for (auto val : data) {
      min = std::min(min, val);
      max = std::max(max, val);
    }
    if (min == 0.f && max == 0.f) {
      // Prevent downstream anger from quantized math that expects min and max
      // to not be equal.
      max = 1.f;
    }
    auto& minmax = array.GetOrCreateMinMax();
    minmax.min = min;
    minmax.max = max;
    return minmax;
  }

  LOG(FATAL) << "Array " << array_name
             << " does not have MinMax information, "
                "and is not a constant array. Cannot "
                "proceed with quantization.";
}
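
The fallback above, paraphrased as a few lines of Python (a simplified sketch, not TOCO itself):

def fallback_min_max(constant_values):
  # Always include 0 in [min, max], and avoid a degenerate all-zero range
  # that downstream quantized math cannot handle.
  lo = min(0.0, min(constant_values))
  hi = max(0.0, max(constant_values))
  if lo == 0.0 and hi == 0.0:
    hi = 1.0
  return lo, hi

print(fallback_min_max([0.1, -0.4, 2.5]))  # (-0.4, 2.5)
print(fallback_min_max([0.0, 0.0]))        # (0.0, 1.0)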


Reposted from blog.csdn.net/yifen4234/article/details/80382956