AutoML helps AI performance tuning


By Suyog Gupta and Mingxing Tan

For decades, as predicted by Moore's Law, computer processor performance has doubled every few years by shrinking the transistors in each chip. As shrinking transistors further becomes increasingly difficult, the industry has turned to developing domain-specific architectures, such as hardware accelerators, to continue improving computing power. This is especially true in machine learning, where great effort goes into building specialized architectures for neural network acceleration. Ironically, although these architectures have steadily proliferated in data centers and on edge computing platforms, few neural networks are specifically optimized to take full advantage of the underlying hardware.

Today, we are pleased to announce EfficientNet-EdgeTPU, a family of image classification models derived from EfficientNets and customized to run optimally on Google's Edge TPU, an energy-efficient hardware accelerator that developers can use via the Coral Dev Board and the USB Accelerator. Through this model customization, the Edge TPU not only delivers real-time image classification performance but also achieves accuracy comparable to that of large, compute-intensive models running in data centers.

Note: EfficientNet-EdgeTPU link

https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/edgetpu

Use AutoML to customize EfficientNets for Edge TPU

EfficientNets have been shown to achieve state-of-the-art (SOTA) accuracy on image classification tasks while dramatically reducing model size and computational complexity. To build EfficientNets designed specifically to exploit the Edge TPU's accelerator architecture, we invoke the AutoML MNAS framework and augment EfficientNet's original neural architecture search space with building blocks that execute efficiently on the Edge TPU (detailed below). We also built and integrated a "latency predictor" module, which estimates a model's latency when executed on the Edge TPU by running the model on a cycle-accurate architecture simulator. The AutoML MNAS controller searches this space using a reinforcement learning algorithm, trying to maximize a reward that is a joint function of the predicted latency and the model accuracy. From past experience we know that the Edge TPU's energy efficiency and performance tend to be maximized when the model fits in on-chip memory, so we also modified the reward function to give a higher reward to models that satisfy this constraint.

Note: AutoML MNAS framework link

https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html

Reinforcement learning link

https://en.wikipedia.org/wiki/Reinforcement_learning
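To make the joint reward concrete, here is a minimal sketch in Python. The first function is the latency-aware reward from the MnasNet paper (accuracy scaled by a power of the latency-to-target ratio); the second is a hypothetical variant of the on-chip-memory bonus described above — the exact bonus value, the 8 MB SRAM size, and the function names are illustrative assumptions, not the authors' published implementation.

```python
def mnas_reward(accuracy, latency_ms, target_ms, w=-0.07):
    """Latency-aware reward from the MnasNet paper:
    accuracy * (latency / target) ** w, with w < 0 so models
    slower than the target latency are penalized."""
    return accuracy * (latency_ms / target_ms) ** w

def edge_tpu_reward(accuracy, latency_ms, target_ms,
                    model_bytes, sram_bytes=8 * 1024 ** 2,
                    fit_bonus=1.05, w=-0.07):
    """Hypothetical modified reward: boost the score when the
    model's parameters fit in the Edge TPU's on-chip memory.
    (The actual bonus used by the authors is not published.)"""
    base = mnas_reward(accuracy, latency_ms, target_ms, w)
    return base * fit_bonus if model_bytes <= sram_bytes else base
```

The controller samples an architecture, the latency predictor supplies `latency_ms`, and the scalar reward is fed back to the reinforcement learning update.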


AutoML overall process for designing custom EfficientNet-EdgeTPU models

Search space design

When performing the architecture search described above, one must consider that EfficientNets rely primarily on depthwise-separable convolutions, which decompose a conventional convolution to reduce the number of parameters and the amount of computation. However, for certain configurations a conventional convolution, despite requiring more computation, uses the Edge TPU architecture more efficiently and runs faster. While it is possible (though tedious) to manually craft a network that uses the best combination of the different building blocks, augmenting the AutoML search space with these accelerator-optimized blocks is a more scalable approach.

Note: depthwise-separable convolution link

https://arxiv.org/abs/1610.02357


A conventional 3x3 convolution (right) involves more multiply-accumulate (MAC) operations than a depthwise-separable convolution (left), but for certain input/output shapes the former utilizes the Edge TPU hardware roughly three times more effectively and therefore executes faster on the Edge TPU
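The MAC-count gap between the two convolution types is easy to quantify. The sketch below counts MACs for a stride-1, same-padding layer; the 56x56 feature-map size and 64-to-128 channel counts are illustrative values, not shapes taken from the EfficientNet-EdgeTPU models.

```python
def conv2d_macs(h, w, c_in, c_out, k=3):
    """MAC count of a regular k x k convolution
    ('same' padding, stride 1)."""
    return h * w * c_in * c_out * k * k

def separable_macs(h, w, c_in, c_out, k=3):
    """MAC count of a depthwise-separable convolution: a k x k
    depthwise pass followed by a 1 x 1 pointwise projection."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# For a 56x56 feature map going from 64 to 128 channels, the
# regular convolution needs roughly 8x the MACs:
regular = conv2d_macs(56, 56, 64, 128)      # 231,211,008 MACs
separable = separable_macs(56, 56, 64, 128)  # 27,496,448 MACs
```

This is exactly the trade-off the search exploits: despite the higher MAC count, a regular convolution can still finish sooner on the Edge TPU when it keeps the accelerator's compute units better utilized.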

Certain operations, such as the swish nonlinearity and the squeeze-and-excitation block, require modifications to the Edge TPU compiler for full support. Removing them from the search space naturally produces models that port easily to the Edge TPU hardware. These operations tend to improve model quality slightly, so by excluding them from the search space we effectively direct AutoML to find alternative network architectures that compensate for any potential quality loss.

Note: swish nonlinearity link

https://arxiv.org/pdf/1710.05941.pdf

squeeze-and-excitation block link

https://arxiv.org/abs/1709.01507
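For reference, swish itself is a tiny pointwise operation; the cost is not the arithmetic but the compiler support for it on the accelerator. A minimal sketch, with ReLU shown as the kind of natively supported activation the search can fall back to (that ReLU is the replacement actually chosen is our assumption for illustration):

```python
import math

def swish(x):
    """Swish activation: x * sigmoid(x) (Ramachandran et al., 2017)."""
    return x * (1.0 / (1.0 + math.exp(-x)))

def relu(x):
    """ReLU, a simple activation with native accelerator support."""
    return max(0.0, x)

# Swish is smooth and non-monotonic near zero, unlike ReLU:
values = [swish(x) for x in (-2.0, 0.0, 2.0)]
```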

Model performance

The neural architecture search (NAS) described above produced a baseline model, EfficientNet-EdgeTPU-S, which we then scaled up using EfficientNet's compound scaling method to produce the -M and -L models. Compound scaling selects the best combination of input image resolution, network width, and depth scaling to build a larger, more accurate model. The -M and -L models achieve higher accuracy at the cost of increased latency, as shown in the figure below.


Thanks to a network architecture specialized for the Edge TPU hardware, the EfficientNet-EdgeTPU-S/M/L models outperform EfficientNet (B1), ResNet, and Inception in both latency and accuracy. In particular, our EfficientNet-EdgeTPU-S achieves higher accuracy while running 10x faster than ResNet-50

Note: ResNet link 

https://arxiv.org/abs/1512.03385

Inception link

https://arxiv.org/abs/1602.07261
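The compound scaling step can be sketched numerically. The coefficients below are the ones published for the original EfficientNets (Tan & Le, 2019); whether the EdgeTPU variants reuse exactly these values is an assumption here.

```python
# Published EfficientNet scaling coefficients; their use for the
# EdgeTPU -M/-L variants is assumed for illustration.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Scale network depth, width, and input resolution jointly
    by a single compound coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    return depth, width, resolution

# The coefficients satisfy alpha * beta^2 * gamma^2 ~ 2, so each
# increment of phi roughly doubles the FLOPs:
d, w, r = compound_scale(1)
flops_multiplier = d * w ** 2 * r ** 2  # ~1.92
```

Scaling all three dimensions together, rather than depth or width alone, is what lets the larger -M and -L models convert their extra latency budget into accuracy efficiently.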

Interestingly, the NAS-generated models use conventional convolutions extensively in the initial part of the network, where depthwise-separable convolutions tend to be less effective than conventional ones when executed on the accelerator. This highlights the fact that trade-offs commonly made when optimizing models for general-purpose CPUs (such as reducing the total number of operations) are not necessarily the best choices for hardware accelerators. Moreover, these models achieve high accuracy without relying on exotic operations. Compared with other image classification models such as Inception-ResNet-v2 and ResNet-50, the EfficientNet-EdgeTPU models are not only more accurate but also run faster on the Edge TPU.

This is our first experiment in using AutoML to build accelerator-optimized models. AutoML-based model customization can be extended not only to a wide variety of hardware accelerators but also to many different applications that rely on neural networks.

From Cloud TPU training to Edge TPU deployment

We have published the training code and pre-trained models for EfficientNet-EdgeTPU in the GitHub repository. We use TensorFlow's post-training quantization tool to convert the floating-point trained models into Edge TPU-compatible integer-quantized models. These models quantize very well after training, with minimal loss in accuracy (about 0.5%). A script for exporting the quantized model from a training checkpoint is linked at the end of this article. For updates on the Coral platform, see the post on the Google Developers Blog, and for full reference materials and detailed instructions, see the Coral website.
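A post-training full-integer quantization pass of the kind described above can be sketched with the TensorFlow Lite converter. The tiny stand-in Keras model, the input shape, and the random calibration data are placeholders; in practice you would load the EfficientNet-EdgeTPU checkpoint from the repository and calibrate with real training images.

```python
import numpy as np
import tensorflow as tf

# Stand-in model; replace with the actual EfficientNet-EdgeTPU
# checkpoint from the GitHub repository in real use.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

def representative_dataset():
    # A handful of samples lets the converter calibrate activation
    # ranges; real calibration should use actual training images.
    for _ in range(16):
        yield [np.random.rand(1, 32, 32, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full-integer quantization, as required by the Edge TPU:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()  # serialized .tflite flatbuffer (bytes)
```

The resulting `.tflite` file is then compiled for the Edge TPU with Coral's tooling before deployment.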

Thanks

Special thanks to Quoc Le, Hongkun Yu, Yunlu Li, Ruoming Pang, and Vijay Vasudevan of the Google Brain team; Bo Wu, Vikram Tank, and Ajay Nair of the Google Coral team; and Han Vanholder, Ravi Narayanaswami, John Joseph, Dong Hyuk Woo, Raksit Ashok, Jason Jong Kyu Park, Jack Liu, Mohammadali Ghodrat, Cao Gao, Berkin Akin, Liang-Yun Wang, Chirag Gandhi, and Dongdong Li of the Google Edge TPU team.

If you want to learn more about EfficientNet-EdgeTPU, please refer to the following documents. These documents delve into many of the topics mentioned in this article:
  • EfficientNets 

    https://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html

  • Edge TPU

    https://coral.withgoogle.com/docs/edgetpu/faq/

  • Coral Dev Board

    https://coral.withgoogle.com/products/dev-board

  • USB accelerator

    https://coral.withgoogle.com/products/accelerator

  • GitHub code repository

    https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/edgetpu

  • Post-training quantization tool

    https://www.tensorflow.org/model_optimization/guide/quantization

  • Model export script

    https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/edgetpu/README.md

  • Coral platform updates

    https://developers.googleblog.com/2019/08/coral-summer-updates-post-training.html

  • Coral website

    https://coral.withgoogle.com/docs/
