The CRNN Network Structure in Detail

I. CRNN Introduction

Key point: you must read the original papers!!! If your English is good, read the original papers directly instead of hunting around for second-hand material. If your English is poor (like the author's), read the Chinese materials first, then the original papers.

Brief introduction

CRNN comes from the paper An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition. Although it is described as end-to-end, it is not end-to-end in the strict sense; it is only a recognition network.

An end-to-end network in the strict sense: Fast Oriented Text Spotting with a Unified Network.

As shown below, CRNN takes already-detected text as input and performs character recognition on it.

Figure 1-1

What does "not strictly end-to-end" mean?

Figure 1-2 shows traditional character recognition: each character has to be segmented first and then recognized one by one (the traditional method is simple and is not covered here).

Figure 1-2

CRNN, by contrast, takes the image as input and directly produces the result.

The network

The CRNN network structure is shown in Figure 1-3:

Figure 1-3

  • Feature Extraction

Ordinary image feature extraction, with the extracted features output in sequential form. Readers who find this unclear can see "RNN handwritten-digit recognition training" in the references.

  • BLSTM

The features are fed into the BLSTM, which outputs a value for each sequence step (the value represents the possible classes at that step); after a softmax, the outputs become the probability of each possible class.

  • CTC

This corresponds to the loss: it computes the probability of the actual label from the per-step output probabilities. The details are described later (a code sketch of the whole pipeline follows this list).

  • Innovations
  1. A bidirectional LSTM (BLSTM) extracts sequential image features, making sequence recognition effective.
  2. CTC loss is brought over from speech recognition into image recognition, a qualitative leap.
  • Weaknesses
  1. The network is complex; BLSTM and CTC in particular are hard to understand and expensive to compute.
  2. Because sequential features are used, text at large angles is hard to recognize.
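To tie the three parts together, here is a minimal sketch of a CRNN-style forward pass in PyTorch. The layer sizes and the MiniCRNN name are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class MiniCRNN(nn.Module):
    """Minimal CRNN sketch: CNN features -> sequence -> BLSTM -> per-step scores."""
    def __init__(self, num_classes=27, hidden=256):
        super().__init__()
        # Toy convolutional backbone (the paper uses a deeper VGG-style stack).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        # Bidirectional LSTM over the width (sequence) dimension.
        self.blstm = nn.LSTM(input_size=128 * 8, hidden_size=hidden,
                             num_layers=2, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)  # 2x: both directions

    def forward(self, x):                # x: (B, 1, 32, W), fixed height 32
        f = self.cnn(x)                  # (B, 128, 8, W/4)
        b, c, h, w = f.shape
        seq = f.permute(3, 0, 1, 2).reshape(w, b, c * h)  # (T, B, C*H)
        out, _ = self.blstm(seq)         # (T, B, 2*hidden)
        return self.fc(out)              # (T, B, num_classes)

# Training would apply log_softmax to this output and feed it to nn.CTCLoss.
```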

II. CRNN in Detail: Feature Extraction

Figure 2-1

Figure 2-1 shows what the extracted features are assumed to look like (drawn as slices for readability; it is certainly not a real feature map).

After the image passes through VGG feature extraction we get an ordinary feature map, which is then split as described above to form the feature sequence!

If your text is strongly slanted or vertical, you would have to split the features vertically into a sequence instead!
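A small sketch of this slicing, assuming a hypothetical VGG-style output whose height has already been pooled down to 1 (all shapes here are illustrative):

```python
import torch

# Hypothetical backbone output: batch 4, 512 channels, height 1, width 25
# (CRNN pools the height away so that each column is one sequence step).
feat = torch.randn(4, 512, 1, 25)

b, c, h, w = feat.shape
# Each of the 25 columns becomes one step of the feature sequence.
seq = feat.permute(3, 0, 1, 2).reshape(w, b, c * h)
print(seq.shape)  # torch.Size([25, 4, 512])
```

For vertical text you would slice along the height dimension instead.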

III. CRNN in Detail: BLSTM

Readers who do not know the basic principle can look at the tutorial in the references.

Once the principle is understood, this part is fairly simple (simple to understand, very hard to implement). The author only covers a few points that are hard to grasp in practice.

Figure 3-1

  • Number of RNN input sequence steps

From the figure above, the inputs are X1 through X6, six sequence steps in total.

  • Number of RNN layers

Figure 3-2

As Figure 3-2 above shows, the network consists of five layers.

  • Number of RNN neurons

Figure 3-3

This figure is borrowed from a Zhihu expert: the sequence has 4 steps and there are 3 layers (excluding input and output you could also call it 1 layer; counted the way we normally count CNN layers, it is 3).

As the figure shows, every sequence step passes through the same network. The figure draws 24 hidden-layer neurons, but because the RNN shares weights across steps, the number of distinct neurons is 6.

  • Length of a single sequence step

Taking the Zhihu figure above as an example, each input step has length 8.

If this network were an RNN recognizing handwritten digits, the image width would be 4 and the height 8.

Note: the number of input steps and the length of each step are unrelated to the number of neurons!!! Picture how an RNN unrolls over time and this becomes clear.
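A small PyTorch sanity check of that note, with sizes mirroring the Zhihu figure (8 values per step, 6 neurons) and otherwise arbitrary: the layer fixes the per-step length and the neuron count, while the number of steps is free:

```python
import torch
import torch.nn as nn

# input_size = per-step length, hidden_size = neuron count; neither fixes
# the number of time steps, because the weights are shared across steps.
rnn = nn.RNN(input_size=8, hidden_size=6)

x = torch.randn(4, 1, 8)        # 4 time steps (X1..X4), batch size 1
out, h = rnn(x)
print(out.shape)                # torch.Size([4, 1, 6]): one 6-dim output per step

x_long = torch.randn(10, 1, 8)  # the same weights handle 10 steps just as well
out2, _ = rnn(x_long)
print(out2.shape)               # torch.Size([10, 1, 6])
```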

  • BLSTM

Figure 3-4

The author has only derived the unidirectional LSTM, not the BLSTM.

In fact, however RNNs are varied, even the best current variants such as the GRU, the changes are all just tricks inside the unit (cell).

The concrete derivation is nothing but the chain rule! It is recommended to derive the plain RNN first, then the LSTM; after that you will understand the BLSTM without deriving it.
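For reference, the standard LSTM cell equations; differentiating them for backpropagation is exactly the repeated chain rule mentioned above:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

A BLSTM simply runs one such cell left-to-right and a second one right-to-left, concatenating the two hidden states at each step.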

IV. CRNN in Detail: CTC

There are many descriptions of CTC online, and they explain it fairly clearly. Here the author mainly discusses a few points that were hard to understand when reading the theory (they took a long time to figure out).

So what exactly is CTC?

  • Let us first look at an ordinary CNN classification network:
    Figure 4-1

This is an iris-flower classification network: the input is an image, and the output is the class probabilities after a softmax.

So what are this network's labels???

Figure 4-2

Making labels requires encoding (the class names are digitized into numeric codes), and testing requires decoding (the output numbers are turned back into class names).

This is simple and readers should all understand it: code exists so the computer can understand; encoding exists so the neural network can understand.
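A toy illustration of that encode/decode pair, using the iris species from the example (the mapping itself is an arbitrary assumption):

```python
# Class names <-> integer codes for the iris example.
classes = ["setosa", "versicolor", "virginica"]
encode = {name: i for i, name in enumerate(classes)}  # label -> number
decode = {i: name for i, name in enumerate(classes)}  # number -> label

print(encode["versicolor"])  # 1: what the network is trained against
print(decode[2])             # 'virginica': turning a prediction back into a class
```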

  • So how does CRNN encode its labels?

Figure 4-3

Suppose the 26 English letters are to be recognized; then the number of classes = 27 (the letters plus one blank character).

Suppose the network output is fixed at 50 sequence steps (readers who cannot follow this should review RNN handwritten-digit recognition). If the step count is too small, training cannot fit and recognition will drop letters; if it is too large, training cannot fit and recognition will emit extra letters.
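One possible label encoding, assuming blank takes index 0 and 'a' through 'z' take 1 to 26 (the mapping is an assumption; any consistent one works):

```python
import string

BLANK = 0  # the CTC blank gets index 0 here by convention
char_to_idx = {c: i + 1 for i, c in enumerate(string.ascii_lowercase)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

def encode_label(text):
    """Turn a ground-truth string into the integer codes CTC trains against."""
    return [char_to_idx[c] for c in text]

print(encode_label("cat"))  # [3, 1, 20]; CTC aligns the 50 output steps
                            # to these 3 labels during training
```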

  • A small analogy

Figure 4-4

Think of CTC as a black box: it turns the many output steps into a single sequence, so the loss can be computed just as in CNN classification. Of course it is not that simple; CTC is fairly complex, and we will look at how this black box works later.
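The collapsing rule the black box applies is at least simple to state: merge consecutive repeats, then drop the blanks. A minimal sketch, writing the blank as '-':

```python
def ctc_collapse(path, blank='-'):
    """Merge consecutive repeats, then drop blanks (the CTC collapsing map)."""
    out = []
    prev = None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return ''.join(out)

print(ctc_collapse('hh-e-ll-lo'))  # 'hello': the blank keeps the double l
```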

CTC theoretical foundations

Note: the author will not give a detailed description here, since others have written it better: see "CTC explained in great detail" in the references.

This chapter focuses on the major difficulties the author encountered:

  • Training: forward-backward propagation

The author originally went and read the forward-backward theory for Markov models and did not really understand it (weak math foundations).

The CTC forward-backward propagation in this paper, by contrast, is fairly easy to understand.

Figure 4-5

This can in fact be understood as dynamic programming: it works recursively, taking one node as the center and recurring forward and backward. Viewed as dynamic programming it becomes quite simple. Readers who do not get it can do a few LeetCode DP problems to build intuition; a sketch of the forward pass follows.
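A sketch of the standard CTC forward (alpha) recursion, assuming probs is the (T, K) per-step softmax output and the label is non-empty; a real implementation would work in log space to avoid underflow:

```python
import numpy as np

def ctc_forward(probs, label, blank=0):
    """Return p(label | probs) via the CTC forward (alpha) dynamic program."""
    ext = [blank]                       # extended label: blank around each char
    for c in label:
        ext += [c, blank]
    T, S = probs.shape[0], len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]      # start on the leading blank ...
    alpha[0, 1] = probs[0, ext[1]]      # ... or on the first character
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                        # stay on the same symbol
            if s >= 1:
                a += alpha[t - 1, s - 1]               # advance one symbol
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]               # skip the blank between
            alpha[t, s] = a * probs[t, ext[s]]         # two different characters
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]   # end on blank or last char
```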

  • Inference: CTC Prefix Search Decoding and CTC Beam Search Decoding

The simplest traceback search

Enumerating every path and only then computing probabilities is clearly an exponential search; the efficiency is unacceptable.

Figure 5-6

Greedy algorithm + dynamic programming --- CTC Prefix Search Decoding:

Step one is the merge operation:

Figure 5-7

Step two outputs the maximum probability:

Figure 5-8
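To make the two steps concrete, here is a hypothetical brute-force version of "merge, then take the maximum": it enumerates every alignment path (exactly the exponential search criticized above), sums the probabilities of paths that collapse to the same label, and picks the label with the largest merged probability. Prefix search computes the same answer without enumerating everything:

```python
from itertools import product
import numpy as np

def collapse(path, blank=0):
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return tuple(out)

def merge_then_max(probs, blank=0):
    """Step 1: merge path probabilities per collapsed label.
    Step 2: output the label with the maximum merged probability."""
    T, K = probs.shape
    scores = {}
    for path in product(range(K), repeat=T):  # exponential: K**T paths
        p = float(np.prod([probs[t, k] for t, k in enumerate(path)]))
        lab = collapse(path, blank)
        scores[lab] = scores.get(lab, 0.0) + p
    return max(scores.items(), key=lambda kv: kv[1])

probs = np.array([[0.6, 0.4],   # T=2 steps, K=2 classes (blank, 'a')
                  [0.5, 0.5]])
print(merge_then_max(probs))    # ((1,), 0.7): 'a' wins after merging, even
                                # though the single best path is blank-blank
```

This toy example is also why the note below matters: the winner is the label with the largest merged probability, not the single most probable path.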

Extending CTC Prefix Search Decoding: CTC Beam Search Decoding

Figure 5-9

  • CTC Prefix Search Decoding is a greedy algorithm, so why can it obtain the optimal solution?

Look closely at the heading above: CTC Prefix Search Decoding specifically adds dynamic programming, and dynamic programming yields the optimal solution.

Because the CTC model assumes the sequence steps are independent of each other, maximizing at the current step also maximizes the whole sequence.

Note: it is the merged sequence probability that is maximized, not that of a single path!!! If it were the single largest path, this would be plain greedy decoding.

  • Why can CTC treat the steps as independent of each other? Text is sequential; surely there are dependencies between the characters?

To answer this, look back at the network: it uses a BLSTM, so the sequential dependencies have already been exploited. By the time the outputs reach CTC, the sequence information has already been used.

One has to admire the RNN + CTC design; it was first used in speech recognition.

In fact, thinking back: if the CTC steps were dependent, the forward-backward probabilities could not use the Markov-style factorization (which requires independence), CTC Prefix Search Decoding could not be used either, and only the simplest traceback search would remain. With efficiency that low, how could CTC ever have been widely adopted?

V. References

CRNN original paper

CTC paper

Deep learning notes

RNN diagram illustrations

PyTorch implementation of CRNN (by 冠军的试炼)

Understand CRNN + CTC text recognition in one article

Perspective correction network

A simplified explanation of CTC

Zhihu explanation of beam search

CTC explained in great detail

A short note on CTC

The well-known foreign explanation (most articles are direct translations of it)

An expert's explanation of the Markov forward-backward computation

Step-by-step CTC explanation with code implementation

Quick read of distortion-correction papers

CVPR 2019 paper summary

ROI Align implementation details
