The most complete OCR technical guide of 2023! Pre-trained large OCR models are coming soon

OCR is a technology that extracts text from images or scanned documents and converts it into a machine-readable format, automating a process that would otherwise require extensive manual entry. This capability is particularly useful wherever the data needs further processing, such as identity verification, expense management, automatic reimbursement, and business processing. Today, OCR solutions combine artificial intelligence (AI) and machine learning (ML) to automate the workflow and improve the accuracy of data extraction. This article reviews the past and present of the technology and looks at its staged development: a past ruled by traditional OCR, a flourishing present driven by deep learning OCR, and an imminent future of pre-trained large OCR models!

1. The past of OCR: the era ruled by traditional OCR technology

How Traditional OCR Technology Works

The way OCR works can be compared to the human ability to read text and recognize patterns. Traditional OCR automatically recognizes and extracts characters from images or documents using computer vision and pattern-recognition techniques, and proceeds through the following steps:

1. Image preprocessing

This stage enhances the quality of the image, including denoising, binarization (converting the image to black and white), and automatic correction of distortion and skew.

* The role of image preprocessing

In the optical character recognition (OCR) workflow, image preprocessing is the first step and lays the foundation for the accuracy and robustness of the entire system. It is therefore critical to understand the techniques used in image preprocessing and the order in which they are applied.

* Definition of image preprocessing

Image preprocessing is a technique for improving image data (removing useless information, enhancing useful information, or increasing computational speed) before the main image analysis is performed. It can enhance the image quality, make the OCR engine better separate the text and the background, and improve the recognition accuracy of the text.

* The main steps and techniques of image preprocessing

1. Denoising: various filters (for example, median or Gaussian filters) are used to reduce noise in the image, such as dust and scratches.

2. Grayscale conversion: convert a color image to grayscale. In most cases we only care about the contrast between text and background, not their colors, so grayscale conversion greatly reduces computational complexity while retaining the essential information.

3. Binarization: convert the image to pure black and white. Binarization is implemented by setting a threshold: all pixels below the threshold are marked black and all pixels above it are marked white, further enhancing the contrast between text and background.

4. Deskewing and correction: OCR systems need to automatically correct distortion and skew in images to ensure text is recognized correctly. This involves detecting the skew angle of the text lines and rotating the image accordingly (a code sketch covering steps 1 to 4 follows this list).

5. Region delineation: also known as layout analysis, this step identifies text regions, non-text regions, and structural information such as columns, rows, blocks, headings, paragraphs, and tables, providing the basis for the subsequent text-extraction steps.
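As an illustration, here is a minimal preprocessing sketch using OpenCV. It is a sketch under assumptions, not a production pipeline: the input file name `page.png` is hypothetical, and OpenCV's `minAreaRect` angle convention varies across versions, so the deskew sign may need adjusting.

```python
import cv2
import numpy as np

img = cv2.imread("page.png")                       # hypothetical input file
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)       # step 2: grayscale
denoised = cv2.medianBlur(gray, 3)                 # step 1: median-filter denoising
# step 3: binarize with Otsu's method, which picks the threshold automatically
_, binary = cv2.threshold(denoised, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# step 4: deskew by estimating the dominant angle of the dark (text) pixels
coords = np.column_stack(np.where(binary < 128)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
if angle < -45:        # map OpenCV's version-dependent angle into (-45, 45]
    angle += 90
elif angle > 45:
    angle -= 90
h, w = binary.shape
M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
deskewed = cv2.warpAffine(binary, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```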

*Importance of image preprocessing*

Good image preprocessing can significantly improve the efficiency and accuracy of subsequent steps. It improves image quality, better separates text and background, removes noise from images, corrects distortions and skews in images, recognizes structural information in text, and more. These are the key factors to ensure that the OCR system can accurately recognize and extract text. Therefore, an in-depth understanding and mastery of image preprocessing steps and techniques is crucial to building an efficient and accurate OCR system.

2. Character segmentation

Character segmentation is an important step in the OCR pipeline: it splits the text regions of the image into individual characters for recognition in subsequent steps. The following are its main steps and some common techniques.

*The main steps of character segmentation*

1. Line segmentation: the goal of this step is to split the text regions of the image into individual lines. Typically this is achieved by analyzing the horizontal projection histogram, obtained by accumulating the pixel values of each row. Between lines of text the accumulated value usually drops sharply, and these drops mark where the line splits occur.

2. Character segmentation: after line segmentation, each line of text is further split into individual characters, usually by analyzing the vertical projection histogram. Similar to the horizontal case, it is obtained by accumulating the pixel values of each column; between characters the accumulated value also drops sharply, and these drops mark where the character splits occur (a projection-based sketch follows).
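Below is a small illustrative sketch of projection-based segmentation in plain NumPy. It assumes a deskewed, binarized image `binary` with dark text on a white background, as produced by the preprocessing sketch above.

```python
import numpy as np

def split_on_gaps(profile):
    """Return (start, end) index runs where the projection profile has ink."""
    runs, start = [], None
    for i, has_ink in enumerate(profile > 0):
        if has_ink and start is None:
            start = i
        elif not has_ink and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

ink = (binary < 128).astype(int)        # 1 wherever there is text ink
line_profile = ink.sum(axis=1)          # horizontal projection: ink per row
for top, bottom in split_on_gaps(line_profile):
    line = ink[top:bottom]
    char_profile = line.sum(axis=0)     # vertical projection: ink per column
    char_boxes = split_on_gaps(char_profile)   # (left, right) per character
```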

*Common problems and solutions in character segmentation*

In the process of character segmentation, there are some common problems, such as character sticking and disconnection. These issues may cause characters not to be segmented correctly, thus affecting the accuracy of OCR.

1. Character sticking: sometimes two or more characters in an image are so closely connected that they form a shape resembling a single character. A common remedy is to separate the touching characters with morphological operations, for example using thinning or skeletonization to extract the character centerlines and then splitting along them.

2. Character disconnection: sometimes a character is broken into two or more parts by noise or other causes. A common remedy is to reconnect the parts with morphological operations, for example using dilation or closing to bridge the gaps and fill holes in the strokes (a brief sketch of both repairs follows).
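A brief, hedged sketch of both repairs with OpenCV morphology. It assumes the binarized image is first inverted to white text on black, since OpenCV's morphological operators grow and shrink the white foreground.

```python
import cv2
import numpy as np

white_text = cv2.bitwise_not(binary)    # invert: OpenCV morphology grows white
kernel = np.ones((3, 3), np.uint8)

# Disconnection repair: closing (dilate, then erode) bridges small breaks
# in strokes and fills holes.
repaired = cv2.morphologyEx(white_text, cv2.MORPH_CLOSE, kernel)

# Sticking repair: a light erosion thins strokes so touching characters
# separate; real systems often use thinning or skeletonization instead.
separated = cv2.erode(white_text, kernel, iterations=1)
```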

Overall, character segmentation is a key step in OCR. Only when the characters in the image are accurately segmented can the OCR system correctly identify and extract them. Therefore, an in-depth understanding of the steps and techniques of character segmentation is crucial to building an efficient and accurate OCR system.

3. Character recognition

In this step, the segmented regions of the image or document are recognized character by character. This traditionally involves matrix matching (comparing each character against a library of character templates) and feature recognition (identifying text patterns and character features in the image).

*Character recognition technology*

Character recognition is a critical step in the optical character recognition (OCR) workflow. In this step, the system needs to recognize each individual character obtained by segmentation. The following are the main techniques and steps in the character recognition stage, especially in traditional OCR systems.

*Feature extraction*

Feature extraction is the first step in character recognition, and its purpose is to extract features that can reflect its main shape and structure from each character image. These characteristics can help distinguish different characters. In traditional OCR systems, common feature extraction methods include:

  1. *Gray-Level Co-occurrence Matrix (GLCM)*: a statistical method for extracting texture features from images, including contrast, correlation, energy, and homogeneity.
  2. *Hu invariant moments*: a set of features that are invariant to image translation, scaling, and rotation.
  3. *Fourier descriptors*: features extracted from the shape of a character, especially its boundary.
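As an illustration, the sketch below computes Hu moments with OpenCV and GLCM texture properties with scikit-image; `char_img` is a hypothetical 8-bit grayscale crop of a single segmented character.

```python
import cv2
from skimage.feature import graycomatrix, graycoprops

# Hu invariant moments: seven values robust to translation, scale, rotation.
hu = cv2.HuMoments(cv2.moments(char_img)).flatten()

# GLCM texture properties: contrast, correlation, energy, homogeneity.
glcm = graycomatrix(char_img, distances=[1], angles=[0], levels=256,
                    symmetric=True, normed=True)
texture = [graycoprops(glcm, prop)[0, 0]
           for prop in ("contrast", "correlation", "energy", "homogeneity")]
```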

*Character classification*

After the features are extracted, the next step is to use these features to classify the characters. In traditional OCR systems, the most common classifier is a Support Vector Machine (SVM).

  1. *Support Vector Machine (SVM)*: a supervised learning model that performs classification by finding the decision boundary that maximizes the margin between classes.

Training a classifier requires a character set labeled with ground-truth class labels. At recognition time, the classifier outputs a category label for the input features, and this label is the recognition result.
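A minimal scikit-learn sketch of this training and prediction loop; the feature matrix `X` and label vector `y` are hypothetical placeholders for the extracted character features and their ground-truth labels.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X: (n_samples, n_features) character features; y: ground-truth labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = SVC(kernel="rbf")          # an RBF kernel is a common default
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)  # one label per test character
```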

*Performance evaluation*

After the character recognition is completed, the performance of the system needs to be evaluated. Commonly used performance metrics include accuracy, precision, recall, and F1 score. These metrics can help us understand how a classifier performs under different conditions so that it can be optimized and improved.
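Continuing the sketch above, these metrics can be computed with scikit-learn from the held-out labels `y_test` and the classifier's `predicted` output.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

accuracy = accuracy_score(y_test, predicted)
# macro-averaging treats every character class equally, which suits
# imbalanced character sets
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, predicted, average="macro")
```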

Limitations of Traditional OCR

While traditional Optical Character Recognition (OCR) technology works quite well in many scenarios, it does have some limitations, especially in complex or challenging situations. Here are some major limitations:

1. High requirements for clarity and quality: traditional OCR is highly dependent on image quality. If the input image is poor (e.g., blurry, low contrast, unevenly lit, or noisy), accuracy may drop sharply.

2. Dependence on fonts and layouts: traditional OCR techniques are usually trained on specific fonts and layouts. If the input text uses a font not covered by the training data or has a different layout, recognition accuracy may suffer.

3. The challenge of complex backgrounds and decorated text: if characters blend into the background, or the text sits on a complex background, traditional OCR systems may struggle to segment and recognize it accurately. Likewise, decorated characters or WordArt-style rendering may defeat recognition.

4. Difficult handwriting recognition: handwritten text poses a greater challenge, because its shape, size, and slant vary widely and characters often lack clear boundaries.

5. Inability to handle multiple languages and special characters: traditional OCR systems are usually optimized for one or a few languages, and may not handle other languages or special characters, such as mathematical or musical symbols, satisfactorily.

6. Lack of context understanding: traditional OCR treats character recognition as an isolated task without considering the surrounding context. If a character is blurred in the image, the system may be unable to recognize it.

In general, although traditional OCR technology performs very well in some scenarios, when dealing with complex or challenging tasks, the limitations of this technology will be exposed. This is why more and more researchers have begun to explore the use of more advanced techniques such as deep learning to improve OCR systems.

2. The present of OCR: the era of deep learning OCR technology

Traditional OCR technology is not ideal when dealing with complex images and irregularly shaped text. In the era of deep learning, machines can "learn" to handle complex tasks and have good adaptability to data. By combining deep learning to build a more powerful and flexible OCR model, it can process various types of text and improve the accuracy of character recognition.

Deep learning OCR technology is divided into two steps: text detection and text recognition.

Deep Learning Text Detection

Proposal-based (candidate-box) methods: the example of Faster R-CNN

Faster R-CNN (Faster Region-based Convolutional Neural Network) is a deep learning model for object detection. It uses a Region Proposal Network (RPN) to find regions of the image where targets may exist, and then a convolutional network extracts features from and classifies those regions. It achieves both high computing speed and accurate detection when processing image data.

In the OCR (Optical Character Recognition) scenario, Faster R-CNN can be used to locate text content in images. It can handle all forms of text, including printed, handwritten, and even unstructured text. Because the pipeline works in two stages (first locating text regions, then recognizing the text), it is efficient and accurate on text-recognition tasks in complex scenes.

Paper: https://arxiv.org/pdf/1506.01497.pdf (Faster R-CNN)

Technical Description

For text region detection, Faster R-CNN generates candidate text regions via the RPN. The RPN is a fully convolutional network that can propose potential text regions anywhere in the image, which is important for complex images, especially those containing many text regions of different sizes and complicated layouts.

Technical Steps

**Region proposal:** use the RPN to generate candidate text regions on the preprocessed image.

**Feature extraction and classification:** extract features from and classify each proposed region. Because the network shares features across regions, it greatly improves computational efficiency without sacrificing accuracy.

**Post-processing:** process the model output, including merging, deduplicating, and ordering the detected text regions, and return the detection and recognition results to the user.

**Continuous learning and optimization:** collect feedback data from the model's performance in real applications, and continuously retrain and optimize the model to improve its performance in complex scenes. (A minimal detection sketch follows.)
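For illustration, the sketch below runs torchvision's off-the-shelf Faster R-CNN. The COCO-pretrained weights detect generic objects, not text; a real OCR detector would be fine-tuned on a text-detection dataset, and the file name `document.png` is hypothetical.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("document.png").convert("RGB"))
with torch.no_grad():
    output = model([image])[0]          # dict with boxes, labels, scores

boxes = output["boxes"][output["scores"] > 0.5]   # keep confident regions
```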

Segmentation-based methods: the example of Mask R-CNN

Mask R-CNN is a deep-learning object-detection model whose main feature is performing object detection and pixel-level image segmentation simultaneously. The model adds a parallel segmentation branch to Faster R-CNN and can output the class, location, and shape of each target.

In the OCR scenario, Mask R-CNN can be used to detect and segment text precisely. Because it not only recognizes the text regions in an image but also gives their exact shape and position, it is especially suitable for text images with complex layouts and shapes.

Technical Description

The application of Mask R-CNN in OCR scenarios mainly involves text region detection and shape segmentation.

First, like Faster R-CNN, Mask R-CNN generates candidate text regions via the RPN. Then, for each proposed region, Mask R-CNN performs not only the classification and regression tasks of Faster R-CNN but also an additional, parallel pixel-level segmentation task.

In OCR, this segmentation task can generate precise shape and position information for the text, which is valuable for text images with complex layouts and shapes, such as free-form, vertically arranged, or slanted text.

Technical Steps

**Region proposal:** use the RPN to generate candidate text regions on the preprocessed image.

**Feature extraction, classification, and segmentation:** for each proposed region, Mask R-CNN simultaneously performs feature extraction, classification, and pixel-level segmentation, yielding the category, position, and precise shape of each character.

**Post-processing:** process the model output, including merging, deduplicating, and ordering the detected text regions, and generate precise shape and position information from the segmentation results.

**Continuous learning and optimization:** collect feedback data from real applications and continuously retrain and optimize the model to improve its performance in complex scenes. (A minimal sketch follows.)
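A matching sketch with torchvision's Mask R-CNN; the same caveats apply (COCO weights stand in for a text-tuned model), and `image` is assumed to be the tensor prepared in the Faster R-CNN sketch above.

```python
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

with torch.no_grad():
    output = model([image])[0]

keep = output["scores"] > 0.5
masks = output["masks"][keep, 0] > 0.5   # boolean per-pixel mask per region
boxes = output["boxes"][keep]
```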

Deep Learning Text Recognition

When we talk about the technical routes of deep learning OCR text recognition, there are several main directions: CTC-based decoding, Attention-based decoding, character-segmentation-based methods, and, more recently, Transformer-based methods.

CTC-based decoding method

Imagine listening to a piece of audio and needing to transcribe the dialogue into text. This requires a system capable of translating sounds into characters in chronological order, which is the idea behind CTC (Connectionist Temporal Classification). CTC solves the problem of converting an input of fixed temporal length (audio, or an image feature sequence) into text of variable length.

CTC (Connectionist Temporal Classification) is a special decoding method for sequential problems. In the OCR task, it can help us establish a mapping relationship between fixed-dimensional time-series features and non-fixed-dimensional outputs (for example: text strings). So how exactly does it work?

Technical Description

The key innovation of CTC is the introduction of a special symbol, usually referred to as the "blank" character. This character has no semantic meaning, but it plays a key role in training the model.

Specifically, when we train a model, the input (such as an image) must be aligned with an output (such as a character sequence). In the OCR problem, however, the width of the input image (and hence the length of its feature sequence) is fixed for a given image, while the number of output characters varies, which creates an alignment problem between input and output.

CTC solves this problem by introducing the blank character. At training time, the model predicts, at every time step, a probability for each possible character as well as for the blank. A process called "decoding" then generates the final character sequence from these predicted probabilities.

Technical Steps

When we use the CTC-based decoding method to deal with the OCR problem, the following technical steps are generally adopted:

1. Feature extraction: first, extract useful features from the input image, usually with a deep learning model such as a CNN. The image width is divided into small slices ("time steps"), and a feature vector is generated for each slice.

2. Sequence prediction: these feature vectors are then fed into a Recurrent Neural Network (RNN), which predicts, for each time step, a distribution over the characters plus the blank.

3. CTC decoding: finally, the CTC decoding algorithm generates the final character sequence from the predicted probabilities. Here the blank plays an important role: it marks boundaries between repeated characters and fills time steps where no character is present. (A minimal greedy decoder is sketched below.)
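As a concrete illustration, here is a minimal greedy CTC decoder: take the most likely class at each time step, collapse repeats, and drop blanks. It assumes class 0 is the blank and `charset` holds the remaining classes in order.

```python
def ctc_greedy_decode(logits, charset, blank=0):
    """logits: (time_steps, num_classes) score array; charset: non-blank
    characters, where class label i (i >= 1) maps to charset[i - 1]."""
    best = logits.argmax(axis=1)   # most likely class per time step
    decoded, prev = [], None
    for label in best:
        # emit only on label changes, and never emit the blank
        if label != prev and label != blank:
            decoded.append(charset[label - 1])
        prev = label
    return "".join(decoded)
```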

In this decoding method, the CRNN+CTC model is a very typical representative. CRNN (Convolutional Recurrent Neural Network) combines the features of Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) to efficiently extract features from images and perform sequence prediction.

It is worth noting that although CTC-based decoding has clear advantages for fixed-length input with variable-length output, it does not make full use of context when predicting each character, so its accuracy may degrade on irregularly shaped text such as curved or handwritten text.

Attention-based decoding method

When we read, we focus on specific parts of the text and ignore less important information, constantly searching the context for what matters. This is the idea behind the Attention mechanism.

Attention-based decoding is a method widely used in deep learning, especially when dealing with sequential problems, such as machine translation and OCR, it performs well. It's called "Attention" because it mimics the human tendency to focus on key parts of information.

Technical Description

The basic idea of the Attention mechanism is that, when making a prediction, the model should "focus" on the most relevant part of the input. In the context of OCR, this means that when predicting a character, the model should focus on the regions of the image most relevant to that character.

The Seq2Seq+Attention model is a typical model based on Attention. This model usually consists of two parts: an encoder (Encoder) and a decoder (Decoder). The task of the encoder is to convert the input image into a set of feature vectors. The task of the decoder is to convert these feature vectors into character sequences.

Unlike the traditional Seq2Seq model, the decoder here uses the Attention mechanism to decide which feature vectors to attend to when generating each character. In other words, the model "focuses" on the features most helpful for the current prediction.

Technical Steps

Using the Attention-based decoding method to deal with the OCR problem generally adopts the following technical steps:

1. Feature extraction: first, an encoder (usually a deep neural network such as a CNN) converts the input image into a set of feature vectors.

2. Sequence prediction: then a decoder (usually a recurrent network such as an RNN or LSTM) converts these feature vectors into a character sequence. When generating each character, the decoder uses the Attention mechanism to decide which feature vectors to attend to. (A sketch of this attention step follows.)

3. Attention decoding: through this process the model generates a sequence of characters that together form the final text. Because each prediction step depends on the context of all previous steps, this method usually performs better on complex and irregular text.
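The heart of this mechanism can be sketched in a few lines of PyTorch: the current decoder state scores every encoder feature vector, and the softmax-weighted sum becomes the context for predicting the next character. This is a minimal dot-product variant, not any specific paper's formulation.

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_features):
    """decoder_state: (d,) current state; encoder_features: (positions, d)."""
    scores = encoder_features @ decoder_state           # one score per position
    weights = F.softmax(scores / decoder_state.shape[0] ** 0.5, dim=0)
    context = weights @ encoder_features                # weighted sum, shape (d,)
    return context, weights   # context feeds the next character prediction
```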

Although Attention-based decoding performs better on irregularly shaped text, such as curved or handwritten text, its effectiveness may drop when the text is very long or very short. In addition, because the model must consider all of the context, its computational cost is relatively high, which is another point to keep in mind.

Character-Based Segmentation Methods

When we read words, we can read them character by character. This approach is effective for curved and irregular text, but it presupposes that each character is accurately labeled. This is the character-segmentation-based method.

In the field of OCR, the method based on character segmentation is a more traditional solution. Its core idea is to decompose the OCR problem into two sub-problems: character detection and character recognition. This method has certain advantages in dealing with curved text and irregular text, but has higher requirements for character labeling.

Technical Description

The method based on character segmentation first uses image processing technology to segment each character in the image, and then recognizes each character individually. The advantage of this approach is that it can handle text of various shapes and sizes, especially curved and irregular text. And, since each character is handled individually, it also handles inconsistent character spacing well.

However, this approach also has its limitations. Since it requires precise positioning and segmentation of each character, it has high requirements for character labeling. In practical applications, due to various interference factors (such as lighting, background noise, font style, etc.), it is difficult to achieve completely accurate character segmentation.

Technical Steps

Using the method based on character segmentation to deal with the OCR problem generally adopts the following technical steps:

1. Character detection: first, a character detection algorithm (such as a sliding window or a region-based method) locates and segments each character in the image. This usually relies on extensive image processing, such as edge detection and morphological operations.

2. Character recognition: then each segmented character is recognized, for example with a classifier such as an SVM or a deep neural network. Each character is recognized individually and the results are combined into the final text.

3. Character ordering: after all characters are recognized, they must be arranged into the correct reading order, usually via spatial relationships (e.g., left to right, top to bottom) or sequence models (e.g., HMMs). (A detection-and-ordering sketch follows.)
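The sketch below illustrates steps 1 and 3 with OpenCV connected components: find character blobs, filter out tiny noise, and sort them into rough reading order. It assumes the inverted white-on-black image `white_text` from the morphology sketch earlier, and the area threshold and row-bucket height are arbitrary assumptions.

```python
import cv2

contours, _ = cv2.findContours(white_text, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
boxes = [cv2.boundingRect(c) for c in contours]          # (x, y, w, h)
boxes = [b for b in boxes if b[2] * b[3] > 20]           # drop tiny noise blobs
# sort into rough reading order: bucket rows by y, then left to right
boxes.sort(key=lambda b: (b[1] // 40, b[0]))

chars = [white_text[y:y + h, x:x + w] for x, y, w, h in boxes]
# each crop would then be fed to a classifier (an SVM or a small CNN)
```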

In this process, the positioning, segmentation and recognition of characters are all key steps, and the results of each step will directly affect the final OCR performance. Therefore, although the method based on character segmentation has its advantages in dealing with some complex texts, it also needs to weigh its complexity and accuracy in practical applications.

Transformer-based method

The Transformer model has shown great potential in NLP in recent years, and its excellent performance has drawn the attention of the OCR field as well. Transformer-based methods provide a new way to attack the OCR problem and can overcome the limitations of CNNs in modeling long-range dependencies.

 

Technical Description

The core of the Transformer model is the Self-Attention mechanism, which enables the model to have a global perspective on each element when processing sequence data. In OCR problems, this means that the model can simultaneously consider all regions in the image when predicting a character, not just local regions.

The Transformer model usually consists of two parts: an encoder (Encoder) and a decoder (Decoder). The task of the encoder is to convert the input image into a set of feature vectors. The task of the decoder is to convert these feature vectors into character sequences. It is worth noting that due to the self-attention mechanism, the encoder and decoder can consider all feature vectors or characters when processing each feature vector or character.

Technical Steps

Using Transformer-based methods to deal with OCR problems generally adopts the following technical steps:

1. Feature extraction: first, an encoder (usually a deep neural network such as a CNN) converts the input image into a set of feature vectors.

2. Sequence prediction: then a Transformer-based decoder converts these feature vectors into a character sequence. When generating each character, the decoder uses self-attention to decide which feature vectors to attend to.

3. Character combination: finally, the decoder assembles the generated characters into the final text. Because the Transformer attends over all feature vectors at every step, this method usually achieves better results on complex and irregular text. (A minimal recognizer head is sketched below.)
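As a minimal sketch, the PyTorch module below shows the shape of such a decoder: embedded character tokens attend to encoder features under a causal mask. Layer sizes are arbitrary assumptions, and a real recognizer would add positional encodings and training logic.

```python
import torch
import torch.nn as nn

class TransformerRecognizer(nn.Module):
    def __init__(self, num_chars, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(num_chars, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, num_chars)

    def forward(self, char_ids, memory):
        # char_ids: (batch, seq) previously generated character tokens
        # memory:   (batch, positions, d_model) encoder feature vectors
        seq_len = char_ids.shape[1]
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")),
                            diagonal=1)   # block attention to future tokens
        hidden = self.decoder(self.embed(char_ids), memory, tgt_mask=causal)
        return self.out(hidden)   # logits over the character set per position
```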

Overall, the Transformer-based approach is a very promising way to tackle the OCR problem. It not only overcomes the limitations of CNNs in modeling long-range dependencies, but, thanks to self-attention, also performs well on complex text. However, because the Transformer is computationally heavy, practical applications must balance computing resources against model performance.

3. The future of OCR: the imminent era of pre-trained large OCR models

Currently, pre-trained large models in NLP and CV (such as OpenAI's GPT and Meta's SAM) have shown strong performance. By pre-training on large amounts of unlabeled data, large models learn rich visual and language features, which greatly improves their performance on downstream tasks. Research in this area is developing rapidly, and some studies suggest that large pre-trained models enhanced with joint character-level and field-level multimodal text features hold great potential for OCR tasks.

Looking forward to the future, we expect that the pre-trained large model can further improve the performance of OCR, especially in dealing with multilingual, complex scenes, long text and other issues. At the same time, it is also necessary to study how to reduce the computing resource consumption of the model while ensuring the performance, so that these models can be applied in a wider range of devices and scenarios.

TextIn.com (Hehe Information) has been focusing on the field of intelligent text processing for 15 years.

 

Source: https://blog.csdn.net/INTSIG/article/details/131687641