PP-Structure—Table Data Extraction

Table of Contents

Introduction

Features

Results

Table Recognition

Layout Analysis and Table Recognition

Layout Restoration

Key Information Extraction

Quick Start

1. Prepare the Environment

1.1 Install PaddlePaddle

1.2 Install PaddleOCR whl package

2. Quick Use

3. Easy to Use

3.1 Command-Line Usage

3.2 Python Script Usage

3.3 Description of Returned Results

Analysis Summary


Introduction

PP-Structure is an intelligent document analysis system developed by the PaddleOCR team. It aims to help developers complete document understanding tasks such as layout analysis and table recognition.

The flow chart of the PP-StructureV2 system is shown below:

  • The document image first passes through the image correction module, which judges the orientation of the whole image and corrects it.
  • Afterwards, two types of tasks can be completed: layout information analysis and key information extraction.

In the layout analysis task, the image first passes through the layout analysis model, which divides it into regions such as text, tables, and figures. These regions are then processed separately: for example, a table region is sent to the table recognition module for structured recognition, and a text region is sent to the OCR engine for text recognition. Finally, the layout recovery module restores the result to a Word or PDF file consistent with the original image layout.

In the key information extraction task, the OCR engine first extracts the text content, the semantic entity recognition module then identifies the semantic entities in the image, and finally the relation extraction module determines the relationships between those entities, thereby extracting the required key information.

The following sample pictures are taken from the official website.

Features

The main features of PP-StructureV2 are as follows:

  • Supports layout analysis of documents in image/PDF form, dividing them into regions such as text, titles, tables, figures, and formulas;
  • Supports common Chinese and English table detection tasks;
  • Supports structured recognition of table regions, with the final result output as an Excel file;
  • Supports multimodal Key Information Extraction (KIE) tasks, namely Semantic Entity Recognition (SER) and Relation Extraction (RE);
  • Supports layout recovery, that is, restoring the result to a Word or PDF file consistent with the original image layout;
  • Supports custom training as well as inference and deployment methods such as the Python whl package, all easy to use;
  • Integrates with the semi-automatic data labeling tool PPOCRLabel, supporting annotation for three tasks: layout analysis, table recognition, and SER.

Results

PP-StructureV2 allows each module to be used independently or combined flexibly. For example, layout analysis or table recognition can be used alone. Here we only show the visualization results of a few representative usage modes.


Table Recognition

Difficulties:

The two pictures above show the table recognition results from the official website. My own recognition results are not as good as these.

My table has more than a dozen columns and a lot of data, which may be part of the reason.

But even when I cut the table in half so that the remaining columns are similar to the pictures above, the result is still not great, and I don't know what the problem is.

Since I am using the same model, I wonder whether the official examples were annotated separately with PPOCRLabel.

Layout Analysis and Table Recognition

Layout Restoration

The figure below shows the effect of layout restoration based on the results of layout analysis and table recognition in the previous section.

Key Information Extraction

Boxes of different colors in the figure represent different categories.

Quick Start

1. Prepare the Environment

1.1 Install PaddlePaddle

If you do not yet have a basic Python environment, please refer to Operating Environment Preparation.

  • If your machine has CUDA9 or CUDA10 installed, run the following command to install the GPU version:
python3 -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple
  • If your machine only has a CPU, run the following command to install the CPU version:
python3 -m pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
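
Optionally, you can confirm the installation with PaddlePaddle's built-in self check:

# Optional: run PaddlePaddle's installation self check
python3 -c "import paddle; paddle.utils.run_check()"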

1.2 Install PaddleOCR whl package
# Install paddleocr; version 2.6 or later is recommended
pip3 install "paddleocr>=2.6.0.3"

# Install paddleclas, which the image orientation classification feature depends on (skip it if you do not need that feature)
pip3 install "paddleclas>=2.4.3"
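
A quick import check (just a sanity-check sketch, not an official step) can confirm both packages are usable:

# Optional: verify that the packages import correctly
python3 -c "from paddleocr import PPStructure; print('paddleocr OK')"
python3 -c "import paddleclas; print('paddleclas OK')"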

2. Quick Use

Use the following commands to quickly run table recognition on a sample image.

cd PaddleOCR/ppstructure

# Download the models
mkdir inference && cd inference
# Download the PP-OCRv3 text detection model and extract it
wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_infer.tar && tar xf ch_PP-OCRv3_det_infer.tar
# Download the PP-OCRv3 text recognition model and extract it
wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_infer.tar && tar xf ch_PP-OCRv3_rec_infer.tar
# Download the PP-StructureV2 Chinese table recognition model and extract it
wget https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar && tar xf ch_ppstructure_mobile_v2.0_SLANet_infer.tar
cd ..
# Run table recognition
python table/predict_table.py \
    --det_model_dir=inference/ch_PP-OCRv3_det_infer \
    --rec_model_dir=inference/ch_PP-OCRv3_rec_infer  \
    --table_model_dir=inference/ch_ppstructure_mobile_v2.0_SLANet_infer \
    --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt \
    --table_char_dict_path=../ppocr/utils/dict/table_structure_dict_ch.txt \
    --image_dir=docs/table/table.jpg \
    --output=../output/table

After the command completes, the Excel file for each image is saved in the directory specified by the output field, and an HTML file is also produced in the same directory so that you can visually check the cell coordinates and the recognized table.
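
If you prefer to inspect the extracted tables programmatically instead of opening Excel, a small sketch like the one below works (it needs pandas and openpyxl installed; the glob pattern simply picks up whatever .xlsx files were written under the output directory, since the actual file names depend on the table coordinates):

import glob

import pandas as pd  # reading .xlsx files also requires openpyxl

# Collect the Excel files produced by the table recognition step
for xlsx_path in glob.glob('../output/table/**/*.xlsx', recursive=True):
    df = pd.read_excel(xlsx_path)
    print(xlsx_path, df.shape)  # file name and (rows, columns) of the extracted table
    print(df.head())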

The models used above should be the best ones currently available.

If you need to change the models, or want to see more usage options, see: ppstructure/table/README_ch.md · PaddlePaddle/PaddleOCR - Gitee.com


3. Easy to Use

3.1 Command-Line Usage

Executing the following commands will automatically download and use the lightweight models.

  • Image orientation classification + layout analysis + table recognition
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --image_orientation=true
  • Layout analysis + table recognition

paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure
  • Layout analysis

paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --table=false --ocr=false
  • Table recognition
paddleocr --image_dir=ppstructure/docs/table/table.jpg --type=structure --layout=false

3.2 Python Script Usage
  • Image orientation classification + layout analysis + table recognition
import os
import cv2
from paddleocr import PPStructure, draw_structure_result, save_structure_res

# Enable image orientation classification in addition to layout analysis and table recognition
table_engine = PPStructure(show_log=True, image_orientation=True)

save_folder = './output'
img_path = 'ppstructure/docs/table/1.png'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')
    print(line)

from PIL import Image

font_path = 'doc/fonts/simfang.ttf'  # font file provided in the PaddleOCR repo
image = Image.open(img_path).convert('RGB')
im_show = draw_structure_result(image, result, font_path=font_path)
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')
  • Layout analysis + table recognition
import os
import cv2
from paddleocr import PPStructure, draw_structure_result, save_structure_res

table_engine = PPStructure(show_log=True)

save_folder = './output'
img_path = 'ppstructure/docs/table/1.png'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')
    print(line)

from PIL import Image

font_path = 'doc/fonts/simfang.ttf'  # font file provided in the PaddleOCR repo
image = Image.open(img_path).convert('RGB')
im_show = draw_structure_result(image, result, font_path=font_path)
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')
  • Layout analysis
import os
import cv2
from paddleocr import PPStructure, save_structure_res

# Disable table recognition and OCR to run layout analysis only
table_engine = PPStructure(table=False, ocr=False, show_log=True)

save_folder = './output'
img_path = 'ppstructure/docs/table/1.png'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')
    print(line)
  • Table recognition
import os
import cv2
from paddleocr import PPStructure, save_structure_res

# Disable layout analysis to run table recognition on the whole image
table_engine = PPStructure(layout=False, show_log=True)

save_folder = './output'
img_path = 'ppstructure/docs/table/table.jpg'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')
    print(line)

3.3 Description of Returned Results

The return value of PP-Structure is a list of dicts. An example is shown below:

Layout analysis + table recognition

[
  {   'type': 'Text',
      'bbox': [34, 432, 345, 462],
      'res': ([[36.0, 437.0, 341.0, 437.0, 341.0, 446.0, 36.0, 447.0], [41.0, 454.0, 125.0, 453.0, 125.0, 459.0, 41.0, 460.0]],
                [('Tigure-6. The performance of CNN and IPT models using difforen', 0.90060663), ('Tent  ', 0.465441)])
  }
]

Each field in the dict is described as follows:

type: the type of the image region.

bbox: the coordinates of the image region in the original image, given as [upper-left x, upper-left y, lower-right x, lower-right y].

res: the OCR or table recognition result of the image region.
  • Table: a dict with the following fields:
        html: the HTML string of the table. In code usage mode, pass return_ocr_result_in_table=True to also get the detection and recognition results of each text inside the table, returned in the following fields:
        boxes: text detection coordinates
        rec_res: text recognition results
  • OCR: a tuple containing the detection coordinates and recognition results of each single line of text.
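
As a concrete example of how to consume these fields, the sketch below filters the result list for table regions and saves their HTML (it reuses the PPStructure call from section 3.2 together with the return_ocr_result_in_table option mentioned above; treat it as an illustrative snippet rather than official sample code):

import cv2
from paddleocr import PPStructure

# return_ocr_result_in_table=True additionally returns per-text boxes and recognition results for tables
table_engine = PPStructure(show_log=True, return_ocr_result_in_table=True)

img = cv2.imread('ppstructure/docs/table/1.png')
result = table_engine(img)

for i, region in enumerate(result):
    if region['type'].lower() == 'table':
        # for a table region, 'res' is a dict holding the reconstructed HTML
        with open('table_{}.html'.format(i), 'w', encoding='utf-8') as f:
            f.write(region['res']['html'])
    else:
        # for other regions, 'res' holds the detected boxes and recognized text
        print(region['type'], region['bbox'])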

After the run completes, each image has a directory with the same name under the directory specified by the output field. Each table in the image is stored as an Excel file, and each image region is cropped and saved. The Excel files and cropped images are named after the coordinates of the corresponding region in the image.

/output/table/1/
  └─ res.txt
  └─ [454, 360, 824, 658].xlsx    table recognition result
  └─ [16, 2, 828, 305].jpg        cropped image region
  └─ [17, 361, 404, 711].xlsx     table recognition result

Analysis Summary

The table recognition model used here is ch_ppstructure_mobile_v2.0_SLANet (the third entry in the official model list), which is currently the best-performing one.

According to the official performance evaluation, its TEDS score on the PubTabNet[1] evaluation dataset reaches 95.89%.

  • TEDS: a measure of how accurately the model restores table information. The metric evaluates not only the table structure but also the text content inside the table.
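
For reference, TEDS (Tree-Edit-Distance-based Similarity) compares the predicted table and the ground-truth table as HTML trees; the commonly cited formulation (from the PubTabNet paper) is roughly

\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max(|T_a|, |T_b|)}

where EditDist is the tree edit distance and |T| is the number of nodes in tree T, so a score of 1 means the table structure and cell contents were reconstructed perfectly.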

The text detection and text recognition models are also the latest versions.

It stands to reason that the recognition results should be very good. However...

My data looks like this:

The recognition result looks like this: some blank rows are not recognized well, and some data gets squeezed into a single row (the pictures above and below do not correspond to each other; they are just there to illustrate the problem).

In the comments on the official website, I also saw other users reporting the same problem, so this seems to be a common issue.

That said, the recognition of simpler tables is still very good.

Official website: PaddlePaddle, an open-source deep learning platform derived from industrial practice (gitee.com)

ppstructure · PaddlePaddle/PaddleOCR - Gitee.com

Origin blog.csdn.net/weixin_45897172/article/details/131445327