PubLayNet: largest dataset ever for document layout analysis

Summary

  • Importance: Recognizing the layout of elements in an unstructured electronic document and converting it into a machine-readable format is important for many downstream tasks.
  • Current shortcomings: Existing publicly available document layout datasets are too small to train deep neural networks from scratch, so models for this task must rely on transfer learning from networks pre-trained on natural image datasets.
  • Contribution of this article: Proposes PubLayNet, a large-scale dataset for document layout analysis containing over 360,000 annotated document page images, generated by automatically matching the content of more than 1 million PDF articles with their XML representations.
  • Experimental results: Neural networks trained on this dataset accurately detect the page layout of electronic documents, and the pre-trained models also transfer well to other domains.
  • Dataset repository: https://github.com/ibm-aur-nlp/PubLayNet.

Introduction

  • The PDF format is ubiquitous, but automated processing of PDF files is very complex.
  • Training a model to process PDF files with machine learning (deep learning) methods requires large amounts of manually annotated data, which is time-consuming and expensive to produce.
  • This paper proposes a method for automatically annotating PDF documents at scale, along with a high-quality document layout dataset, PubLayNet, built with it.
  • Experiments show that models trained on the automatically annotated dataset recognize the layout of scientific articles well, and that models pre-trained on this dataset achieve better results after transfer learning.

Related work

  • Existing datasets for document layout analysis rely on manual annotation and are therefore very small.

Automatic annotation of document layout

  • Data source: documents from the PubMed Central Open Access subset (PMCOA), which are available in both PDF and XML format. The automatic annotation stage uses both representations of the same document.

Layout categories

  • The XML representation of a document contains many different types of nodes; even for humans, it is difficult to tell all of them apart just by looking at the page image.
  • The authors weighed multiple factors and selected five document layout categories: text, title, list, table, and figure.

Labeling algorithm

  • Annotation algorithm overview: first, the layout elements parsed from the PDF are matched with the nodes of the XML document; then the bounding box and segmentation of each layout element are computed; the XML node names determine the category label of each bounding box; finally, a quality-control metric keeps the noise introduced by the annotation process at a very low level. A minimal sketch follows.
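To make the four steps concrete, here is a minimal Python sketch of the pipeline. Every helper name (`match_node_to_pdf`, `compute_geometry`, and so on) is a hypothetical placeholder for logic described in the paper, not the authors' actual code.

```python
# Hypothetical sketch of the annotation pipeline; all helpers are placeholders.
def annotate_page(pdf_page, xml_nodes, quality_threshold=0.99):
    annotations = []
    for node in xml_nodes:
        # Step 1: match the XML node's text against the parsed PDF content.
        match = match_node_to_pdf(node, pdf_page)
        if match is None:
            continue
        # Step 2: compute the bounding box and segmentation of the match.
        bbox, segmentation = compute_geometry(match)
        # Step 3: derive the category label from the XML node name.
        label = category_from_node_name(node)
        annotations.append((label, bbox, segmentation))
    # Step 4: quality control -- discard the page if too little of its
    # parsed content is covered by the annotations.
    if annotation_coverage(pdf_page, annotations) < quality_threshold:
        return None
    return annotations
```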
PMCOA XML preprocessing and parsing
  • XML preprocessing: node tags in the XML document tree that are unsuitable for matching are removed first, so that they cannot affect the results.
  • The nodes of the XML tree are then classified into five groups (a simplified tag-to-group mapping is sketched below):
    • The paper title, abstract, keywords, section headings, and main text are grouped together, because they appear in the same reading order as the PDF body text.
    • The copyright notice, credentials, authors, affiliations, acknowledgments, and abbreviations are grouped together, because their position on the page does not necessarily follow the reading order of the PDF document.
    • Figures form a separate group (including the figure caption and the figure itself).
    • Tables form a separate group (including the table caption, footnotes, and the table itself).
    • Lists form a separate group.
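PMC XML follows a JATS-style schema, so the grouping above can be pictured as a mapping from node tags to matching groups. The tag names below are illustrative JATS tags, not necessarily the exact set the authors used.

```python
# Illustrative JATS-style tag-to-group mapping; the authors' exact tag
# list is not reproduced in this summary.
NODE_GROUPS = {
    # Read in the same order as the PDF body text.
    "article-title": "flow", "abstract": "flow", "kwd-group": "flow",
    "title": "flow", "p": "flow",
    # Position on the page is not tied to reading order.
    "permissions": "floating", "contrib-group": "floating",
    "aff": "floating", "ack": "floating",
    # Self-contained elements matched as a whole.
    "fig": "figure", "table-wrap": "table", "list": "list",
}
```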
PMCOA PDF parsing

The authors divide the elements parsed from a PDF into three major categories (a pdfminer-based sketch follows the list):

  • Text boxes (outlined in red in the paper's figure): composed of text lines. Each text box records its text, its bounding box, and its text lines; each text line in turn records its own text and bounding box.
  • Images (outlined in green): each image carries a bounding box.
  • Geometric shapes (outlined in yellow): straight lines, curves, and rectangles; each shape is associated with a bounding box.
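The paper parses PDFs with PDFMiner; the exact code is not published, but a minimal sketch using the pdfminer.six fork might look like this (the file name is a placeholder):

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox, LTImage, LTFigure, LTCurve

def walk(container):
    for el in container:
        if isinstance(el, LTTextBox):
            for line in el:  # LTTextLine objects
                print("text line:", line.get_text().strip(), line.bbox)
        elif isinstance(el, LTImage):
            print("image:", el.bbox)
        elif isinstance(el, LTCurve):
            # LTLine and LTRect are subclasses of LTCurve, so this branch
            # covers straight lines, curves, and rectangles.
            print("shape:", el.bbox)
        elif isinstance(el, LTFigure):
            walk(el)  # images are often nested inside figure containers

for page_layout in extract_pages("article.pdf"):  # placeholder path
    walk(page_layout)
```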
String preprocessing
  • Strings in both the XML and the PDF files are Unicode-encoded;
  • To make matching between the XML and PDF content more robust, strings are normalized to Unicode NFKD form (Normalization Form KD).
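In Python this corresponds to `unicodedata.normalize` with the NFKD form; a short sketch:

```python
import unicodedata

def nfkd(s: str) -> str:
    # NFKD decomposes compatibility characters such as ligatures and
    # full-width forms, so visually equivalent strings compare equal.
    return unicodedata.normalize("NFKD", s)

assert nfkd("ﬁeld") == nfkd("field")  # the "fi" ligature is decomposed
```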
PDF-XML matching algorithm
  • The PDF content parsed by PDFMiner always differs slightly from the text assembled from the XML nodes, so the authors resort to a fuzzy string matching algorithm, which tolerates small differences.
  • The fuzzysearch package is used to find the substring that best matches a given string, i.e., the one with the smallest Levenshtein distance to it.
  • The longer the string, the larger the maximum Levenshtein distance that is allowed (see the sketch below).
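A small sketch with the fuzzysearch package. The length-proportional distance budget shown here is an assumption about how "longer string, larger allowed distance" might be implemented; the authors' exact ratio is not given in this summary.

```python
from fuzzysearch import find_near_matches

pdf_text = "Results were signiﬁcant (p < 0.05) across all groups."
xml_text = "Results were significant (p<0.05) across all groups."

# Assumed rule: allow roughly one edit per ten characters of the query.
max_dist = max(1, len(xml_text) // 10)
matches = find_near_matches(xml_text, pdf_text, max_l_dist=max_dist)
best = min(matches, key=lambda m: m.dist)  # smallest Levenshtein distance
print(best.start, best.end, best.dist)
```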
Generate instance segmentation
Quality control (QC)
  • Some PDF parsing results differ substantially from the document's XML representation; when the discrepancy exceeds what the matching tolerates, the annotation algorithm may fail to identify elements on a page. A way is therefore needed to assess the quality of annotated PDF pages and to exclude poorly annotated pages from PubLayNet.
  • Annotation quality metric: the ratio of the area of the annotated text boxes, images, and geometric shapes to the total area of the parsed text boxes, images, and geometric shapes. Non-title pages with a ratio below 99% are removed (a very strict standard); a hypothetical helper is sketched below.
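A hypothetical helper computing this coverage ratio from element bounding boxes, with coordinates assumed to be (x0, y0, x1, y1):

```python
def annotation_coverage(annotated_boxes, parsed_boxes):
    """Hypothetical QC metric: area of annotated elements divided by the
    total area of all parsed text boxes, images, and shapes."""
    def area(box):
        x0, y0, x1, y1 = box
        return max(0.0, x1 - x0) * max(0.0, y1 - y0)

    total = sum(area(b) for b in parsed_boxes)
    covered = sum(area(b) for b in annotated_boxes)
    return covered / total if total else 0.0

# Non-title pages are kept only if the ratio is at least 0.99.
```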

Data partition

  • The annotated PDF documents are partitioned at the journal level into three parts: a training set, a validation set, and a test set.
  • Journals selected for the validation and test sets must not contain too many pages, nor too few figures, tables, and lists, to prevent these sets from being dominated by a single journal.
  • Half of the qualifying journals were randomly assigned to the validation set and the other half to the test set. To remove noise from these two sets, pages with substantial errors were manually discarded and pages with minor errors were manually corrected.
  • Journals that did not meet the inclusion criteria for the validation and test sets were used to generate the training set.
  • PubLayNet is one to two orders of magnitude larger than any existing document layout dataset.

Results

Deep-learning-based document layout recognition

  • The authors first converted PDF document pages into images, then trained a Faster R-CNN model and a Mask R-CNN model on them.
  • Each model was trained for 180,000 iterations with a base learning rate of 0.01, reduced by a factor of 10 at iterations 120,000 and 160,000, using a batch size of 8 across 8 GPUs. The backbone of each model is a ResNet-101 pre-trained on ImageNet, and the models are evaluated with mAP at IoU thresholds (see the configuration sketch after this list).
  • Both the Faster R-CNN and Mask R-CNN models achieve mAP values above 0.9.
  • The models identify tables and figures more accurately than text, titles, and lists; accuracy is lowest for titles.
  • The authors also attribute part of the remaining error to noise in the PubLayNet annotations and promise to improve the dataset's quality.
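The paper's models were trained with the Caffe2-based Detectron framework; as an assumption-laden sketch, the same hyperparameters could be expressed in Detectron2 like this:

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
# The base config ships with a ResNet-101 backbone pre-trained on ImageNet.
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 5    # text, title, list, table, figure
cfg.SOLVER.IMS_PER_BATCH = 8           # batch size 8 on 8 GPUs
cfg.SOLVER.BASE_LR = 0.01
cfg.SOLVER.MAX_ITER = 180_000
cfg.SOLVER.STEPS = (120_000, 160_000)  # divide LR by 10 at these iterations
cfg.SOLVER.GAMMA = 0.1
```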

Table detection

  • The ICDAR 2013 Table Competition is one of the best-known benchmarks for table detection in PDF documents.
  • The authors also built a dataset of table-containing PDF pages from the documents they had collected, trained a Faster R-CNN model and a Mask R-CNN model on it, and then fine-tuned them on the 170 PDF pages provided by the competition.
  • Fine-tuning configuration: the initial learning rate is 0.001 and is reduced by a factor of 10 every 10 iterations, for 200 iterations in total (see the sketch after this list).
  • The fine-tuned Faster R-CNN model achieved the best ICDAR 2013 result to date, with an F1 score of 0.972.
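Continuing the Detectron2 sketch above, fine-tuning on the competition pages might look like this. The checkpoint path is a placeholder, and the step schedule renders this summary's description literally:

```python
# Fine-tuning sketch (assumptions: placeholder weight file, Detectron2).
cfg.MODEL.WEIGHTS = "publaynet_mask_rcnn_r101.pth"  # placeholder path
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1                 # single "table" class
cfg.SOLVER.BASE_LR = 0.001
cfg.SOLVER.MAX_ITER = 200
cfg.SOLVER.GAMMA = 0.1
cfg.SOLVER.STEPS = tuple(range(10, 200, 10))  # step LR down every 10 iters
```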

Fine-tuning for other domains

  • 2,131 PDF summary plan descriptions (SPDs) from health insurance companies were manually annotated.
  • The pre-trained Faster R-CNN and Mask R-CNN models were each fine-tuned on them, and the results were compared using five-fold cross-validation.
  • Three fine-tuning settings were used: initializing the backbone with an ImageNet-pretrained model, initializing the whole network with a COCO-pretrained model, and initializing the whole network with a PubLayNet-pretrained model. A fourth setting applied the PubLayNet-pretrained network directly, without fine-tuning.
  • The model without fine-tuning performs significantly worse than any of the fine-tuned models; fine-tuning greatly improves the results.
  • Among the three fine-tuned settings, the model fine-tuned from the PubLayNet-pretrained network performs best overall.
  • The authors also found that for the table category, fine-tuning brings the smallest performance improvement.

Discussion

  • The worst-performing part of the automatic annotation process is the annotation of article titles, caused by the varied ways titles are displayed across articles.
  • To make PubLayNet applicable to more fields, the authors verified that the journals in the training set are diverse enough for the model to predict the layout of unseen journals well; for other types of PDF files, good results can still be obtained through transfer learning.

Conclusion

  • PubLayNet, currently the largest annotated document layout dataset, was generated automatically.
  • State-of-the-art object detection networks trained on PubLayNet achieve good layout recognition results on biomedical articles.
  • Object detection models pre-trained on PubLayNet help identify the layout elements of health insurance documents.
  • The authors have published the dataset's GitHub URL.
  • Future work: use PMCOA as a data source to build large datasets for other document analysis problems.

Appendix

Some English words in the paper

  • parse: to analyze text or a document into its structural components.
  • ubiquitous: present everywhere.
  • heuristic: a practical rule of thumb.
  • schema: a formal structure or blueprint, e.g., of an XML document.
  • distinctive: unique, characteristic.
  • redundancy: superfluous repetition.
  • heterogeneity: diversity, non-uniformity.
  • partition: to divide into parts.
  • prestigious: highly respected.
  • moderate: mild, not severe.
  • discrepancy: a difference or inconsistency.
  • inline: placed within the flow of text.
  • Levenshtein distance: a measure of the distance between two strings (the minimum number of single-character edits needed to turn one into the other).
  • miscellaneous: of various kinds.
  • polygon: a plane figure with straight sides.
  • fuzzy: approximate, inexact.
  • canonical: standard, normalized.
  • aggregate: to combine into a total.
  • placement: the act of positioning something.
  • Creative Commons license: a public copyright license that allows free distribution of a work.

Some other proper nouns in the paper

  • XML: Extensible Markup Language, a markup language for encoding documents, widely used for storing and transmitting data (network protocols, data storage, configuration files, etc.). An XML file is built from tags and elements: tags usually appear in pairs, a start tag and an end tag, and an element is the data enclosed by its tags.
  • PubMed Central (PMC): a full-text archive of biomedical and life science journals established by the National Library of Medicine (NLM) at the National Institutes of Health (NIH). The database is developed and maintained by the National Center for Biotechnology Information (NCBI), a division of the NLM, and has been freely open to the global public since February 2000.
  • PDFMiner: an open-source Python library for extracting text and layout information from PDF files, including text boxes, text lines, images, and geometric shapes together with their bounding-box coordinates.
