Editor's note: Retrieval-Augmented Generation (RAG) has become one of the key technologies for bringing large-scale knowledge into large language models. However, how to efficiently process semi-structured and unstructured data, especially tabular data in documents, remains a major challenge for RAG systems.
To address this pain point, the author of this article proposes a novel solution for processing tabular data. The author first systematically reviews the core technologies for table processing in RAG systems, including table parsing and index structure design, and surveys some existing open source solutions. On this basis, the author presents his own innovation: using the Nougat tool to parse table content in documents accurately and efficiently, using a language model to summarize tables and their titles, building a new document summary index structure, and providing complete code implementation details.
The advantage of this method is that it parses tables effectively and fully considers the relationship between the table summary and the table itself, without requiring a multimodal LLM, which saves parsing costs. Let us wait and see how this approach is further applied and developed in practice.
Author | Florian June
Compiled | Yue Yang
Implementing a RAG system is a challenging task, especially when tables in unstructured documents need to be parsed and understood. For documents that have been digitized by scanning (scanned documents) or documents in image format, these operations are even harder. There are at least three challenges:
- Scanned documents or documents in image format are relatively complex: document structures are diverse, documents may contain non-text elements, and handwritten and printed content may appear together. All of this makes accurate, automated extraction of table information difficult. Inaccurate document parsing destroys the table structure, and converting incomplete table information into vector representations (embeddings) not only fails to capture the table's semantics effectively but can also easily corrupt the final output of the RAG system.
- How to extract the title of each table and associate it with the table it belongs to.
- How to efficiently organize and store the key semantic information of tables through reasonable index structure design.
This article first introduces how to manage and process tabular data in Retrieval-Augmented Generation (RAG) systems, then reviews some existing open source solutions, and finally designs and implements a novel tabular data management method based on current technologies.
01 Introduction to core technologies related to RAG table data
1.1 Table Parsing
The main function of this module is to accurately extract table structures from unstructured documents or images.
Additional requirement: it is best to also extract the corresponding table title, so that developers can associate it with the table.
According to my current understanding, there are several methods, as shown in Figure 1:
Figure 1: Table parser. Picture provided by the original author.
(a). Use a multimodal LLM (such as GPT-4V[1]) to recognize tables and extract information from each PDF page.
- Input: PDF page in image format
- Output: Tabular data in JSON or another format. If the multimodal LLM cannot extract tabular data, it should summarize the PDF image and return that summary.
(b). Use professional table detection models (such as Table Transformer[2]) to identify table structures.
- Input: PDF page image
- Output: table image
(c). Use open source frameworks such as unstructured[3], or other frameworks that also rely on object detection models (this article[4] details unstructured's table detection process). These frameworks can parse the entire document and extract table-related content from the parsed results (a minimal code sketch of this route is shown at the end of this subsection).
- Input: Document in PDF or image format
- Output: Table in plain text or HTML format (obtained from parsing the entire document)
(d). Use end-to-end models such as Nougat[5] and Donut[6] to parse the entire document and extract table-related content. This approach does not require an OCR model.
- Input: Document in PDF or image format
- Output: Table in LaTeX or JSON format (obtained from parsing the entire document)
It should be noted that no matter which method is used to extract table information, the table title should be extracted as well, because in most cases the table title is the document or paper author's brief description of the table and can summarize its contents to a large extent.
Among the above four methods, method (d) retrieves table titles most conveniently. This is a big benefit for developers, who can then associate table titles with tables. The following experiments will further illustrate this.
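For readers who want to try the framework route of method (c) before we commit to method (d), the following is a minimal sketch using unstructured's partition_pdf. It is only a sketch, not part of this article's final solution; the file name and parameter values are illustrative assumptions.
from unstructured.partition.pdf import partition_pdf

# Minimal sketch of method (c): extract tables with the unstructured library.
# "paper.pdf" and the parameter values below are illustrative assumptions.
elements = partition_pdf(
    filename="paper.pdf",
    strategy="hi_res",             # object-detection based pipeline
    infer_table_structure=True,    # keep the table structure as HTML
)

for el in elements:
    if el.category == "Table":
        print(el.text)                   # table content as plain text
        print(el.metadata.text_as_html)  # table content as HTML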
1.2 Index structures for tabular data
There are roughly the following types of indexing methods:
(e). Only index tables in image format.
(f). Only index tables in plain text or JSON format.
(g). Only index tables in LaTeX format.
(h). Only index the table summary.
(i). Small-to-big indexing (Translator's Note: it combines fine-grained indexing, such as indexing each row or a table summary, with coarse-grained indexing, such as indexing the entire table as an image, plain text, or LaTeX, forming a hierarchical, small-to-big index structure), or using table summaries to build the index, as shown in Figure 2.
The content of a small chunk (Translator's Note: the data block at the fine-grained index level) is, for example, a single table row or the table summary, each treated as an independent small data block.
The content of a big chunk (Translator's Note: the data block at the coarse-grained index level) may be the entire table in image, plain text, or LaTeX format.
Figure 2: Indexing small-to-big (top) and using table summaries (middle, bottom). Picture provided by the original author.
As mentioned above, table summaries are typically generated with an LLM:
- Input: the table in image, text, or LaTeX format
- Output: the table summary
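To make the small-to-big idea concrete, here is a minimal, framework-free sketch in plain Python; the table content and summary below are hypothetical placeholders, and the full LangChain-based implementation used in this article appears in section 3.3.
import uuid

# Minimal sketch of a small-to-big / summary index: the small chunks (summaries)
# are what gets embedded and searched; each keeps the id of its big parent chunk (the full table).
# The LaTeX snippet and the summary text are hypothetical placeholders.
full_tables = {}    # big chunks: doc_id -> full table (LaTeX or plain text)
small_chunks = []   # small chunks: what actually gets embedded and searched

for latex_table, summary in [
    ("\\begin{table}...\\end{table}", "Per-layer complexity for different layer types."),
]:
    doc_id = str(uuid.uuid4())
    full_tables[doc_id] = latex_table
    small_chunks.append({"doc_id": doc_id, "text": summary})

# At query time: embed the query, search over small_chunks,
# then pass full_tables[best_match["doc_id"]] to the LLM as context.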
1.3 Approaches that do not require table parsing, index building, or RAG
Some algorithms do not require parsing of tabular data.
(j). Send the relevant image (a PDF document page) and the user's query to a VQA model (such as DAN[7]) (Translator's Note: short for Visual Question Answering model, a model that combines computer vision and natural language processing techniques to answer natural language questions about image content) or a multimodal LLM, and return the answer.
- Content to be indexed: Documents in image format
- What to send to the VQA model or multimodal LLM: Query + the corresponding document page as an image
(k). Send the relevant PDF page in text format and the user's query to the LLM, and return the answer.
- Content to be indexed: Documents in text format
- Content sent to LLM: Query + the corresponding document page in text format
(l). Send relevant document images (PDF document pages), text blocks and the user's Query to multi-modal LLM (such as GPT-4V, etc.), and then directly return the answer.
- Content to be indexed: Documents in image format and document chunks in text format
- Content sent to multimodal LLM: Query + document in corresponding image format + corresponding text chunks
Additionally, here are some methods that don't require indexing, as shown in Figures 3 and 4:
Figure 3: Category (m) (Translator’s Note: Content introduced in the first paragraph below). Picture provided by the original author.
(m). First, parse all tables in the document into image form using any of the methods from (a) to (d). Then, all table images and the user's query are sent directly to a multi-modal LLM (such as GPT-4V, etc.) and the answer is returned.
- Content to be indexed: None
- Content sent to multimodal LLM: Query + all tables that have been converted to image format
Figure 4: Category (n) (Translator's Note: Content introduced in the first paragraph below). Picture provided by the original author.
(n). Take the tables in image format extracted as in method (m), use an OCR model to recognize all the text in those tables, and then send the recognized table text together with the user's query directly to the LLM to return the answer (a minimal sketch of this route follows this list).
- Content to be indexed: None
- Content sent to LLM: User's Query + all table contents (sent in text format)
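The following is a minimal sketch of method (n), assuming pytesseract for OCR and the OpenAI chat API for the LLM; the image path, model name, and question are illustrative assumptions.
import pytesseract
from PIL import Image
from openai import OpenAI

# Minimal sketch of method (n): OCR a table image, then send the text plus the query to an LLM.
# "table.png", the model name, and the question wording are illustrative assumptions.
table_text = pytesseract.image_to_string(Image.open("table.png"))  # recognize the table text

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": f"Answer the question using only this table text:\n{table_text}\n\nQuestion: Which model has the highest BLEU score?",
    }],
)
print(response.choices[0].message.content)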
It is worth noting that when processing tables in documents, some methods do not use RAG (Retrieval-Augmented Generation) technology:
- The first type does not use an LLM. Instead, models (such as other BERT-inspired, Transformer-based language models) are trained on specific datasets so that they better support table understanding tasks, e.g. TAPAS[8].
- The second type uses an LLM, relying on pre-training, fine-tuning, or prompt engineering so that the LLM can complete table understanding tasks, e.g. GPT4Table[9].
02 Existing open source solutions for table processing
The previous section summarized and classified the key technologies for tabular data processing in RAG systems. Before proposing the solution we will implement in this article, let's explore some open source solutions.
LlamaIndex proposes four methods [10], of which the first three all use multimodal models.
- Retrieve the relevant PDF page image and send it together with the user's query to GPT-4V.
- Convert each PDF page to image format and let GPT-4V perform image reasoning on each page. Build a Text Vector Store index over the image-reasoning output (Translator's Note: the text inferred from the images is converted into vectors and indexed), and then query this Image Reasoning Vector Store (Translator's Note: that is, the index just created) to find the answer.
- Use Table Transformer to crop table information from the retrieved images, and then send these cropped table images to GPT-4V to obtain query responses (Translator's Note: Send Query to the model and get the answers returned by the model).
- Apply OCR on the cropped table image and send the data to GPT4/GPT-3.5 to answer the user's query.
To summarize the above four methods:
- The first method is similar to method (j) introduced in this article and does not require table parsing. But it turned out that, even when the answer is plainly visible in the image, the model does not produce the correct answer.
- The second method involves table parsing and corresponds to method (a). The indexed content may be the table content itself or a content summary, depending entirely on what GPT-4V returns, so it may correspond to method (f) or (h). The disadvantage is that GPT-4V's ability to identify tables and extract their contents from document images is inconsistent, especially when the document image mixes tables, text, and other images (which is common in PDF documents).
- The third method is similar to method (m) and does not require indexing.
- The fourth method is similar to method (n) and also does not require indexing. The results showed that the reason for the wrong answers was the inability to effectively extract tabular information from the images.
LlamaIndex's tests found the third method to work best overall. However, in my own tests, the third method struggled even to detect the table, let alone correctly extract and associate the table title with the table content.
LangChain has also proposed solutions for RAG over semi-structured data (Semi-structured RAG)[11]. The core techniques include:
- Use unstructured for table parsing, which is a class (c) method.
- The indexing method is a document summary index (Translator's Note: document summary information is used as the index content), a class (i) method. The data block at the fine-grained index level is the table summary; the data block at the coarse-grained index level is the original table content (in text format).
As shown in Figure 5:
Figure 5: Langchain’s Semi-structured RAG solution. Source: Semi-structured RAG[11]
LangChain's Semi-structured and Multi-modal RAG[12] cookbook proposes three options, whose architecture is shown in Figure 6.
Figure 6: Langchain’s semi-structured and multi-modal RAG scheme. Source: Semi-structured and Multi-modal RAG[12].
Option 1 is similar to method (l) above. It uses multimodal embeddings (such as CLIP[13]) to convert images and text into embedding vectors, retrieves both with a similarity search, and then passes the raw images and text chunks to a multimodal LLM, which processes them together and generates the answer.
Option 2 uses a multimodal LLM (such as GPT-4V[14], LLaVA[15], or FUYU-8b[16]) to produce text summaries from the images. The summaries are converted into embedding vectors, which are used to retrieve the text that matches the user's query; the retrieved content is then passed to an LLM to generate the answer.
- Table data is parsed using unstructured, a class (c) method.
- The indexing method is a document summary index (Translator's Note: document summary information is used as the index content), a class (i) method. The data block at the fine-grained index level is the table summary; the data block at the coarse-grained index level is the table content in text format.
Option 3 uses a multimodal LLM (such as GPT-4V[14], LLaVA[15], or FUYU-8b[16]) to generate text summaries from the images, then embeds those summaries so that image summaries can be retrieved efficiently; each retrieved summary keeps a reference to the raw image. This belongs to method (i) above. Finally, the raw images and text chunks are passed to the multimodal LLM to generate the answer.
03 The solution proposed in this article
The preceding sections summarized, classified, and discussed the key technologies and existing solutions. On this basis, we propose the following solution, as shown in Figure 7. For simplicity, some RAG modules, such as re-ranking and query rewriting, are omitted from the figure.
Figure 7: The solution proposed in this article. Picture provided by the original author.
- Table parsing: Nougat (a class (d) method). In my tests, its table detection is more effective than unstructured (a class (c) method). In addition, Nougat extracts table titles very well, which makes it easy to associate them with their tables.
- Index structure: a document summary index (a class (i) method). The fine-grained index level contains the table content summaries, and the coarse-grained index level contains the corresponding table in LaTeX format together with its title in text format. We implement this with a multi-vector retriever[17] (Translator's Note: a retriever that queries the document summary index and can handle multiple vectors at once in order to efficiently retrieve the document summaries relevant to the query).
- How to obtain a table content summary: send the table and its title to the LLM for summarization.
The advantage of this method is that it parses tables effectively and fully considers the relationship between the table summary and the table itself, while eliminating the need for a multimodal LLM, which saves cost.
3.1 How Nougat works
Nougat[18] is built on the Donut[19] architecture. It recognizes text implicitly through the network itself, without requiring any OCR-related input or modules.
Figure 8: End-to-end architecture following Donut [19]. The Swin Transformer encoder takes a document image and converts it into latent embeddings (Translator's Note: the information of the image is encoded in a latent space), and then converts it into a sequence of tokens in an autoregressive manner. Source: Nougat: Neural Optical Understanding for Academic Documents.[18]
Nougat's ability to parse formulas is impressive[20], and its ability to parse tables is also excellent. As shown in Figure 9, it associates tables with their titles, which is very convenient:
Figure 9: Nougat running results. The result file is in Mathpix Markdown format (can be opened through the vscode plug-in), and the table is presented in LaTeX format.
In a test on a dozen papers, I found that the table title is always placed on the line immediately after the table. This consistency suggests it is no accident, so we are interested in how Nougat achieves it.
Given that this is an end-to-end model lacking intermediate results, its performance is likely to be heavily dependent on its training data.
Based on an analysis of the code, the caption is stored immediately after \end{table} (via caption_parts), which suggests that this placement is consistent with how tables are organized in the training data.
def format_element(
    element: Element, keep_refs: bool = False, latex_env: bool = False
) -> List[str]:
    """
    Formats a given Element into a list of formatted strings.

    Args:
        element (Element): The element to be formatted.
        keep_refs (bool, optional): Whether to keep references in the formatting. Default is False.
        latex_env (bool, optional): Whether to use LaTeX environment formatting. Default is False.

    Returns:
        List[str]: A list of formatted strings representing the formatted element.
    """
    ...
    ...
    if isinstance(element, Table):
        parts = [
            "[TABLE%s]\n\\begin{table}\n"
            % (str(uuid4())[:5] if element.id is None else ":" + str(element.id))
        ]
        parts.extend(format_children(element, keep_refs, latex_env))
        caption_parts = format_element(element.caption, keep_refs, latex_env)
        remove_trailing_whitespace(caption_parts)
        parts.append("\\end{table}\n")
        if len(caption_parts) > 0:
            parts.extend(caption_parts + ["\n"])
        parts.append("[ENDTABLE]\n\n")
        return parts
    ...
    ...
3.2 Advantages and Disadvantages of Nougat
Advantages:
- Nougat can accurately parse sections that were difficult to parse with previous parsing tools, such as formulas and tables, into LaTeX source code.
- The result of Nougat's parsing is a semi-structured document similar to Markdown.
- Ability to easily obtain table titles and easily associate them with tables.
Disadvantages:
- Nougat's parsing speed is slow, which may cause difficulties in large-scale applications.
- Since Nougat's training data set is basically scientific papers, this technique works well on documents with similar structures. Performance degrades when processing non-Latin text documents.
- Nougat is trained on a single page of a scientific paper at a time and has no knowledge of the other pages, which can lead to some inconsistencies in the parsed content. If recognition quality is poor, consider splitting the PDF into separate pages and parsing them page by page (a minimal sketch of this is shown after this list).
- Table parsing in two-column papers is not as good as in single-column papers.
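As noted in the list above, one workaround for poor recognition is to parse page by page. The following is a minimal sketch of splitting a PDF into single-page files, assuming the pypdf library; the file names are illustrative.
from pypdf import PdfReader, PdfWriter

# Minimal sketch: split a PDF into single-page PDFs so Nougat can parse them one at a time.
# "paper.pdf" and the output naming scheme are illustrative assumptions.
reader = PdfReader("paper.pdf")
for i, page in enumerate(reader.pages, start=1):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"paper_page_{i}.pdf", "wb") as f:
        writer.write(f)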
3.3 Code implementation
First, install the relevant Python packages:
pip install langchain
pip install chromadb
pip install nougat-ocr
After the installation is complete, check the versions of the installed Python packages:
langchain 0.1.12
langchain-community 0.0.28
langchain-core 0.1.31
langchain-openai 0.0.8
langchain-text-splitters 0.0.1
chroma-hnswlib 0.7.3
chromadb 0.4.24
nougat-ocr 0.1.17
Set up a working environment and import packages:
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPEN_AI_KEY"
import subprocess
import uuid
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_core.runnables import RunnablePassthrough
Download the paper "Attention Is All You Need"[21] to the path YOUR_PDF_PATH, run Nougat to parse the PDF file, and obtain the table data in LaTeX format and the table titles in text format from the parsing results. The first execution will download the necessary model files to the local environment.
def june_run_nougat(file_path, output_dir):
    # Run Nougat and store results as Mathpix Markdown
    cmd = ["nougat", file_path, "-o", output_dir, "-m", "0.1.0-base", "--no-skipping"]
    res = subprocess.run(cmd)
    if res.returncode != 0:
        print("Error when running nougat.")
        return res.returncode
    else:
        print("Operation Completed!")
        return 0

def june_get_tables_from_mmd(mmd_path):
    # Extract each table (from \begin{table} to \end{table}, plus the caption line
    # that Nougat places immediately after \end{table}) from the Mathpix Markdown file.
    f = open(mmd_path)
    lines = f.readlines()
    res = []
    tmp = []
    flag = ""
    for line in lines:
        if line == "\\begin{table}\n":
            flag = "BEGINTABLE"
        elif line == "\\end{table}\n":
            flag = "ENDTABLE"
        if flag == "BEGINTABLE":
            tmp.append(line)
        elif flag == "ENDTABLE":
            tmp.append(line)
            flag = "CAPTION"
        elif flag == "CAPTION":
            tmp.append(line)
            flag = "MARKDOWN"
            print('-' * 100)
            print(''.join(tmp))
            res.append(''.join(tmp))
            tmp = []
    return res

file_path = "YOUR_PDF_PATH"
output_dir = "YOUR_OUTPUT_DIR_PATH"

if june_run_nougat(file_path, output_dir) != 0:
    import sys
    sys.exit(1)

mmd_path = output_dir + '/' + os.path.splitext(file_path)[0].split('/')[-1] + ".mmd"
tables = june_get_tables_from_mmd(mmd_path)
The function june_get_tables_from_mmd extracts from the mmd file everything from \begin{table} to \end{table}, plus the first line after \end{table}, as shown in Figure 10.
Figure 10: Nougat running results. The result file is in Mathpix Markdown format (can be opened through the vscode plug-in), and the parsed table content is in latex format. The function of function june_get_tables_from_mmd is to extract the table information in the red box. Picture provided by the original author.
However, there is no official document stating that the table title must be placed below the table, or that the table should start with \begin{table} and end with \end{table}. Therefore, june_get_tables_from_mmd is a heuristic method.
The following are the results of table parsing of the PDF document:
Operation Completed!
----------------------------------------------------------------------------------------------------
\begin{table}
\begin{tabular}{l c c c} \hline \hline Layer Type & Complexity per Layer & Sequential Operations & Maximum Path Length \ \hline Self-Attention & (O(n^{2}\cdot d)) & (O(1)) & (O(1)) \ Recurrent & (O(n\cdot d^{2})) & (O(n)) & (O(n)) \ Convolutional & (O(k\cdot n\cdot d^{2})) & (O(1)) & (O(log_{k}(n))) \ Self-Attention (restricted) & (O(r\cdot n\cdot d)) & (O(1)) & (O(n/r)) \ \hline \hline \end{tabular}
\end{table}
Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. (n) is the sequence length, (d) is the representation dimension, (k) is the kernel size of convolutions and (r) the size of the neighborhood in restricted self-attention.
----------------------------------------------------------------------------------------------------
\begin{table}
\begin{tabular}{l c c c c} \hline \hline \multirow{2}{*}{Model} & \multicolumn{2}{c}{BLEU} & \multicolumn{2}{c}{Training Cost (FLOPs)} \ \cline{2-5} & EN-DE & EN-FR & EN-DE & EN-FR \ \hline ByteNet [18] & 23.75 & & & \ Deep-Att + PosUnk [39] & & 39.2 & & (1.0\cdot 10^{20}) \ GNMT + RL [38] & 24.6 & 39.92 & (2.3\cdot 10^{19}) & (1.4\cdot 10^{20}) \ ConvS2S [9] & 25.16 & 40.46 & (9.6\cdot 10^{18}) & (1.5\cdot 10^{20}) \ MoE [32] & 26.03 & 40.56 & (2.0\cdot 10^{19}) & (1.2\cdot 10^{20}) \ \hline Deep-Att + PosUnk Ensemble [39] & & 40.4 & & (8.0\cdot 10^{20}) \ GNMT + RL Ensemble [38] & 26.30 & 41.16 & (1.8\cdot 10^{20}) & (1.1\cdot 10^{21}) \ ConvS2S Ensemble [9] & 26.36 & **41.29** & (7.7\cdot 10^{19}) & (1.2\cdot 10^{21}) \ \hline Transformer (base model) & 27.3 & 38.1 & & (\mathbf{3.3\cdot 10^{18}}) \ Transformer (big) & **28.4** & **41.8** & & (2.3\cdot 10^{19}) \ \hline \hline \end{tabular}
\end{table}
Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.
----------------------------------------------------------------------------------------------------
\begin{table}
\begin{tabular}{c|c c c c c c c c|c c c c} \hline \hline & (N) & (d_{\text{model}}) & (d_{\text{ff}}) & (h) & (d_{k}) & (d_{v}) & (P_{drop}) & (\epsilon_{ls}) & train steps & PPL & BLEU & params \ \hline base & 6 & 512 & 2048 & 8 & 64 & 64 & 0.1 & 0.1 & 100K & 4.92 & 25.8 & 65 \ \hline \multirow{4}{*}{(A)} & \multicolumn{1}{c}{} & & 1 & 512 & 512 & & & & 5.29 & 24.9 & \ & & & & 4 & 128 & 128 & & & & 5.00 & 25.5 & \ & & & & 16 & 32 & 32 & & & & 4.91 & 25.8 & \ & & & & 32 & 16 & 16 & & & & 5.01 & 25.4 & \ \hline (B) & \multicolumn{1}{c}{} & & \multicolumn{1}{c}{} & & 16 & & & & & 5.16 & 25.1 & 58 \ & & & & & 32 & & & & & 5.01 & 25.4 & 60 \ \hline \multirow{4}{*}{(C)} & 2 & \multicolumn{1}{c}{} & & & & & & & & 6.11 & 23.7 & 36 \ & 4 & & & & & & & & 5.19 & 25.3 & 50 \ & 8 & & & & & & & & 4.88 & 25.5 & 80 \ & & 256 & & 32 & 32 & & & & 5.75 & 24.5 & 28 \ & 1024 & & 128 & 128 & & & & 4.66 & 26.0 & 168 \ & & 1024 & & & & & & 5.12 & 25.4 & 53 \ & & 4096 & & & & & & 4.75 & 26.2 & 90 \ \hline \multirow{4}{*}{(D)} & \multicolumn{1}{c}{} & & & & & 0.0 & & 5.77 & 24.6 & \ & & & & & & 0.2 & & 4.95 & 25.5 & \ & & & & & & & 0.0 & 4.67 & 25.3 & \ & & & & & & & 0.2 & 5.47 & 25.7 & \ \hline (E) & \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & & \multicolumn{1}{c}{} & & & & & 4.92 & 25.7 & \ \hline big & 6 & 1024 & 4096 & 16 & & 0.3 & 300K & **4.33** & **26.4** & 213 \ \hline \hline \end{tabular}
\end{table}
Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.
----------------------------------------------------------------------------------------------------
\begin{table}
\begin{tabular}{c|c|c} \hline
**Parser** & **Training** & **WSJ 23 F1** \ \hline Vinyals & Kaiser et al. (2014) [37] & WSJ only, discriminative & 88.3 \ Petrov et al. (2006) [29] & WSJ only, discriminative & 90.4 \ Zhu et al. (2013) [40] & WSJ only, discriminative & 90.4 \ Dyer et al. (2016) [8] & WSJ only, discriminative & 91.7 \ \hline Transformer (4 layers) & WSJ only, discriminative & 91.3 \ \hline Zhu et al. (2013) [40] & semi-supervised & 91.3 \ Huang & Harper (2009) [14] & semi-supervised & 91.3 \ McClosky et al. (2006) [26] & semi-supervised & 92.1 \ Vinyals & Kaiser el al. (2014) [37] & semi-supervised & 92.1 \ \hline Transformer (4 layers) & semi-supervised & 92.7 \ \hline Luong et al. (2015) [23] & multi-task & 93.0 \ Dyer et al. (2016) [8] & generative & 93.3 \ \hline \end{tabular}
\end{table}
Table 4: The Transformer generalizes well to English constituency parsing (Results are on Section 23 of WSJ)* [5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. _CoRR_, abs/1406.1078, 2014.
Then use LLM to summarize the tabular data:
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. The table is formatted in LaTeX, and its caption is in plain text format: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)
# Summary chain
model = ChatOpenAI(temperature = 0, model = "gpt-3.5-turbo")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()
# Get table summaries
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})
print(table_summaries)
The following is a summary of the contents of the four tables in "Attention Is All You Need" [21], as shown in Figure 11:
Figure 11: Content summary of the four tables in "Attention Is All You Need" [21].
Use a Multi-Vector Retriever (Translator's Note: a retriever that queries content in the document summary index and can handle multiple vectors at once to efficiently retrieve the document summaries relevant to the query) to build the document summary index structure[17] (Translator's Note: an index structure that stores document summary information, which can be retrieved or queried as needed).
# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name = "summaries", embedding_function = OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
vectorstore = vectorstore,
docstore = store,
id_key = id_key,
search_kwargs={"k": 1} # Solving Number of requested results 4 is greater than number of elements in index..., updating n_results = 1
)
# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
Document(page_content = s, metadata = {id_key: table_ids[i]})
for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))
Once everything is ready, set up a simple RAG pipeline and execute the user's queries:
# Prompt template
template = """Answer the question based only on the following context, which can include text and tables, there is a table in LaTeX format and a table caption in plain text format:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
# LLM
model = ChatOpenAI(temperature = 0, model = "gpt-3.5-turbo")
# Simple RAG pipeline
chain = (
{"context": retriever, "question": RunnablePassthrough()}
| prompt
| model
| StrOutputParser()
)
print(chain.invoke("when layer type is Self-Attention, what is the Complexity per Layer?")) # Query about table 1
print(chain.invoke("Which parser performs worst for BLEU EN-DE")) # Query about table 2
print(chain.invoke("Which parser performs best for WSJ 23 F1")) # Query about table 4
The running results are as follows. These questions have been answered accurately, as shown in Figure 12:
Figure 12: Answers to three user queries. The first row corresponds to table 1, the second row to table 2, and the third row to table 4 in Attention Is All You Need.
The overall code is as follows:
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPEN_AI_KEY"
import subprocess
import uuid
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_core.runnables import RunnablePassthrough
def june_run_nougat(file_path, output_dir):
    # Run Nougat and store results as Mathpix Markdown
    cmd = ["nougat", file_path, "-o", output_dir, "-m", "0.1.0-base", "--no-skipping"]
    res = subprocess.run(cmd)
    if res.returncode != 0:
        print("Error when running nougat.")
        return res.returncode
    else:
        print("Operation Completed!")
        return 0

def june_get_tables_from_mmd(mmd_path):
    # Extract each table (from \begin{table} to \end{table}, plus the caption line
    # that Nougat places immediately after \end{table}) from the Mathpix Markdown file.
    f = open(mmd_path)
    lines = f.readlines()
    res = []
    tmp = []
    flag = ""
    for line in lines:
        if line == "\\begin{table}\n":
            flag = "BEGINTABLE"
        elif line == "\\end{table}\n":
            flag = "ENDTABLE"
        if flag == "BEGINTABLE":
            tmp.append(line)
        elif flag == "ENDTABLE":
            tmp.append(line)
            flag = "CAPTION"
        elif flag == "CAPTION":
            tmp.append(line)
            flag = "MARKDOWN"
            print('-' * 100)
            print(''.join(tmp))
            res.append(''.join(tmp))
            tmp = []
    return res

file_path = "YOUR_PDF_PATH"
output_dir = "YOUR_OUTPUT_DIR_PATH"

if june_run_nougat(file_path, output_dir) != 0:
    import sys
    sys.exit(1)

mmd_path = output_dir + '/' + os.path.splitext(file_path)[0].split('/')[-1] + ".mmd"
tables = june_get_tables_from_mmd(mmd_path)
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. The table is formatted in LaTeX, and its caption is in plain text format: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)
# Summary chain
model = ChatOpenAI(temperature = 0, model = "gpt-3.5-turbo")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()
# Get table summaries
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})
print(table_summaries)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name = "summaries", embedding_function = OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
vectorstore = vectorstore,
docstore = store,
id_key = id_key,
search_kwargs={"k": 1} # Solving Number of requested results 4 is greater than number of elements in index..., updating n_results = 1
)
# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
Document(page_content = s, metadata = {id_key: table_ids[i]})
for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))
# Prompt template
template = """Answer the question based only on the following context, which can include text and tables, there is a table in LaTeX format and a table caption in plain text format:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
# LLM
model = ChatOpenAI(temperature = 0, model = "gpt-3.5-turbo")
# Simple RAG pipeline
chain = (
{"context": retriever, "question": RunnablePassthrough()}
| prompt
| model
| StrOutputParser()
)
print(chain.invoke("when layer type is Self-Attention, what is the Complexity per Layer?")) # Query about table 1
print(chain.invoke("Which parser performs worst for BLEU EN-DE")) # Query about table 2
print(chain.invoke("Which parser performs best for WSJ 23 F1")) # Query about table 4
04 Conclusion
This article discussed the key technologies and existing solutions for table processing in RAG systems and proposed a solution together with its implementation.
In this article, we used Nougat to parse tables. However, we would consider replacing Nougat if a faster, more effective parsing tool becomes available. Our attitude towards tools is to get the idea right first and then find tools to implement it, rather than depending on any particular tool.
In this article, we fed the entire table content to the LLM. In real scenarios, however, a table may exceed the LLM's context length. This can be addressed with efficient chunking methods, such as the sketch below.
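Here is a minimal sketch of one such chunking approach, which splits an oversized LaTeX table by rows and repeats the caption in every chunk; the character limit and the helper function are illustrative assumptions, not part of the implementation above.
# Minimal sketch: split an oversized LaTeX table into row-based chunks,
# repeating the caption in each chunk so every chunk stays self-describing.
# max_chars and the row-splitting heuristic are illustrative assumptions.
def chunk_latex_table(table_body: str, caption: str, max_chars: int = 3000):
    rows = table_body.split("\\\\")            # LaTeX table rows are separated by \\
    chunks, current = [], ""
    for row in rows:
        candidate = current + row + "\\\\"
        if len(candidate) > max_chars and current:
            chunks.append(caption + "\n" + current)
            current = row + "\\\\"
        else:
            current = candidate
    if current:
        chunks.append(caption + "\n" + current)
    return chunks

# Each chunk can then be summarized and indexed separately with the pipeline above.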
Thanks for reading!
——
Florian June
An artificial intelligence researcher who mainly writes about large language models, data structures and algorithms, and NLP.
END
References
[1] https://openai.com/research/gpt-4v-system-card
[2] https://github.com/microsoft/table-transformer
[3] https://unstructured-io.github.io/unstructured/best_practices/table_extraction_pdf.html
[4] https://pub.towardsai.net/advanced-rag-02-unveiling-pdf-parsing-b84ae866344e
[5] https://github.com/facebookresearch/nougat
[6] https://github.com/clovaai/donut/
[7] https://arxiv.org/pdf/1611.00471.pdf
[8] https://aclanthology.org/2020.acl-main.398.pdf
[9] https://arxiv.org/pdf/2305.13062.pdf
[10] https://docs.llamaindex.ai/en/stable/examples/multi_modal/multi_modal_pdf_tables.html
[13] https://openai.com/research/clip
[14] https://openai.com/research/gpt-4v-system-card
[16] https://www.adept.ai/blog/fuyu-8b
[17] https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector
[18] https://arxiv.org/pdf/2308.13418.pdf
[19] https://arxiv.org/pdf/2111.15664.pdf
[21] https://arxiv.org/pdf/1706.03762.pdf
This article was compiled by Baihai IDP with the authorization of the original author. If you need to reprint the translation, please contact us for authorization.
Original link:
https://ai.plainenglish.io/advanced-rag-07-exploring-rag-for-tables-5c3fc0de7af6