Convert the pdf file named "Catalogue of Undergraduate Majors of Ordinary Colleges and Universities.pdf" into a csv file

task details

Convert the PDF file named "Catalogue of Undergraduate Majors of Ordinary Colleges and Universities.pdf" (original file name: 普通高等学校本科专业目录.pdf) into a CSV file. Each page of this PDF is a table with 7 columns.

Below are the first and second pages of the pdf:
[Image: first page of the PDF]

[Image: second page of the PDF]

The PDF of the 2023 undergraduate major catalog of ordinary colleges and universities can be downloaded here:

Link: https://pan.baidu.com/s/14acDd_L7AtXgKyq_LAV5Rg?pwd=hmne Extraction code: hmne

the code

Several Python libraries can do this. The following sample code uses pdfplumber to read the PDF file and pandas to save the data to a CSV file.

Note: This code assumes that the tables in the PDF are clearly delimited by ruling lines. It may not work for every PDF, because the structure and content of a file affect how well it can be parsed.

import pdfplumber
import pandas as pd

def convert_pdf_to_csv(pdf_file):
    # Open the PDF file
    with pdfplumber.open(pdf_file) as pdf:
        # Start with an empty DataFrame that will hold the data from all pages
        df_all = pd.DataFrame()

        # Iterate over every page
        for page in pdf.pages:
            # Extract the tables found on this page
            tables = page.extract_tables()
            for table in tables:
                # Convert the table to a DataFrame, using its first row as column names
                df = pd.DataFrame(table[1:], columns=table[0])
                df_all = pd.concat([df_all, df])

        # Save the combined data as a CSV file
        df_all.to_csv(pdf_file.replace('.pdf', '.csv'), index=False)

# Call the function
convert_pdf_to_csv("普通高等学校本科专业目录.pdf")

The above code will iterate through each page in the PDF and extract the tabular data from each page. The data will be combined and saved into a CSV file. The final CSV file name will be the same as the input PDF file name, but with a ".csv" extension.
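How the pages are combined depends on whether each page repeats the header row, which the sample code does not check. The following is a minimal sketch of one way to handle both cases on the nested lists that extract_tables returns, using only the standard csv module. The function names merge_page_tables and write_csv and the sample rows are hypothetical and are not taken from the original PDF.

```python
import csv
import io

def merge_page_tables(page_tables):
    """Merge per-page tables (lists of rows) into one header and one list of data rows."""
    header = None
    rows = []
    for table in page_tables:
        if not table:
            continue  # nothing was extracted for this table
        if header is None:
            # Assumption: the first row of the first table is the header
            header = table[0]
            body = table[1:]
        elif table[0] == header:
            body = table[1:]  # the header is repeated on this page, so drop it
        else:
            body = table  # no header on this page, so keep every row
        rows.extend(body)
    return (header or []), rows

def write_csv(header, rows, stream):
    """Write the header followed by the data rows to an open text stream."""
    writer = csv.writer(stream)
    writer.writerow(header)
    writer.writerows(rows)

# Hypothetical input: the second page repeats the header, the third does not
pages = [
    [["code", "name"], ["A1", "Major A"]],
    [["code", "name"], ["B1", "Major B"]],
    [["C1", "Major C"]],
]
header, rows = merge_page_tables(pages)
buffer = io.StringIO()
write_csv(header, rows, buffer)
```

In a real run, the list passed to merge_page_tables would be built by collecting the result of page.extract_tables() for every page.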

code analysis

This code runs in a Python environment. It first imports two libraries, pdfplumber and pandas.

  1. import pdfplumber: pdfplumber is a Python library for extracting text, tables and metadata from PDFs.

  2. import pandas as pd: pandas is a Python data analysis library. Here it is used to create and manipulate a DataFrame (a two-dimensional labeled data structure) and to save the DataFrame as a CSV file.

The code then defines a function named convert_pdf_to_csv, which accepts one parameter, pdf_file, the name of the PDF file to convert.

  1. with pdfplumber.open(pdf_file) as pdf: Open the PDF file with pdfplumber's open function. The with statement wraps the file operation in a context so that the file is closed properly once the operation completes.

  2. df_all = pd.DataFrame(): Initialize an empty DataFrame to store the tabular data extracted from each page of the PDF.

  3. for page in pdf.pages: Iterate through each page of the PDF.

  4. tables = page.extract_tables(): Extract the table data from the page with pdfplumber's extract_tables function. It returns a list in which each element is one table on the page. Each table is itself a list of rows.

  5. for table in tables: Process each extracted table in turn.

  6. df = pd.DataFrame(table[1:], columns=table[0]): Convert the table to a DataFrame. The first row of the table is assumed to hold the column names, so table[0] is used as the column names and the remaining rows, table[1:], as the data.

  7. df_all = pd.concat([df_all, df]): Append the current table's DataFrame (df) to the combined DataFrame (df_all).

  8. df_all.to_csv(pdf_file.replace('.pdf', '.csv'), index=False): Save the final DataFrame as a CSV file. The file name is obtained by replacing the '.pdf' part of the original PDF file name with '.csv'. The parameter index=False means the row index is not written to the CSV.
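Because each extracted table is a plain list of rows, it can be tidied before it is turned into a DataFrame. The sketch below assumes that some cells may come back as None or contain line breaks where text wrapped inside a cell. Whether that happens depends on the PDF. The helper names clean_cell and clean_table are hypothetical.

```python
def clean_cell(cell, joiner=""):
    """Return the cell as a single-line string; None becomes an empty string."""
    if cell is None:
        return ""
    # Join wrapped lines; the default joiner "" suits Chinese text,
    # pass joiner=" " for space-separated languages
    return joiner.join(part.strip() for part in str(cell).splitlines())

def clean_table(table, joiner=""):
    """Apply clean_cell to every cell of a table (a list of rows)."""
    return [[clean_cell(cell, joiner) for cell in row] for row in table]

# Hypothetical table in the shape returned by extract_tables
raw = [
    ["专业\n代码", "专业\n名称"],
    ["X1", None],
]
cleaned = clean_table(raw)
```

The cleaned table can then be passed to pd.DataFrame(cleaned[1:], columns=cleaned[0]) exactly as in step 6.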

The last line calls this function to convert the specified PDF file into a CSV file.
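One detail worth noting about the output name: str.replace substitutes every occurrence of '.pdf' and is case-sensitive, so unusual file names can produce surprising results. A minimal alternative sketch uses pathlib to swap only the final extension. The helper name csv_path_for and the sample file names are hypothetical.

```python
from pathlib import Path

def csv_path_for(pdf_file):
    """Return the path of pdf_file with its final extension replaced by .csv."""
    return Path(pdf_file).with_suffix(".csv")

# Hypothetical file names for comparison
simple = str(csv_path_for("catalog.pdf"))
tricky = str(csv_path_for("my.pdf.notes.PDF"))
with_replace = "my.pdf.notes.PDF".replace(".pdf", ".csv")
```

For an ordinary name such as "catalog.pdf" both approaches give the same result, so this only matters when the name contains '.pdf' elsewhere or uses an upper-case extension.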


Origin blog.csdn.net/Waldocsdn/article/details/131660931