Task details
Convert the PDF file named "普通高等学校本科专业目录.pdf" (General Institutions of Higher Education Undergraduate Major Catalog) into a CSV file. Each page of this PDF is a table with 7 columns.
Below are the first and second pages of the pdf:
Download link for the PDF of the 2023 undergraduate major catalog of general institutions of higher education:
Link: https://pan.baidu.com/s/14acDd_L7AtXgKyq_LAV5Rg?pwd=hmne Extraction code: hmne
Code
Several Python libraries can do this. The following sample code uses pdfplumber to read the PDF file and pandas to save the data to a CSV file.
Note: This code assumes that the tables in the PDF are clearly delimited by ruling lines. It may not work for every PDF file, because the structure and content of a PDF affect how it is parsed.
import pdfplumber
import pandas as pd

def convert_pdf_to_csv(pdf_file):
    # Open the PDF file
    with pdfplumber.open(pdf_file) as pdf:
        # Initialize an empty DataFrame to hold the data from all pages
        df_all = pd.DataFrame()
        # Iterate over every page
        for page in pdf.pages:
            # Extract the tables on this page
            tables = page.extract_tables()
            for table in tables:
                # Convert the table to a DataFrame, using its first row as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                df_all = pd.concat([df_all, df])
    # Save as a CSV file
    df_all.to_csv(pdf_file.replace('.pdf', '.csv'), index=False)

# Call the function
convert_pdf_to_csv("普通高等学校本科专业目录.pdf")
The above code will iterate through each page in the PDF and extract the tabular data from each page. The data will be combined and saved into a CSV file. The final CSV file name will be the same as the input PDF file name, but with a ".csv" extension.
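The note above warns that extraction depends on the tables being drawn with lines. As a minimal sketch of one way to cope when they are not, the helper below first tries pdfplumber's default line-based detection and, if nothing is found, retries with the text-alignment strategies. The function name extract_tables_with_fallback and the constant TEXT_SETTINGS are hypothetical names introduced for illustration, and whether the text strategy gives usable results for a particular PDF is an assumption that needs checking against the actual file.

```python
# Table settings that infer cell boundaries from text alignment
# instead of ruling lines (assumption: suitable for the PDF at hand).
TEXT_SETTINGS = {"vertical_strategy": "text", "horizontal_strategy": "text"}


def extract_tables_with_fallback(page):
    """Try line-based table detection first, then fall back to text alignment.

    `page` is expected to be a pdfplumber page object; any object exposing
    extract_tables(table_settings=...) works.
    """
    tables = page.extract_tables()
    if tables:
        return tables
    # No ruled tables were found: retry with the text-based strategies.
    return page.extract_tables(table_settings=TEXT_SETTINGS)
```

In the conversion function, this helper could stand in for the direct page.extract_tables() call inside the page loop.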
Code analysis
This code runs in a Python environment. First, it imports two libraries: pdfplumber and pandas.
- import pdfplumber: pdfplumber is a Python library for extracting text, tables and metadata from PDFs.
- import pandas as pd: pandas is a Python data analysis library. Here it is mainly used to create and manipulate a DataFrame (a two-dimensional labeled data structure) and to save the DataFrame as a CSV file.
Next, a function named convert_pdf_to_csv is defined. It accepts one parameter, pdf_file, which is the name of the PDF file to be converted.
- with pdfplumber.open(pdf_file) as pdf: Uses pdfplumber's open function to open the PDF file. The with statement wraps the file operation in a context manager, so the file is properly closed once the operation completes.
- df_all = pd.DataFrame(): Initializes an empty DataFrame to store the tabular data extracted from each page of the PDF.
- for page in pdf.pages: Iterates through each page of the PDF.
- tables = page.extract_tables(): Uses pdfplumber's extract_tables method to extract the tables from the page. It returns a list in which each element represents one table on the page; each table is itself a list of rows.
- for table in tables: Loops over each extracted table.
- df = pd.DataFrame(table[1:], columns=table[0]): Converts the table to a DataFrame. The first row of the table is assumed to hold the column names, so table[0] is used as the column names and the remaining rows, table[1:], as the data.
- df_all = pd.concat([df_all, df]): Appends the current table's DataFrame (df) to the cumulative DataFrame (df_all).
- df_all.to_csv(pdf_file.replace('.pdf', '.csv'), index=False): Saves the final DataFrame as a CSV file. The file name is obtained by replacing '.pdf' in the original PDF file name with '.csv'. The parameter index=False means the index is not written to the CSV.
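One detail of the file-name step is worth illustrating: str.replace('.pdf', '.csv') rewrites every occurrence of '.pdf' in the name and leaves an uppercase '.PDF' untouched. A small sketch using the standard library's pathlib, which swaps only the final extension, is shown below. The helper name csv_path_for is hypothetical and is not part of the original code.

```python
from pathlib import Path


def csv_path_for(pdf_file):
    """Return the path of pdf_file with only its final extension replaced by .csv."""
    return Path(pdf_file).with_suffix(".csv")


# str.replace touches every '.pdf' in the name, not just the extension.
naive = "report.pdf.backup.pdf".replace(".pdf", ".csv")
safer = csv_path_for("report.pdf.backup.pdf").name
```

Since to_csv accepts a path object, the result could be passed directly, as in df_all.to_csv(csv_path_for(pdf_file), index=False).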
The last line calls this function to convert the specified PDF file into a CSV file.
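The analysis above relies on the first row of every extracted table being the column names. If a page continues the table without repeating the header, its first data row would be taken as column names, and pd.concat would then align it under different columns. The following is a rough sketch, under the assumption that the first row of the first table is the real header, that merges the per-page row lists and drops any later row identical to that header. The function name merge_page_tables is hypothetical, and the sample rows in the usage example are made up for illustration only.

```python
def merge_page_tables(tables):
    """Merge per-page tables (lists of rows) into one list with a single header.

    Assumes the first row of the first table is the header. Rows on later
    pages that are identical to it are treated as repeated headers and dropped.
    """
    merged = []
    header = None
    for table in tables:
        for row in table:
            if header is None:
                # The very first row seen becomes the header.
                header = row
                merged.append(row)
            elif row != header:
                merged.append(row)
    return merged


# Made-up sample: page 2 repeats the header, page 3 does not have one.
pages = [
    [["code", "name"], ["01", "哲学"]],
    [["code", "name"], ["02", "经济学"]],
    [["03", "法学"]],
]
rows = merge_page_tables(pages)
```

The merged rows can then be turned into a single DataFrame with pd.DataFrame(rows[1:], columns=rows[0]), in the same way the original code handles one table.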