Batch-convert PDF to images (JPG/PNG), Word to PDF, and Word to images with Python

This article presents a Python script, built on the PyMuPDF, pdf2image and win32com libraries, that batch-converts Word documents to PDF files and PDF files to PNG images.

1. Install the required libraries and software

Before using the script, install the required Python modules and supporting software: the three libraries PyMuPDF, pdf2image and win32com (provided by the pypiwin32 package), plus Microsoft Word and Poppler. The Python modules can be installed with the following commands:

pip install PyMuPDF 
pip install pdf2image 
pip install pypiwin32

In this script, PyMuPDF is used to read the page count of each PDF file; pdf2image renders the PDF pages as PNG images; pypiwin32 provides the win32com module, which drives Microsoft Word to convert documents. You also need Microsoft Word installed, and the directory containing the Poppler executables must be added to the PATH environment variable.
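
Before running the conversion, it can help to confirm that everything listed above is actually visible to Python. The following is a minimal sketch, not part of the original script: the function name check_environment is hypothetical, and it assumes Poppler is detected through its pdftoppm executable, which pdf2image calls.

```python
import importlib.util
import shutil


def check_environment():
    """Report which of the script's dependencies can be found (hypothetical helper)."""
    # Package name on PyPI -> module name used in the import statement
    modules = {"PyMuPDF": "fitz", "pdf2image": "pdf2image", "pypiwin32": "win32com"}
    report = {
        package: importlib.util.find_spec(module) is not None
        for package, module in modules.items()
    }
    # pdf2image shells out to Poppler, so pdftoppm must be reachable through PATH
    report["Poppler"] = shutil.which("pdftoppm") is not None
    return report


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Anything reported as MISSING should be installed, or added to PATH in the case of Poppler, before continuing.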

2. Load and convert Word documents

Once the required libraries and software are installed, you can start using the script. It has three parts: converting doc files to docx files, converting docx files to PDF files, and converting PDF files to PNG images.

First, we load and convert the Word documents. The code uses os.listdir() to read every file name under the specified path and splits each name into a base name and an extension. If the extension is "doc", it starts Word through the Dispatch() call of the win32com library, opens the file, and saves it in docx format. Finally, it closes the document, quits Word, and waits 3 seconds so the system can release resources.

for i in os.listdir(path):
    # splitext copes with names that contain several dots or none at all
    file_name, file_suffix = os.path.splitext(i)
    if file_suffix.lower() == ".doc":
        word = Dispatch('Word.Application')
        doc = word.Documents.Open(path + i)
        # Save the Word document in docx format
        doc.SaveAs(path + f"{file_name}.docx", FileFormat=12)
        print(i, "converted")
        doc.Close()
        word.Quit()
        sleep(3)
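
A note on splitting file names: unpacking the result of split(".") into two variables only works when a name contains exactly one dot, whereas os.path.splitext() always returns exactly two parts. The sketch below illustrates the difference; split_name is a hypothetical helper, not part of the original script.

```python
import os


def split_name(filename):
    """Return (base name, lower-case extension without the dot)."""
    base, ext = os.path.splitext(filename)
    return base, ext.lower().lstrip(".")


# Names with several dots, no dot, or an upper-case extension are all handled
for name in ["report.doc", "report.v2.DOC", "notes", "archive.tar.gz"]:
    print(name, "->", split_name(name))
```

Only the last dot counts as the start of the extension, so "report.v2.DOC" is still recognised as a doc file.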

Here SaveAs() is called with two arguments: the output file path and the output file format. The FileFormat argument specifies the format of the output file; 12 (wdFormatXMLDocument in Word's WdSaveFormat enumeration) produces a docx file.
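
The numeric FileFormat values are easier to read when given names. The sketch below, under the assumption that only the two formats used in this article are needed, maps target extensions to their WdSaveFormat values; save_as_args and SAVE_FORMATS are hypothetical names introduced for illustration.

```python
import os

# Values from Word's WdSaveFormat enumeration
WD_FORMAT_XML_DOCUMENT = 12  # .docx
WD_FORMAT_PDF = 17           # .pdf

SAVE_FORMATS = {".docx": WD_FORMAT_XML_DOCUMENT, ".pdf": WD_FORMAT_PDF}


def save_as_args(source_path, target_ext):
    """Return (output path, FileFormat value) for a source file and target extension."""
    base, _ = os.path.splitext(source_path)
    return base + target_ext, SAVE_FORMATS[target_ext]


out_path, file_format = save_as_args("C:/Users/ypzhao/Desktop/pdf/demo.doc", ".docx")
print(out_path, file_format)
```

The returned pair could then be passed on as doc.SaveAs(out_path, FileFormat=file_format).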

3. Convert docx files to PDF

Modify the SaveAs() call in the code above to:

        # Save the Word document as a PDF file
        doc.SaveAs(path + f"{file_name}.pdf", FileFormat=17)

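The FileFormat parameter specifies the format of the output file, and 17 (wdFormatPDF) means the output is a PDF file. In the complete code, this loop checks for the "docx" extension instead of "doc" and writes the output with a .pdf extension.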
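
The script starts and quits Word once per file. One possible variation, sketched below and not taken from the original, starts Word once for the whole batch and uses try/finally so that Word is closed even if a file fails. The names docx_to_pdf and make_word are hypothetical; make_word exists only so the Word object can be replaced when experimenting without Word installed.

```python
import os

WD_FORMAT_PDF = 17  # WdSaveFormat value for PDF


def docx_to_pdf(src_dir, dst_dir, make_word=None):
    """Convert every .docx file in src_dir to PDF in dst_dir using one Word instance."""
    if make_word is None:
        # Imported lazily: win32com is only available on Windows
        from win32com.client import Dispatch
        make_word = lambda: Dispatch("Word.Application")

    # Skip the "~$..." lock files that Word creates next to open documents
    names = sorted(
        n for n in os.listdir(src_dir)
        if n.lower().endswith(".docx") and not n.startswith("~$")
    )
    if not names:
        return []

    written = []
    word = make_word()
    try:
        for name in names:
            src = os.path.abspath(os.path.join(src_dir, name))
            dst = os.path.abspath(
                os.path.join(dst_dir, os.path.splitext(name)[0] + ".pdf")
            )
            doc = word.Documents.Open(src)
            try:
                doc.SaveAs(dst, FileFormat=WD_FORMAT_PDF)
            finally:
                doc.Close()
            written.append(dst)
    finally:
        word.Quit()
    return written
```

With the paths from this article, a call would look like docx_to_pdf(path, path_convert). Absolute paths are passed to Word because it does not necessarily share the script's working directory.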

4. Convert PDF files to PNG images

Once the PDF files exist, we can convert them to PNG images. The code opens each PDF file under the specified path with PyMuPDF to get its total number of pages. It then calls convert_from_path() from the pdf2image library, which renders every page as an image, and loops over the result. Each image is saved as a PNG file under the output path, and the conversion progress is printed.

for filename in os.listdir(path):
    if filename.endswith(".pdf"):
        # Get the total number of pages in the current PDF file
        doc = fitz.open(path + filename)
        total_pages = doc.page_count
        doc.close()

        print(f"Converting {filename}, {total_pages} pages in total...")
        for i, page in enumerate(convert_from_path(path + filename, grayscale=False), start=1):
            # Build the output file name
            output_filename = os.path.splitext(filename)[0] + "_" + str(i) + ".png"
            # Save the image
            page.save(path_images + output_filename, "png")
            # Report progress
            print(f"Finished page {i}/{total_pages}")
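
Since PyMuPDF is already installed for the page count, it could also render the pages itself, which would avoid the Poppler dependency. This is an alternative technique, not the one the script uses. In the sketch below, page_image_name and render_pdf are hypothetical names, and the zoom factor of 2.0 (about 144 dpi) is an arbitrary assumption.

```python
import os


def page_image_name(pdf_name, page_no, total_pages):
    """Build e.g. 'report_03.png', zero-padded so files sort in page order."""
    width = len(str(total_pages))
    stem = os.path.splitext(os.path.basename(pdf_name))[0]
    return f"{stem}_{page_no:0{width}d}.png"


def render_pdf(pdf_path, out_dir, zoom=2.0):
    """Render every page of pdf_path to a PNG file in out_dir with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so the naming helper works without it

    os.makedirs(out_dir, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        total = doc.page_count
        # PDF pages are 72 dpi by default; the matrix scales both axes by `zoom`
        matrix = fitz.Matrix(zoom, zoom)
        for number, page in enumerate(doc, start=1):
            pix = page.get_pixmap(matrix=matrix)
            target = os.path.join(out_dir, page_image_name(pdf_path, number, total))
            pix.save(target)
            written.append(target)
    return written
```

The zero-padded page number keeps "report_02.png" ahead of "report_10.png" in a file listing, which the plain str(i) suffix in the script does not guarantee.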

5. Complete code

# -*- coding: utf-8 -*-
"""
Created on Wed May 31 17:10:27 2023

@author: ypzhao
"""

import os
import fitz
from pdf2image import convert_from_path
from time import sleep
from win32com.client import Dispatch

# Define the input and output paths
# Folder containing the files to convert
path = "C:/Users/ypzhao/Desktop/pdf/"
# Folder for the PDF files converted from doc/docx
path_convert = "C:/Users/ypzhao/Desktop/pdf/"
# Folder for the output images
path_images = "C:/Users/ypzhao/Desktop/images/"
print("-----Converting doc to docx-----")

for i in os.listdir(path):
    # splitext copes with names that contain several dots or none at all
    file_name, file_suffix = os.path.splitext(i)
    if file_suffix.lower() == ".doc":
        word = Dispatch('Word.Application')
        doc = word.Documents.Open(path + i)
        # FileFormat=12 saves as docx
        doc.SaveAs(path + f"{file_name}.docx", FileFormat=12)
        print(i, "converted")
        doc.Close()
        word.Quit()
        sleep(3)

print("-----Converting docx to pdf-----")
for i in os.listdir(path):
    # splitext copes with names that contain several dots or none at all
    file_name, file_suffix = os.path.splitext(i)
    if file_suffix.lower() == ".docx":
        word = Dispatch('Word.Application')
        doc = word.Documents.Open(path + i)
        # FileFormat=17 saves as PDF
        doc.SaveAs(path_convert + f"{file_name}.pdf", FileFormat=17)
        print(i, "...converted")
        doc.Close()
        word.Quit()
        sleep(3)


# Loop over the PDF files and convert each one to images
for filename in os.listdir(path):
    if filename.endswith(".pdf"):
        # Get the total number of pages in the current PDF file
        doc = fitz.open(path + filename)
        total_pages = doc.page_count
        doc.close()

        print(f"Converting {filename}, {total_pages} pages in total...")
        # Pages are rendered in color, as in the original PDF
        # For grayscale images, change grayscale=False to grayscale=True
        for i, page in enumerate(convert_from_path(path + filename, grayscale=False), start=1):
            # Build the output file name
            output_filename = os.path.splitext(filename)[0] + "_" + str(i) + ".png"
            # Save the image
            page.save(path_images + output_filename, "png")
            # Report progress
            print(f"Finished page {i}/{total_pages}")

print("-----All conversions complete-----")
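
The complete script assumes that the folder in path_images already exists, and saving a page will fail if it does not. A small preparation step along the following lines could be run first. It is a sketch rather than part of the original script: the helper name prepare is hypothetical, and the parameters mirror the variables of the same names above.

```python
import os


def prepare(path, path_images):
    """Create the image folder if needed and return the PDF file names found in path."""
    os.makedirs(path_images, exist_ok=True)
    return sorted(
        name for name in os.listdir(path)
        if name.lower().endswith(".pdf") and os.path.isfile(os.path.join(path, name))
    )
```

The returned list could replace the os.listdir(path) call in the final loop, which also makes the processing order predictable.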


Origin blog.csdn.net/m0_58857684/article/details/130974124