20230507 Use python3 to batch convert DOCX documents to TXT

2023/5/7 20:22

Environment: Windows 10, Python 3.11

# -*- coding: gbk -*-
import os
from pdf2docx import Converter  # not used below; left over from a PDF script and safe to remove
from win32com import client as wc
# win32com is provided by the pywin32 package (pip install pypiwin32)

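# Convert one DOCX file to TXT by driving Microsoft Word through COM
def DocxToTxt(inputFinallyPath, outputFinallyPath):
    wordhandle = wc.Dispatch("Word.Application")
    wordhandle.Visible = 0  # Run in the background, do not display
    wordhandle.DisplayAlerts = 0  # Do not warn
    doc = wordhandle.Documents.Open(inputFinallyPath)
    # FileFormat codes: 4 = plain text (wdFormatDOSText), 10 = filtered HTML, 16 = docx, 17 = pdf
    doc.SaveAs(outputFinallyPath, 4)
    doc.Close()  # the parentheses are required; a bare doc.Close only references the method
    wordhandle.Quit()  # close the background Word instance so it does not linger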


if __name__ == '__main__':
    # Input path (use an absolute path; Word resolves relative paths unpredictably)
    inputPath = r'D:\pythonproject\pdf_to_txt\input'
    # Output path, preferably an absolute path
    outputPath = r'D:\pythonproject\pdf_to_txt\output'

    # List the files in the folder
    docxList = os.listdir(inputPath)
    # Convert the files one by one and count them
    docx_num = 1
    for li in docxList:
        if not li.lower().endswith('.docx'):
            continue  # skip anything that is not a DOCX file
        print(li)
        inputFinallyPath = os.path.join(inputPath, li)
        li = li.replace('.docx', '.txt')
        outputFinallyPath = os.path.join(outputPath, li)
        DocxToTxt(inputFinallyPath, outputFinallyPath)
        print('%d docx has been converted to txt' % docx_num)
        docx_num = docx_num + 1
    print('A total of %d docx articles have been completely converted to txt' % (docx_num - 1))
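
The script starts Word once per file. A minimal sketch of an alternative, assuming the same folder layout: first build the list of (input, output) pairs with pathlib, then reuse a single Word instance for the whole batch. The names `plan_conversions`, `convert_all` and `WD_FORMAT_DOS_TEXT` are illustrative and not part of the original script. `win32com` is imported lazily, so the planning step can be tried on a machine without Word.

```python
from pathlib import Path

# Word FileFormat code used by the original script (4 = wdFormatDOSText, plain text).
WD_FORMAT_DOS_TEXT = 4


def plan_conversions(input_dir, output_dir):
    """Return sorted (docx_path, txt_path) pairs for every .docx file in input_dir.

    Word lock files (names starting with '~$') and non-DOCX files are skipped.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    pairs = []
    for src in sorted(input_dir.iterdir()):
        if not src.is_file() or src.suffix.lower() != ".docx":
            continue
        if src.name.startswith("~$"):
            continue
        # with_suffix swaps only the final extension: 'a.google.docx' -> 'a.google.txt'
        pairs.append((src, output_dir / src.with_suffix(".txt").name))
    return pairs


def convert_all(input_dir, output_dir):
    """Convert every planned pair with one background Word instance (Windows only)."""
    from win32com import client as wc  # needs pywin32 and an installed Microsoft Word

    pairs = plan_conversions(input_dir, output_dir)
    word = wc.Dispatch("Word.Application")
    word.Visible = 0
    word.DisplayAlerts = 0
    try:
        for number, (src, dst) in enumerate(pairs, start=1):
            doc = word.Documents.Open(str(src.resolve()))
            doc.SaveAs(str(dst.resolve()), WD_FORMAT_DOS_TEXT)
            doc.Close()
            print('%d docx has been converted to txt' % number)
    finally:
        word.Quit()  # always release the Word process, even after an error
    return len(pairs)
```

Calling `convert_all(r'D:\pythonproject\pdf_to_txt\input', r'D:\pythonproject\pdf_to_txt\output')` would then replace the loop in the main block. Keeping the path planning separate from the COM calls makes it possible to check the file list before Word is involved.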

 


Background: Google Translate was used to translate 88 Japanese DOCX subtitle files into Simplified Chinese.
Microsoft Windows [Version 10.0.19044.2728]
(c) Microsoft Corporation. all rights reserved.

C:\Users\QQ>python3

C:\Users\QQ>python

C:\Users\QQ>python
Python 3.11.3 (tags/v3.11.3:f3909b8, Apr  4 2023, 23:49:59) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> from pdf2docx import Converter
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pdf2docx'
>>>
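
The traceback above only reports the first missing module. As a rough sketch, all third-party imports the script needs can be checked in one go with the standard library before running it. The helper name `missing_modules` is made up for this illustration, and it only handles top-level module names.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Top-level import names used by the script in this post.
    required = ["os", "pdf2docx", "win32com"]
    print("missing:", missing_modules(required))
```

Any name printed in the list still has to be installed with pip. Note that the pip package name can differ from the import name, as with `win32com`.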

 


Microsoft Windows [Version 10.0.19044.2728]
(c) Microsoft Corporation. all rights reserved.

C:\Users\QQ>pip install pdf2docx
Collecting pdf2docx
  Downloading pdf2docx-0.5.6-py3-none-any.whl (148 kB)
     ---------------------------------------- 148.4/148.4 kB 368.3 kB/s eta 0:00:00
Collecting PyMuPDF>=1.19.0
  Downloading PyMuPDF-1.22.2-cp311-cp311-win_amd64.whl (11.7 MB)
     ---------------------------------------- 11.7/11.7 MB 12.8 MB/s eta 0:00:00
Collecting python-docx>=0.8.10
  Downloading python-docx-0.8.11.tar.gz (5.6 MB)
     ---------------------------------------- 5.6/5.6 MB 1.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting fonttools>=4.24.0
  Downloading fonttools-4.39.3-py3-none-any.whl (1.0 MB)
     ---------------------------------------- 1.0/1.0 MB 12.8 MB/s eta 0:00:00
Collecting numpy>=1.17.2
  Downloading numpy-1.24.3-cp311-cp311-win_amd64.whl (14.8 MB)
     ---------------------------------------- 14.8/14.8 MB 21.1 MB/s eta 0:00:00
Collecting opencv-python>=4.5
  Downloading opencv_python-4.7.0.72-cp37-abi3-win_amd64.whl (38.2 MB)
     ---------------------------------------- 38.2/38.2 MB 12.6 MB/s eta 0:00:00
Collecting fire>=0.3.0
  Downloading fire-0.5.0.tar.gz (88 kB)
     ---------------------------------------- 88.3/88.3 kB 4.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting six
  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting termcolor
  Downloading termcolor-2.3.0-py3-none-any.whl (6.9 kB)
Collecting lxml>=2.3.2
  Downloading lxml-4.9.2-cp311-cp311-win_amd64.whl (3.8 MB)
     ---------------------------------------- 3.8/3.8 MB 10.0 MB/s eta 0:00:00
Installing collected packages: termcolor, six, PyMuPDF, numpy, lxml, fonttools, python-docx, opencv-python, fire, pdf2docx
  WARNING: The script f2py.exe is installed in 'C:\Users\QQ\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The scripts fonttools.exe, pyftmerge.exe, pyftsubset.exe and ttx.exe are installed in 'C:\Users\QQ\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  DEPRECATION: python-docx is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
  Running setup.py install for python-docx ... done
  DEPRECATION: fire is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
  Running setup.py install for fire ... done
  WARNING: The script pdf2docx.exe is installed in 'C:\Users\QQ\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed PyMuPDF-1.22.2 fire-0.5.0 fonttools-4.39.3 lxml-4.9.2 numpy-1.24.3 opencv-python-4.7.0.72 pdf2docx-0.5.6 python-docx-0.8.11 six-1.16.0 termcolor-2.3.0

[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: C:\Users\QQ\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip

C:\Users\QQ>

 

 


Microsoft Windows [Version 10.0.19044.2728]
(c) Microsoft Corporation. all rights reserved.

C:\Users\QQ>pip install win32com
ERROR: Could not find a version that satisfies the requirement win32com (from versions: none)
ERROR: No matching distribution found for win32com

[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: C:\Users\QQ\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip

C:\Users\QQ>
C:\Users\QQ>pip install pypwin32
ERROR: Could not find a version that satisfies the requirement pypwin32 (from versions: none)
ERROR: No matching distribution found for pypwin32

[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: C:\Users\QQ\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip

C:\Users\QQ>
C:\Users\QQ>pip install  pypiwin32
Collecting pypiwin32
  Downloading pypiwin32-223-py3-none-any.whl (1.7 kB)
Collecting pywin32>=223
  Downloading pywin32-306-cp311-cp311-win_amd64.whl (9.2 MB)
     ---------------------------------------- 9.2/9.2 MB 895.2 kB/s eta 0:00:00
Installing collected packages: pywin32, pypiwin32
Successfully installed pypiwin32-223 pywin32-306

[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: C:\Users\QQ\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip

C:\Users\QQ>
C:\Users\QQ>

 

 

 


Microsoft Windows [Version 10.0.19044.2728]
(c) Microsoft Corporation. all rights reserved.

C:\Users\QQ>d:

D:\>dir *.pty
 Volume in drive D is DATA
 Volume Serial Number is 547F-1046

 Directory of D:\

File Not Found

D:\>dir *.py
 Volume in drive D is DATA
 Volume Serial Number is 547F-1046

 Directory of D:\

2023/05/07  19:55             1,221 pdf2doc2.py
               1 File(s)          1,221 bytes
               0 Dir(s)  195,912,142,848 bytes free

D:\>python pdf2doc2.py
SyntaxError: Non-UTF-8 code starting with '\xd5' in file D:\pdf2doc2.py on line 4, but no encoding declared; see https://peps.python.org/pep-0263/ for details

D:\>
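
This error means the file contains bytes that are not valid UTF-8 and has no encoding declaration. One fix is to put `# -*- coding: gbk -*-` on the first or second line of the script. Another is to re-save the file as UTF-8, which is Python 3's default source encoding. A minimal sketch of the second option, assuming the file really is GBK-encoded (the function name `reencode_to_utf8` is invented for this example):

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="gbk"):
    """Rewrite a text file as UTF-8, assuming it is currently in source_encoding.

    Returns False and leaves the file untouched if it already decodes as UTF-8.
    """
    path = Path(path)
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        pass
    text = raw.decode(source_encoding)  # raises UnicodeDecodeError if the guess is wrong
    path.write_bytes(text.encode("utf-8"))
    return True
```

After a call such as `reencode_to_utf8(r'D:\pdf2doc2.py')`, any existing `coding: gbk` line should be removed or changed to `utf-8`, otherwise the declaration no longer matches the file. Keeping a backup copy before rewriting is sensible.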


D:\>python pdf2doc2.py
  File "D:\pdf2doc2.py", line 36
    print('A total of %d docx articles have been completely converted to txt' pdf_num-1))
                                           ^
SyntaxError: unmatched ')'
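
The failing line is missing the `%` operator and the opening parenthesis between the format string and `pdf_num-1`. A short sketch of the corrected expression, with an f-string equivalent for comparison (the function names are only for illustration):

```python
def summary(pdf_num):
    """Build the final message; pdf_num is one past the last converted file, as in the script."""
    # The % operator and the opening parenthesis are what the failing line lacked.
    return 'A total of %d docx articles have been completely converted to txt' % (pdf_num - 1)


def summary_fstring(pdf_num):
    """Same message written as an f-string (Python 3.6+)."""
    return f'A total of {pdf_num - 1} docx articles have been completely converted to txt'


if __name__ == "__main__":
    print(summary(89))
```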

D:\>python pdf2doc2.py
MIDE-599.google.docx
1 docx has been converted to txt
OAE-101.google.docx
2 docx has been converted to txt
OAE-165.google.docx
3 docx has been converted to txt
OFJE-139 1.google.docx
4 docx has been converted to txt
OFJE-139 2.google.docx
5 docx has been converted to txt
OFJE-189.google.docx
6 docx has been converted to txt
OFJE-236.google.docx
7 docx has been converted to txt
pSSNI-473.google.docx
8 docx has been converted to txt
SIVR-001.google.docx
9 docx has been converted to txt
SIVR-002.google.docx
10 docx has been converted to txt
SIVR-003.google.docx
11 docx has been converted to txt
SIVR-012 1.google.docx
12 docx has been converted to txt
SIVR-012 2.google.docx
13 docx has been converted to txt
SIVR-015 1.google.docx
14 docx has been converted to txt
SIVR-015 2.google.docx
15 docx has been converted to txt
SIVR-016 1.google.docx
16 docx has been converted to txt
SIVR-016 2.google.docx
17 docx has been converted to txt
SIVR-017 1.google.docx
18 docx has been converted to txt
SIVR-017 2.google.docx
19 docx has been converted to txt
SIVR-017 3.google.docx
20 docx has been converted to txt
SIVR-033 1.google.docx
21 docx has been converted to txt
SIVR-033 2.google.docx
22 docx has been converted to txt
SIVR-033 3.google.docx
23 docx has been converted to txt
SIVR-033 4.google.docx
24 docx has been converted to txt
SIVR-033 5.google.docx
25 docx has been converted to txt
SIVR-033 6.google.docx
26 docx has been converted to txt
SIVR-034 1.google.docx
27 docx has been converted to txt
SIVR-034 2.google.docx
28 docx has been converted to txt
SIVR-034 3.google.docx
29 docx has been converted to txt
SIVR-044 1.google.docx
30 docx has been converted to txt
SIVR-044 2.google.docx
31 docx has been converted to txt
SIVR-061 1.google.docx
32 docx has been converted to txt
SIVR-061 2.google.docx
33 docx has been converted to txt
SIVR-061 3.google.docx
34 docx has been converted to txt
SIVR-061 4.google.docx
35 docx has been converted to txt
SIVR-067 1.google.docx
36 docx has been converted to txt
SIVR-067 2.google.docx
37 docx has been converted to txt
SIVR-067 3.google.docx
38 docx has been converted to txt
SNIS-786.google.docx
39 docx has been converted to txt
SNIS-800.google.docx
40 docx has been converted to txt
SNIS-850 1.google.docx
41 docx has been converted to txt
SNIS-850 2.google.docx
42 docx has been converted to txt
SNIS-872.google.docx
43 docx has been converted to txt
SNIS-896.google.docx
44 docx has been converted to txt
SNIS-919.google.docx
45 docx has been converted to txt
SNIS-964.google.docx
46 docx has been converted to txt
SNIS-964.google2.docx
47 docx has been converted to txt
SNIS-986.google.docx
48 docx has been converted to txt
SSNI-009.google.docx
49 docx has been converted to txt
SSNI-030.google.docx
50 docx has been converted to txt
SSNI-054.google.docx
51 docx has been converted to txt
SSNI-077.google.docx
52 docx has been converted to txt
SSNI-101.google.docx
53 docx has been converted to txt
SSNI-127.google.docx
54 docx has been converted to txt
SSNI-152.google.docx
55 docx has been converted to txt
SSNI-178.google.docx
56 docx has been converted to txt
SSNI-205.google.docx
57 docx has been converted to txt
SSNI-229.google.docx
58 docx has been converted to txt
SSNI-254.google.docx
59 docx has been converted to txt
SSNI-279.google.docx
60 docx has been converted to txt
SSNI-301.google.docx
61 docx has been converted to txt
SSNI-322.google.docx
62 docx has been converted to txt
SSNI-344.google.docx
63 docx has been converted to txt
SSNI-388.google.docx
64 docx has been converted to txt
SSNI-409.google.docx
65 docx has been converted to txt
SSNI-432.google.docx
66 docx has been converted to txt
SSNI-452.google.docx
67 docx has been converted to txt
SSNI-473.google.docx
68 docx has been converted to txt
SSNI-493.google.docx
69 docx has been converted to txt
SSNI-516.google.docx
70 docx has been converted to txt
SSNI-542.google.docx
71 docx has been converted to txt
SSNI-566.google.docx
72 docx has been converted to txt
SSNI-589.google.docx
73 docx has been converted to txt
SSNI-618.google.docx
74 docx has been converted to txt
SSNI-644.google.docx
75 docx has been converted to txt
SSNI-674.google.docx
76 docx has been converted to txt
SSNI-703.google.docx
77 docx has been converted to txt
SSNI-730.google.docx
78 docx has been converted to txt
TEK-067.google.docx
79 docx has been converted to txt
TEK-071.google.docx
80 docx has been converted to txt
TEK-072.google.docx
81 docx has been converted to txt
TEK-073.google.docx
82 docx has been converted to txt
TEK-076.google.docx
83 docx has been converted to txt
TEK-079 Audio only.google.docx
84 docx has been converted to txt
TEK-080.google.docx
85 docx has been converted to txt
TEK-081 Audio only.google.docx
86 docx has been converted to txt
TEK-083 Audio only.google.docx
87 docx has been converted to txt
TEK-097.google.docx
88 docx has been converted to txt

D:\>
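
With 88 files in the batch, it may be worth confirming that every DOCX file actually produced a TXT file. A rough sketch of such a check, assuming the same input and output folders as the script (the helper name `missing_outputs` is hypothetical). It treats an empty TXT file as missing, which is a design choice for this example and not something the original script does.

```python
from pathlib import Path


def missing_outputs(input_dir, output_dir):
    """Return names of .docx files in input_dir whose .txt in output_dir is missing or empty."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    missing = []
    for src in sorted(input_dir.glob("*.docx")):
        expected = output_dir / src.with_suffix(".txt").name
        if not expected.is_file() or expected.stat().st_size == 0:
            missing.append(src.name)
    return missing


if __name__ == "__main__":
    # Folder paths taken from the script in this post.
    print(missing_outputs(r'D:\pythonproject\pdf_to_txt\input',
                          r'D:\pythonproject\pdf_to_txt\output'))
```

An empty list means every input file has a non-empty counterpart in the output folder.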


Reference:
python batch convert DOCX TXT


https://blog.csdn.net/weixin_46255747/article/details/129961988
python implements batch docx to txt


ModuleNotFoundError: No module named 'pdf2docx'


python win32com pip install


https://blog.csdn.net/qq_45662588/article/details/130315080
The solution to installing win32com library in python3.9


https://blog.csdn.net/longe20111104/article/details/129754624
pip install win32com error solution
pip install pypiwin32


SyntaxError: Non-UTF-8 code starting with '\xd5' in file D:\pdf2doc2.py on line 4, but no encoding d


https://blog.csdn.net/coco_apple/article/details/113437552
SyntaxError: Non-UTF-8 code starting with ‘\xd5‘ in file
# -*- coding: gbk -*-

 

 

 

 


Origin blog.csdn.net/wb4916/article/details/130547425