6 Practical, Easy-to-Use Python Libraries for Processing Special Text Formats

The following libraries, written in Python, parse and manipulate special text formats. Hopefully you will find them useful.

1、Tablib

Tablib is a Python library for working with tabular data. It lets you import, export, and manipulate tabular datasets, and its advanced features include slicing, dynamic columns, labels and filtering, and formatted import and export.

Tablib can import and export Excel, JSON, YAML, HTML, TSV, and CSV. XML is not currently supported.

>>> import tablib
>>> data = tablib.Dataset(headers=['First Name', 'Last Name', 'Age'])
>>> for i in [('Kenneth', 'Reitz', 22), ('Bessie', 'Monke', 21)]:
...     data.append(i)


>>> print(data.export('json'))
[{"Last Name": "Reitz", "First Name": "Kenneth", "Age": 22}, {"Last Name": "Monke", "First Name": "Bessie", "Age": 21}]

>>> print(data.export('yaml'))
- {Age: 22, First Name: Kenneth, Last Name: Reitz}
- {Age: 21, First Name: Bessie, Last Name: Monke}

>>> data.export('xlsx')
<censored binary data>

>>> data.export('df')
  First Name Last Name  Age
0    Kenneth     Reitz   22
1     Bessie     Monke   21
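
To make the idea of "formatted export" concrete, here is a minimal sketch that uses only the Python standard library rather than Tablib. It is not how Tablib is implemented. The helper names `to_json` and `to_csv` are hypothetical, and the sample rows simply mirror the dataset above.

```python
import csv
import io
import json

# Sample data mirroring the Dataset built in the example above.
headers = ["First Name", "Last Name", "Age"]
rows = [("Kenneth", "Reitz", 22), ("Bessie", "Monke", 21)]


def to_json(headers, rows):
    """Serialize the rows as a JSON list of objects keyed by header (hypothetical helper)."""
    return json.dumps([dict(zip(headers, row)) for row in rows])


def to_csv(headers, rows):
    """Serialize the header row followed by the data rows as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return buffer.getvalue()


json_text = to_json(headers, rows)
csv_text = to_csv(headers, rows)
print(json_text)
print(csv_text)
```

A library such as Tablib wraps this kind of per-format serialization behind a single export call, which is what makes switching output formats a one-word change.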

2、Openpyxl

Openpyxl is a Python library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.

Openpyxl was created so that Python could natively read and write the Office Open XML format. It was originally based on PHPExcel.

from openpyxl import Workbook
wb = Workbook()

# grab the active worksheet
ws = wb.active

# Data can be assigned directly to cells
ws['A1'] = 42

# Rows can also be appended
ws.append([1, 2, 3])

# Python types will automatically be converted
import datetime
ws['A2'] = datetime.datetime.now()

# Save the file
wb.save("sample.xlsx")

3、unoconv

unoconv, short for Universal Office Converter, is a command-line tool that converts between any file formats supported by LibreOffice/OpenOffice.

unoconv supports batch conversion of documents, and can also be combined with asciidoc and docbook2odf/xhtml2odt to create PDF or Word (.doc) files.

[dag@moria cv]$ make odt pdf html doc
rm -f *.{odt,pdf,html,doc}
asciidoc -b docbook -d article -o resume.xml resume.txt
docbook2odf -f --params generate.meta=0 -o resume.tmp.odt resume.xml
Saved resume.tmp.odt
unoconv -f odt -t template.ott -o resume.odt resume.tmp.odt
unoconv -f pdf -t template.ott -o resume.pdf resume.odt
unoconv -f html -t template.ott -o resume.html resume.odt
unoconv -f doc -t template.ott -o resume.doc resume.odt
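
Batch conversion can also be driven from Python. The sketch below is an assumption-laden illustration: it builds one command per target format using only the `-f` and `-o` options seen in the session above, and derives each output name from the input file. The helper names `plan_conversions` and `run_conversions` are hypothetical, and the commands are only printed here, not executed.

```python
from pathlib import Path
import shutil
import subprocess


def plan_conversions(input_path, formats):
    """Build one unoconv argv list per target format (-f format, -o output file)."""
    source = Path(input_path)
    return [
        ["unoconv", "-f", fmt, "-o", str(source.with_suffix("." + fmt)), str(source)]
        for fmt in formats
    ]


def run_conversions(commands):
    """Run the planned commands; requires unoconv and LibreOffice to be installed."""
    if shutil.which("unoconv") is None:
        raise RuntimeError("unoconv was not found on PATH")
    for command in commands:
        subprocess.run(command, check=True)


commands = plan_conversions("resume.odt", ["pdf", "html", "doc"])
for command in commands:
    print(" ".join(command))
# Call run_conversions(commands) once resume.odt exists and unoconv is installed.
```

Keeping the planning step separate from the execution step makes it easy to inspect or log the commands before anything is run.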

4、PyPDF2

PyPDF2 is a pure Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files.

PyPDF2 can retrieve text and metadata from PDFs, as well as merge entire files together.

# Note: this example uses the legacy camelCase API of PyPDF2 releases before 3.0
from PyPDF2 import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(open("document1.pdf", "rb"))

# print how many pages input1 has:
print("document1.pdf has %d pages." % input1.getNumPages())

# add page 1 from input1 to output document, unchanged
output.addPage(input1.getPage(0))

# add page 2 from input1, but rotated clockwise 90 degrees
output.addPage(input1.getPage(1).rotateClockwise(90))

# add page 3 from input1, rotated the other way:
output.addPage(input1.getPage(2).rotateCounterClockwise(90))
# alt: output.addPage(input1.getPage(2).rotateClockwise(270))

# add page 4 from input1, but first add a watermark from another PDF:
page4 = input1.getPage(3)
watermark = PdfFileReader(open("watermark.pdf", "rb"))
page4.mergePage(watermark.getPage(0))
output.addPage(page4)


# add page 5 from input1, but crop it to half size:
page5 = input1.getPage(4)
page5.mediaBox.upperRight = (
    page5.mediaBox.getUpperRight_x() / 2,
    page5.mediaBox.getUpperRight_y() / 2
)
output.addPage(page5)

# add some JavaScript to launch the print window on opening this PDF.
# the password dialog may prevent the print dialog from being shown;
# if that's the case, comment out the encryption lines to try this out
output.addJS("this.print({bUI:true,bSilent:false,bShrinkToFit:true});")

# encrypt your new PDF and add a password
password = "secret"
output.encrypt(password)

# finally, write "output" to PyPDF2-output.pdf
with open("PyPDF2-output.pdf", "wb") as outputStream:
    output.write(outputStream)

5、Mistune

Mistune is a full-featured Markdown parser implemented in pure Python, with support for tables, comments, code blocks, and more.

According to its benchmark results, Mistune claims to be the fastest of all pure Python Markdown parsers. It was designed with modularity in mind and provides a clear, easy-to-use, extensible API.

import mistune

mistune.markdown('I am using **mistune markdown parser**')
# output: <p>I am using <strong>mistune markdown parser</strong></p>

6、csvkit

csvkit is billed as the Swiss Army knife for working with CSV files. It bundles utilities such as csvlook, csvcut, and csvsql.

csvkit is a command-line tool inspired by pdftk, gdal, and other similar tools.
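
To give a feel for what a column-selecting utility like csvcut does, here is a rough standard-library sketch. It is not csvkit's implementation. The function name `cut_columns` and the sample data are hypothetical, and the real tool offers many more options.

```python
import csv
import io


def cut_columns(csv_text, wanted):
    """Return CSV text containing only the named columns, in the requested order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    buffer = io.StringIO()
    # extrasaction="ignore" drops every column that is not listed in `wanted`.
    writer = csv.DictWriter(
        buffer, fieldnames=wanted, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical sample input.
sample = "name,age,city\nAda,36,London\nLinus,28,Helsinki\n"
result = cut_columns(sample, ["name", "city"])
print(result)
```

On the command line, csvkit exposes this kind of operation directly, so no script is needed for everyday column selection.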
