Python Crawler Simulated Login (6): CAPTCHA Recognition, Part 1

This part uses the third-party library pytesseract together with PIL:

pip install pytesseract

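Recognition works reasonably well on simple images without interference lines or noise.

pytesseract is only a wrapper, so the Tesseract engine itself (tesseract-ocr) has to be installed separately. Recognizing Chinese additionally needs the Chinese language data, such as chi_sim.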
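
Before running any OCR it can help to confirm that the Tesseract binary is actually reachable. The following is a minimal sketch using only the standard library; the helper name `find_tesseract` is made up for this illustration, and it assumes the executable is called `tesseract` on your system.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary, or None if it is not on PATH.

    `find_tesseract` is a hypothetical helper for this post, not part of pytesseract.
    """
    return shutil.which(executable)


path = find_tesseract()
if path is None:
    # In this case pytesseract.pytesseract.tesseract_cmd must point at the binary
    print("tesseract is not on PATH")
else:
    print("tesseract found at", path)
```

If the lookup returns `None`, either install tesseract-ocr or set `pytesseract.pytesseract.tesseract_cmd` to the full path, as the Quickstart below describes.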


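    from io import BytesIO

    import pytesseract
    from PIL import Image

    # `s` (assumed to be the requests session) and `imgurl` (the captcha URL)
    # come from the earlier parts of this series
    imgbuf = s.get(imgurl).content

    # Wrap the downloaded bytes in an in-memory file so PIL can open them
    img = Image.open(BytesIO(imgbuf))
    img.show()

    # image_to_string usually returns trailing whitespace, so strip it
    vercode = pytesseract.image_to_string(img).strip()
    print("Verification Code:", vercode)
    # vercode = input("Verification Code:")  # manual fallback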
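
The text returned by `image_to_string` often contains stray spaces, a trailing newline, or a form-feed character, which would break the login request if submitted as-is. Below is a small sketch of a clean-up step. It assumes the captcha consists only of ASCII letters and digits, and `clean_captcha` is a hypothetical helper name, not part of pytesseract.

```python
import re


def clean_captcha(raw, length=None):
    """Keep only ASCII letters and digits from raw OCR output.

    If `length` is given and the cleaned text has a different length,
    return None so the caller can retry or fall back to manual input.
    """
    text = re.sub(r"[^0-9A-Za-z]", "", raw)
    if length is not None and len(text) != length:
        return None
    return text


print(clean_captcha("a B3 d\n\x0c"))  # aB3d
```

Returning `None` on a length mismatch is just one possible design; it makes it easy to re-download the captcha instead of submitting a guess that is known to be wrong.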

Result:



Quickstart: https://pypi.org/project/pytesseract/

**Quickstart**

.. code-block:: python

    try:
        import Image
    except ImportError:
        from PIL import Image
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = '<full_path_to_your_tesseract_executable>'
    # Include the above line, if you don't have tesseract executable in your PATH
    # Example tesseract_cmd: 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract'

    # Simple image to string
    print(pytesseract.image_to_string(Image.open('test.png')))

    # French text image to string
    print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

    # Get bounding box estimates
    print(pytesseract.image_to_boxes(Image.open('test.png')))

    # Get verbose data including boxes, confidences, line and page numbers
    print(pytesseract.image_to_data(Image.open('test.png')))

    # Get information about orientation and script detection
    print(pytesseract.image_to_osd(Image.open('test.png')))

Support for OpenCV image/NumPy array objects

.. code-block:: python

    import cv2

    img = cv2.imread('/**path_to_image**/digits.png')
    print(pytesseract.image_to_string(img))
    # OR convert explicitly beforehand
    print(pytesseract.image_to_string(Image.fromarray(img)))

If you get a tessdata error such as "Error opening data file...", add the following config:

.. code-block:: python

    tessdata_dir_config = '--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
    # Example config: '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
    # It's important to add double quotes around the dir path.

    pytesseract.image_to_string(image, lang='chi_sim', config=tessdata_dir_config)
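
For captchas it is common to combine several options in the same `config` string. The sketch below assembles one from optional pieces. `build_config` is a hypothetical helper, not a pytesseract function; `--psm` and the `tessedit_char_whitelist` variable are Tesseract options, but whether the whitelist is honored depends on the Tesseract version and engine mode, so treat that part as an assumption to verify locally.

```python
def build_config(tessdata_dir=None, psm=None, whitelist=None):
    """Assemble a Tesseract config string from optional pieces.

    Hypothetical helper for illustration; it only builds the string.
    """
    parts = []
    if tessdata_dir:
        # Double quotes keep paths containing spaces in one argument
        parts.append('--tessdata-dir "{}"'.format(tessdata_dir))
    if psm is not None:
        # Page segmentation mode, e.g. 7 treats the image as a single text line
        parts.append("--psm {}".format(psm))
    if whitelist:
        # Restrict recognition to the given characters (version dependent)
        parts.append("-c tessedit_char_whitelist={}".format(whitelist))
    return " ".join(parts)


print(build_config(psm=7, whitelist="0123456789"))
# --psm 7 -c tessedit_char_whitelist=0123456789
```

The resulting string would be passed as `config=build_config(...)` to `pytesseract.image_to_string`.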


**Functions**

* **get_tesseract_version** Returns the Tesseract version installed in the system.

* **image_to_string** Returns the result of a Tesseract OCR run on the image as a string

* **image_to_boxes** Returns result containing recognized characters and their box boundaries

* **image_to_data** Returns result containing box boundaries, confidences, and other information. Requires Tesseract 3.05+. For more information, please check the `Tesseract TSV documentation <https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#tsv-output-currently-available-in-305-dev-in-master-branch-on-github>`_

* **image_to_osd** Returns result containing information about orientation and script detection.
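
The string returned by `image_to_boxes` follows Tesseract's box-file layout: one character per line, followed by its left, bottom, right, and top coordinates and a page number, with the origin at the bottom-left corner of the image. The sketch below parses such a string into named tuples. `CharBox` and `parse_boxes` are hypothetical names, and the sample text is hand-written rather than real OCR output.

```python
from typing import NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_boxes(box_text):
    """Parse box-file text ("char left bottom right top page" per line)."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, *numbers = parts
        boxes.append(CharBox(char, *map(int, numbers)))
    return boxes


# Hand-written sample standing in for pytesseract.image_to_boxes(img)
sample = "7 12 5 30 40 0\nk 34 5 50 40 0\n"
for box in parse_boxes(sample):
    print(box.char, box.right - box.left)
```

Per-character boxes like these can be used, for example, to crop individual captcha characters for closer inspection.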

**Parameters**

``image_to_data(image, lang=None, config='', nice=0, output_type=Output.STRING)``

* **image** Object, PIL Image/NumPy array of the image to be processed by Tesseract

* **lang** String, Tesseract language code string

* **config** String, Any additional configurations as a string, ex: ``config='--psm 6'``

* **nice** Integer, modifies the processor priority for the Tesseract run. Not supported on Windows. Nice adjusts the niceness of unix-like processes.

* **output_type** Class attribute, specifies the type of the output, defaults to ``string``. For the full list of all supported types, please check the definition of `pytesseract.Output <https://github.com/madmaze/pytesseract/blob/master/src/pytesseract.py>`_ class.
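
As an illustration of `output_type`, calling `pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)` returns a dictionary of column lists, including `text` and `conf`. The sketch below filters recognized words by confidence. `confident_words` is a hypothetical helper, the threshold of 60 is an arbitrary assumption, and the sample dictionary is hand-written in the same shape so the example runs without Tesseract.

```python
def confident_words(data, min_conf=60.0):
    """Return the non-empty words whose confidence is at least `min_conf`.

    `data` is expected to have parallel "text" and "conf" lists. Values are
    converted defensively because their types vary across pytesseract versions.
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        word = str(text).strip()
        if word and float(conf) >= min_conf:
            words.append(word)
    return words


# Hand-written sample in the shape of Output.DICT, not real OCR output
sample = {"text": ["", "7k9Q", "noise"], "conf": ["-1", 91, 12.5]}
print(confident_words(sample))  # ['7k9Q']
```

For a login crawler, an empty result is a reasonable signal to fetch a fresh captcha rather than submit a low-confidence guess.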


Reposted from blog.csdn.net/M_N_N/article/details/80862683