Another Way to Recognize CAPTCHAs with Python

Overview

Introduction

Pitfall!

Installing Tesseract-OCR

Recognizing CAPTCHAs with pytesseract

Advanced: removing lines

Introduction

First of all, a simple CAPTCHA looks like this:

code.jpg

Not like this:

image.png

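pytesseract is used here for CAPTCHA recognition. It is built on Google's Tesseract-OCR, so Tesseract-OCR must be installed before use. PIL is used for image processing. By default pytesseract supports the tiff and bmp image formats; with the PIL library it can also handle other formats such as jpeg, gif, and png.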
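As a rough sketch of how the two libraries fit together, the snippet below opens an image with Pillow and hands it to pytesseract. The helper name `recognize_captcha` is an assumption made for illustration; `pytesseract.image_to_string` is the library call that performs the recognition, and it only works once Tesseract-OCR itself is installed.

```python
def recognize_captcha(path):
    """Return the text that Tesseract reads from the image at `path`.

    `recognize_captcha` is a hypothetical helper name used for illustration.
    """
    # Imported inside the function so that this file can still be loaded
    # when Pillow or pytesseract has not been installed yet.
    from PIL import Image
    import pytesseract

    with Image.open(path) as im:
        # image_to_string runs the tesseract binary on the image
        # and returns the recognized text as a str.
        return pytesseract.image_to_string(im).strip()


# Example call, assuming the sample file code.jpg sits next to the script:
# print(recognize_captcha("code.jpg"))
```

On a clean, simple CAPTCHA this direct call may already be enough; noisier images usually need preprocessing first.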

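Pitfall!

The PIL (Python Imaging Library) package only supports 32-bit systems; to use it on a 64-bit system, install pillow instead. This one really tripped me up, and I spent a long time fiddling with the installation. Hopefully this saves you the trouble.

pillow documentation (Chinese)

Where pillow comes from: PIL only supports Python up to 2.7 and had gone unmaintained for years, so a group of volunteers created a compatible version based on PIL, named Pillow, which supports the latest Python 3.x and adds many new features.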

32-bit systems

pip install PIL

64-bit systems

pip install pillow
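After installing, it can help to confirm what actually ended up in the environment. The following is a minimal sketch using only the standard library; the function name `pillow_status` is an assumption for illustration. It relies on the fact that Pillow is imported under the name `PIL`, the same import name the original library used.

```python
import importlib.util
from importlib.metadata import PackageNotFoundError, version


def pillow_status():
    """Report whether `PIL` is importable and which Pillow version is installed."""
    # Pillow installs under the import name "PIL".
    importable = importlib.util.find_spec("PIL") is not None
    try:
        pillow_version = version("Pillow")
    except PackageNotFoundError:
        # No Pillow distribution is installed in this environment.
        pillow_version = None
    return importable, pillow_version


importable, pillow_version = pillow_status()
print("PIL importable:", importable)
print("Pillow version:", pillow_version)
```

If `PIL` is importable but no Pillow version is reported, the environment is probably still using the original PIL.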

Installing Tesseract-OCR

Before using pytesseract you must install tesseract-ocr, because pytesseract depends on it and cannot work without it.

Mac

brew install tesseract

CentOS 7

yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
yum update
yum install tesseract
yum install tesseract-langpack-deu

Windows

download-address
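Whichever platform you are on, a quick way to check that the installation worked is to look for the `tesseract` executable on `PATH`. The sketch below uses only the standard library; `find_tesseract` is a hypothetical helper name.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; pytesseract will not be able to run it")
else:
    # `tesseract --version` prints a version banner (some builds write it to stderr).
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    banner = (result.stdout or result.stderr).splitlines()
    print(banner[0] if banner else "tesseract found at " + path)
```

If the executable is installed somewhere that is not on `PATH` (common on Windows), pytesseract can be pointed at it by assigning the full path to `pytesseract.pytesseract.tesseract_cmd`.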

Recognizing CAPTCHAs with pytesseract

First, convert the image to grayscale:

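import io
from PIL import Image

# File name of the sample CAPTCHA
imgName = 'code.jpg'
# Load the image from a file path
im = Image.open(imgName)
# Or load the image from a byte stream
# im = Image.open(io.BytesIO(b))
# Convert to grayscale
imgry = im.convert('L')
# Save the image
imgry.save('gray-' + imgName)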

The grayscale image looks like this:

gray-code.jpg
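To make the grayscale step less of a black box: the Pillow documentation describes the RGB to `'L'` conversion as the ITU-R 601-2 luma transform, L = R * 299/1000 + G * 587/1000 + B * 114/1000. The sketch below applies that formula to single pixels in plain Python. The function name `luma` is an assumption for illustration, and Pillow's internal rounding may differ slightly from this integer version.

```python
def luma(r, g, b):
    """Approximate the grey level that convert('L') assigns to an RGB pixel."""
    # Weights from the ITU-R 601-2 luma transform quoted in the Pillow docs.
    return (r * 299 + g * 587 + b * 114) // 1000


print(luma(255, 255, 255))  # white stays at the top of the range
print(luma(0, 0, 0))        # black stays at 0
print(luma(255, 0, 0))      # pure red becomes a fairly dark grey
```

Because green carries the largest weight, greenish strokes come out lighter than red or blue ones of the same intensity, which is worth keeping in mind when choosing a binarization threshold.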

Next, binarize the image:

# Binarize using threshold segmentation; threshold is the cut-off point
threshold = 140
table = []
for j in range(256):
    if j < threshold:
        table.append(0)
    else:
        table.append(1)
out = imgry.point(table, '1')
out.save('b' + imgName)
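The lookup table built above is just a list of 256 entries, one per possible grey level. As a small illustration, here is the same table written as a list comprehension and applied by hand to a few sample grey levels, without Pillow. The name `build_table` and the sample values are assumptions for illustration.

```python
def build_table(threshold):
    """Build a 256-entry lookup table: 0 below the threshold, 1 at or above it."""
    return [0 if level < threshold else 1 for level in range(256)]


table = build_table(140)

# Apply the table by hand to a few grey levels to see the effect.
sample_pixels = [0, 100, 139, 140, 200, 255]
print([table[p] for p in sample_pixels])  # [0, 0, 0, 1, 1, 1]
```

Raising the threshold pushes more pixels to 0 and lowering it pushes more to 1, so the value 140 is something to tune per CAPTCHA style rather than a fixed constant.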