Python 3 Web Crawling in Practice, Part 44: Recognizing Click-Based CAPTCHAs

In the previous section we implemented recognition of the GeeTest CAPTCHA. Besides GeeTest, there is another common and widely used kind of CAPTCHA, of which the click-based (touch) CAPTCHA is the most representative.

You may not know the name, but you have certainly seen one. The CAPTCHA on 12306 is a typical click-based CAPTCHA, as shown in Figure 8-18:

Figure 8-18 The 12306 CAPTCHA

We need to click the pictures that match the prompt. Verification succeeds only if every answer is correct; if any answer is wrong, verification fails. This kind of CAPTCHA can be called a click-based CAPTCHA.

There is also a site dedicated to providing click-based CAPTCHA services, called TouClick. Its official website is https://www.touclick.com/. This section uses it as an example to explain how this kind of CAPTCHA can be recognized.

1. Goal of This Section

Our goal in this section is to recognize and pass click-based CAPTCHAs with a program.

2. Preparation

The Python library used here is Selenium and the browser is Chrome. Before starting, make sure that the Selenium library and the Chrome browser are installed and that ChromeDriver is configured. The relevant steps are described in Chapter 1.

3. Understanding Click-Based CAPTCHAs

The CAPTCHA style used on the TouClick official website is shown in Figure 8-19:

Figure 8-19 CAPTCHA style

It is similar to the one on 12306, except that here you click text inside the picture rather than whole pictures. Click-based CAPTCHAs come in various other forms whose interaction may differ slightly, but the basic principle is the same.

Next, let's implement a unified recognition flow for this kind of click-based CAPTCHA.

4. Recognition Approach

If we rely on image recognition alone, this kind of CAPTCHA is very hard to solve.

Take 12306 as an example. There are two difficulties. The first is text recognition, as shown in Figure 8-20:

Figure 8-20 The 12306 CAPTCHA

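For example, in the prompt "click all the funnels in the picture", the word "funnel" (漏斗) has been distorted, scaled, and blurred. If we try to recognize it with the OCR techniques covered earlier, accuracy drops sharply and we may get no result at all. The second difficulty is image recognition: we need to turn each picture back into text. Various image recognition APIs can help, but in my tests the rate of correct results was very low, with frequent mismatches or no match at all. The pictures themselves are also not very sharp, which makes recognition harder still. On top of that, all eight pictures must be recognized at once, and every one of the expected answers must match exactly for verification to pass. Overall, this approach is basically not feasible.

Now consider TouClick, as shown in Figure 8-21: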

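Figure 8-21 CAPTCHA example

We need to recognize the two characters 植株 ("plant") in this picture, but the background interferes to some degree, so OCR returns almost nothing. Some may ask: why not just recognize the white text? But what if the CAPTCHA changes, as in Figure 8-22?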

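Figure 8-22 CAPTCHA example

In this CAPTCHA the text has turned blue and also has a white shadow, which makes recognition much harder again.

So is there no way to solve this kind of CAPTCHA? Of course there is. How? By relying on people.

People? Then what is the program for? Don't worry: "people" here does not mean solving it ourselves. There are many CAPTCHA-solving platforms on the Internet that offer recognition services 24 hours a day, 7 days a week. An image is recognized within a few seconds, with accuracy above 90%. The service has to be paid for, since the platforms need to make a profit, but recognizing one CAPTCHA costs only a few fen (hundredths of a yuan).

The platform I personally recommend is Chaojiying (超级鹰). Its official website is https://www.chaojiying.com (this is not an advertisement).

It offers a wide range of services and can recognize many types of CAPTCHA, including the click-based kind discussed here.

Chaojiying also supports simple image CAPTCHAs. If OCR struggles with one, the same method described in this section can be used to recognize it through the platform. Some of the services it offers are: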

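  • English letters and digits: mixed recognition of up to 20 characters
  • Chinese characters: recognition of up to 7 characters
  • English letters only: recognition of up to 12 characters
  • Digits only: recognition of up to 11 digits
  • Arbitrary special characters: variable-length Chinese characters, English letters and digits, pinyin initials, arithmetic problems, idiom mixes, container numbers, and similar
  • Coordinate selection: recognition that returns multiple coordinates, such as complex arithmetic problems, four-choice questions, Q&A questions, and clicking identical characters, objects, or animals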

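Details may change; the official site is authoritative: https://www.chaojiying.com/price.html

What we need to solve in this section falls into the last category, multi-coordinate selection. We submit the CAPTCHA image to the platform, the platform returns the coordinates of the answers within the image, and we then parse the coordinates and simulate the clicks.

The principle is very simple. Let's try it out in a program.

5. Register an Account

Before starting, register a Chaojiying account and apply for a software ID. The registration page is https://www.chaojiying.com/user/reg/. After registering, add a software ID in the developer center of the admin console. The last step is to top up some points (题分); how many depends on the price and on how many CAPTCHAs you expect to recognize.

6. Get the API

With the preparation done, we can start connecting a program to the recognition service.

First, download the corresponding Python API from the official website.

The modified API is as follows: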


import requests
from hashlib import md5


class Chaojiying(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        # Only the MD5 hex digest of the password is stored and sent
        self.password = md5(password.encode('utf-8')).hexdigest()
        self.soft_id = soft_id
        # Parameters included in every request
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def post_pic(self, im, codetype):
        """
        im: image bytes
        codetype: CAPTCHA type code, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        # Upload the image as a multipart form file
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def report_error(self, im_id):
        """
        im_id: image ID of the CAPTCHA that was answered incorrectly
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

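This defines a Chaojiying class whose constructor takes three arguments: the Chaojiying username, the password, and the software ID. They are stored for later use.

The most important method is post_pic(). It takes the image (as bytes) and the CAPTCHA type code, sends them to the Chaojiying back end for recognition, and returns the resulting JSON.

The other method, report_error(), is for reporting a wrong answer. If a CAPTCHA was recognized incorrectly, calling this method refunds the corresponding points.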
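As a rough sketch of how report_error() might be wired in, the helper below submits an image, treats a non-zero err_no as a failure, and asks for a refund when the target site rejects the answer. The function name solve_with_feedback and the accepted callback are hypothetical. The field names (err_no, err_str, pic_id, pic_str) are taken from the sample response shown in section 9, and client is assumed to be any object exposing the post_pic() and report_error() methods defined above.

```python
def solve_with_feedback(client, image_bytes, codetype, accepted):
    """Submit a CAPTCHA and request a refund if the answer is rejected.

    client   -- object with post_pic(im, codetype) and report_error(im_id)
    accepted -- hypothetical callback: takes pic_str and returns True if
                the target site accepted the answer
    """
    result = client.post_pic(image_bytes, codetype)
    # Assumption: a non-zero err_no means the platform returned no answer.
    if result.get('err_no') != 0:
        raise RuntimeError('recognition failed: %s' % result.get('err_str'))
    if not accepted(result['pic_str']):
        # Wrong answer: report it so the platform can refund the points.
        client.report_error(result['pic_id'])
        return None
    return result['pic_str']
```

Passing the client in as a parameter keeps the helper independent of the network, so it can be exercised with a stub object before spending any points.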

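Next we use the TouClick website to demonstrate the recognition of a click-based CAPTCHA. The link is http://admin.touclick.com/. If you do not have an account, register one first.

7. Initialization

First we need to initialize a few things, such as the WebDriver and the Chaojiying object. The code is as follows: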


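import time
from io import BytesIO

from PIL import Image
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Chaojiying is the class defined in the previous section

# TouClick login email and password
EMAIL = '[email protected]'
PASSWORD = ''

# Chaojiying username, password, software ID, and CAPTCHA type code
CHAOJIYING_USERNAME = 'Germey'
CHAOJIYING_PASSWORD = ''
CHAOJIYING_SOFT_ID = 893590
CHAOJIYING_KIND = 9102


class CrackTouClick():
    def __init__(self):
        self.url = 'http://admin.touclick.com/login.html'
        self.browser = webdriver.Chrome()
        # Wait up to 20 seconds for elements to appear
        self.wait = WebDriverWait(self.browser, 20)
        self.email = EMAIL
        self.password = PASSWORD
        self.chaojiying = Chaojiying(CHAOJIYING_USERNAME, CHAOJIYING_PASSWORD, CHAOJIYING_SOFT_ID)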

Replace the account names and passwords here with your own.
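If you would rather not keep credentials in the source file, one option is to read them from environment variables. The following is only a minimal sketch: the function name load_credentials and all of the environment variable names are hypothetical and are not part of TouClick or Chaojiying.

```python
import os


def load_credentials(env=os.environ):
    """Read account settings from environment variables (names are hypothetical)."""
    return {
        'email': env.get('TOUCLICK_EMAIL', ''),
        'password': env.get('TOUCLICK_PASSWORD', ''),
        'chaojiying_username': env.get('CHAOJIYING_USERNAME', ''),
        'chaojiying_password': env.get('CHAOJIYING_PASSWORD', ''),
        # The software ID is numeric in the listing above, so convert it.
        'chaojiying_soft_id': int(env.get('CHAOJIYING_SOFT_ID', '0')),
    }
```

The env parameter defaults to os.environ, and passing a plain dict instead makes the function easy to try out without touching the real environment.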

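8. Get the CAPTCHA

The first step is to fill in the form and then simulate a click to bring up the CAPTCHA. This step is very simple. The code is as follows: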

def open(self):
    """
    Open the page and enter the email and password
    :return: None
    """
    self.browser.get(self.url)
    email = self.wait.until(EC.presence_of_element_located((By.ID, 'email')))
    password = self.wait.until(EC.presence_of_element_located((By.ID, 'password')))
    email.send_keys(self.email)
    password.send_keys(self.password)

def get_touclick_button(self):
    """
    Get the initial verification button
    :return: button element
    """
    button = self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'touclick-hod-wrap')))
    return button

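Here the open() method fills in the form, and get_touclick_button() gets the CAPTCHA button, which we then click.

Next, as with the GeeTest CAPTCHA image in the previous section, we first get the position and size of the CAPTCHA image and then crop the corresponding region out of a screenshot of the page. The code is as follows: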


def get_touclick_element(self):
    """
    Get the CAPTCHA image element
    :return: image element
    """
    element = self.wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'touclick-pub-content')))
    return element

def get_position(self):
    """
    Get the position of the CAPTCHA
    :return: tuple (top, bottom, left, right)
    """
    element = self.get_touclick_element()
    # Pause briefly so the CAPTCHA can finish rendering
    time.sleep(2)
    location = element.location
    size = element.size
    top, bottom = location['y'], location['y'] + size['height']
    left, right = location['x'], location['x'] + size['width']
    return (top, bottom, left, right)

def get_screenshot(self):
    """
    Take a screenshot of the page
    :return: screenshot as an Image object
    """
    screenshot = self.browser.get_screenshot_as_png()
    screenshot = Image.open(BytesIO(screenshot))
    return screenshot

def get_touclick_image(self, name='captcha.png'):
    """
    Get the CAPTCHA image
    :return: Image object
    """
    top, bottom, left, right = self.get_position()
    print('CAPTCHA position', top, bottom, left, right)
    screenshot = self.get_screenshot()
    # Image.crop expects the box as (left, top, right, bottom)
    captcha = screenshot.crop((left, top, right, bottom))
    return captcha

Here get_touclick_image() crops the CAPTCHA image out of the page screenshot, using the position coordinates returned by get_position(). The result is an Image object.
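One thing that may need attention: Selenium reports an element's location and size in CSS pixels, while on some high-DPI displays the screenshot has more pixels than that, so the crop can land in the wrong place. The sketch below computes the crop box with an optional scale factor. The helper name crop_box and the pixel_ratio parameter are hypothetical, and whether scaling is needed at all depends on your display and browser setup.

```python
def crop_box(location, size, pixel_ratio=1.0):
    """Return (left, top, right, bottom) in screenshot pixels.

    location, size -- dicts shaped like Selenium's element.location
                      and element.size
    pixel_ratio    -- assumed ratio of screenshot pixels to CSS pixels
                      (for example 2.0 on some high-DPI displays)
    """
    left = location['x'] * pixel_ratio
    top = location['y'] * pixel_ratio
    right = (location['x'] + size['width']) * pixel_ratio
    bottom = (location['y'] + size['height']) * pixel_ratio
    # Pillow's Image.crop takes the box in this order.
    return tuple(int(round(v)) for v in (left, top, right, bottom))
```

In Selenium the ratio can usually be read with browser.execute_script('return window.devicePixelRatio'). If the screenshot is scaled, the coordinates returned by the platform are in screenshot pixels as well, so they would probably need dividing by the same ratio before clicking.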

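9. Recognize the CAPTCHA

Next we call the Chaojiying object's post_pic() method to send the image to the Chaojiying back end. The image is sent as a byte stream. The code is as follows: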


image = self.get_touclick_image()
# Serialize the Image object to PNG bytes in memory
bytes_array = BytesIO()
image.save(bytes_array, format='PNG')
# Recognize the CAPTCHA
result = self.chaojiying.post_pic(bytes_array.getvalue(), CHAOJIYING_KIND)
print(result)

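After this runs, result holds the recognition result from the Chaojiying back end. It may take a few seconds, since people on the back end are doing the recognition.

The result is JSON. A typical successful response looks like this: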


{'err_no':  0,  'err_str':  'OK',  'pic_id':  '6002001380949200001',  'pic_str':  '132,127|56,77',  'md5':  '1f8e1d4bef8b11484cb1f1f34299865b'}

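Here pic_str holds the coordinates of the recognized text. It is returned as a string in which the coordinates are separated by |, so we only need to parse it and then simulate the clicks. The code is as follows: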


def get_points(self, captcha_result):
    """
    Parse the recognition result
    :param captcha_result: recognition result
    :return: list of [x, y] coordinates
    """
    groups = captcha_result.get('pic_str').split('|')
    locations = [[int(number) for number in group.split(',')] for group in groups]
    return locations

def touch_click_words(self, locations):
    """
    Click the CAPTCHA image at each location
    :param locations: click positions
    :return: None
    """
    for location in locations:
        print(location)
        ActionChains(self.browser).move_to_element_with_offset(self.get_touclick_element(), location[0], location[1]).click().perform()
        # Pause between clicks
        time.sleep(1)

Here get_points() turns the recognition result into a list, and touch_click_words() passes each parsed coordinate to move_to_element_with_offset() and then clicks.
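One thing to check against your Selenium version: as far as I know, from Selenium 4.3 onward move_to_element_with_offset() measures offsets from the center of the element rather than from its top-left corner, while the coordinates returned by the platform are measured from the top-left corner of the submitted image. A minimal sketch of the conversion, assuming that behavior (to_center_offsets is a hypothetical helper name):

```python
def to_center_offsets(locations, width, height):
    """Convert top-left based (x, y) points to offsets from the element center.

    width, height -- element size in the same units as the points
    """
    half_w, half_h = width / 2, height / 2
    # Round to integers because pointer offsets are whole pixels.
    return [(int(round(x - half_w)), int(round(y - half_h)))
            for x, y in locations]
```

With a Selenium version that still uses the top-left corner as the origin, the parsed coordinates can be passed through unchanged. Otherwise the element's size dict would supply width and height.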

With this we can simulate clicking the coordinates. The result is shown in Figure 8-23:

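Figure 8-23 Click result

Finally, click the button that submits the verification, wait for verification to pass, and then click the login button to log in. The rest of the implementation is not covered here.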

We have now recognized a click-based CAPTCHA with the help of an online CAPTCHA-solving platform. The method is general: other CAPTCHAs, such as the one on 12306, can be recognized on exactly the same principle.

10. Conclusion

In this section we used an online CAPTCHA-solving platform to help recognize a CAPTCHA. This approach is very powerful and can handle almost any CAPTCHA. When a CAPTCHA proves hard to recognize, using such a platform is an excellent choice.

Origin blog.51cto.com/14445003/2427545