Table text recognition (table OCR) can identify content such as personal information, merchandise, and publicity details on paper registration forms and quickly convert it into an electronic table. The structured output can be used to collate and tabulate registration information, which significantly reduces the labor cost of manual data entry and makes information management more convenient.
I. Accessing the platform
This step is relatively simple and needs little elaboration. You can refer to an earlier post:
https://ai.baidu.com/forum/topic/show/943162
II. Analyzing the interface documentation
1. Open the API documentation page and analyze the interface requirements
https://ai.baidu.com/docs#/OCR-API/87932804
(1) Interface description
The interface extracts and recognizes the text in a table image and outputs, in structured form, the header, the footer, and the text of each cell. It can recognize regular tables as well as tables with merged cells, and the result can be returned in either JSON or Excel format.
(2) Request description
The information needed is:
Request URL: https://aip.baidubce.com/rest/2.0/solution/v1/form_ocr/request
Header format: Content-Type: application/x-www-form-urlencoded
The request parameters are placed in the body. The ones used in this post are image, is_sync, and request_type.
This is an asynchronous interface split into two APIs: one that submits the request and one that fetches the result. The key parameter is is_sync. When its value is "false", the recognition result has to be fetched through the result interface. When its value is "true", the recognition result is returned synchronously and the result interface does not need to be called. If one call is enough there is no reason to make two, so simply set the parameter to "true".
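The two modes differ only in the is_sync value sent in the form body. Below is a minimal sketch of building that body. It assumes the parameter names image, is_sync, and request_type that the source code in section IV uses; the helper name build_form_body is hypothetical and not part of any Baidu SDK.

```python
import base64
from urllib.parse import urlencode, parse_qs


def build_form_body(image_bytes, sync=True, request_type="json"):
    """Build the x-www-form-urlencoded body for the submit-request interface.

    Hypothetical helper: the parameter names follow the source code in this post.
    """
    params = {
        "image": base64.b64encode(image_bytes),   # Base64-encoded image data
        "is_sync": "true" if sync else "false",   # "true": result comes back directly
        "request_type": request_type,             # "json" or "excel"
    }
    # urlencode percent-escapes the '+', '/' and '=' characters of Base64
    return urlencode(params).encode("utf-8")


if __name__ == "__main__":
    body = build_form_body(b"fake image bytes", sync=True, request_type="excel")
    print(parse_qs(body.decode("utf-8")))
```

The returned bytes can be passed as the data argument of urllib.request.Request, together with the Content-Type header listed above.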
(3) Return parameters
Example response:
{"result": {"result_data":"http://bj.bcebos.com/v1/ai-edgecloud/4F00EC7AED4E4827BD517CB105E56DEB?authorization=bce-auth-v1%2Ff86a2044998643b5abc89b59158bad6d%2F2019-08-10T07%3A28%3A13Z%2F172800%2F%2F374c64232876bcbe78a54105e438a97376f530788e5386e04f67d0cba4935f3d", "ret_msg":"\xe5\xb7\xb2\xe5\xae\x8c\xe6\x88\x90", "percent":100, "ret_code":3}, "log_id":1565422091617865}
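To make the response fields concrete, here is a minimal sketch that pulls result_data out of a response shaped like the sample. It assumes that ret_code 3 together with percent 100 marks a finished job, which is inferred only from this one sample. The helper name extract_result_url and the shortened URL are illustrative.

```python
import json


def extract_result_url(response_text):
    """Return result_data if the job looks finished, otherwise None.

    Assumption: ret_code == 3 and percent == 100 mean "finished",
    as in the sample response; other codes are not handled here.
    """
    payload = json.loads(response_text)
    result = payload.get("result", {})
    if result.get("ret_code") == 3 and result.get("percent") == 100:
        return result.get("result_data")
    return None


if __name__ == "__main__":
    # Shortened, illustrative response; the real result_data URL is much longer.
    sample = json.dumps({
        "result": {
            "result_data": "http://bj.bcebos.com/v1/ai-edgecloud/EXAMPLE",
            "ret_msg": "已完成",
            "percent": 100,
            "ret_code": 3,
        },
        "log_id": 1565422091617865,
    })
    print(extract_result_url(sample))
```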
2. Obtaining the access_token
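The token is appended to the request URL as a query parameter, as in the following request sample (ported here to Python 3):

    # encoding:utf-8
    import base64
    import urllib.parse
    import urllib.request

    request_url = "https://aip.baidubce.com/rest/2.0/solution/v1/form_ocr/request"

    # Open the image file in binary mode
    with open('[local file]', 'rb') as f:
        img = base64.b64encode(f.read())

    # The original sample passed an undefined name `data`; send the Base64 image instead
    params = {"image": img}
    params = urllib.parse.urlencode(params).encode('utf-8')

    access_token = '[token obtained from the authentication interface]'
    request_url = request_url + "?access_token=" + access_token
    request = urllib.request.Request(url=request_url, data=params)
    request.add_header('Content-Type', 'application/x-www-form-urlencoded')
    response = urllib.request.urlopen(request)
    content = response.read()
    if content:
        print(content)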
III. Recognition results
1.
Recognition result:
2.
Recognition result:
3.
Recognition result:
4.
Recognition result:
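Conclusions:
Recognition results: complex tables in several different layouts were used for testing, and the results were fairly accurate, which can greatly reduce data entry work.
Processing speed: each image takes 3-5 s to process, which is acceptable.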
IV. Source code
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import urllib.parse
    import urllib.request
    import base64
    import json
    import time

    # client_id is the AK and client_secret is the SK obtained from the official site
    client_id = '*******************'
    client_secret = '*********************'

    # Get the token
    def get_token():
        host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=' + client_id + '&client_secret=' + client_secret
        request = urllib.request.Request(host)
        request.add_header('Content-Type', 'application/json; charset=UTF-8')
        response = urllib.request.urlopen(request)
        token_content = response.read()
        if token_content:
            token_info = json.loads(token_content.decode("utf-8"))
            token_key = token_info['access_token']
            return token_key

    # Read the image
    def get_file_content(filePath):
        with open(filePath, 'rb') as fp:
            return fp.read()

    # Get the table information (the function name is kept from the original code)
    def get_license_plate(path):
        request_url = "https://aip.baidubce.com/rest/2.0/solution/v1/form_ocr/request"
        f = get_file_content(path)
        access_token = get_token()
        print(access_token)
        img = base64.b64encode(f)
        # params = {"image": img, "is_sync": 'true', "request_type": 'json'}
        params = {"image": img, "is_sync": 'true', "request_type": 'excel'}
        params = urllib.parse.urlencode(params).encode('utf-8')
        request_url = request_url + "?access_token=" + access_token
        tic = time.perf_counter()  # time.clock() was removed in Python 3.8
        request = urllib.request.Request(url=request_url, data=params)
        request.add_header('Content-Type', 'application/x-www-form-urlencoded')
        response = urllib.request.urlopen(request)
        content = response.read()
        toc = time.perf_counter()
        print('Processing time: ' + '%.2f' % (toc - tic) + ' s')
        if content:
            print(content)
            license_plates = json.loads(content.decode("utf-8"))
            # With request_type 'excel', result_data is the download URL of the Excel file
            excel_url = license_plates['result']['result_data']
            excel = urllib.request.urlopen(excel_url)
            with open("sbg.xls", "wb") as code:
                code.write(excel.read())
            return content
        else:
            return ''

    # Raw string, so the backslashes in the Windows path are not treated as escapes
    image_path = r'F:\paddle\sbg\s6.jpg'
    get_license_plate(image_path)
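The get_token function builds its URL by string concatenation, which works as long as the keys contain no characters that need escaping. One possible variant, offered only as a sketch, lets urllib.parse.urlencode build the query string. The helper name build_token_url is hypothetical and the keys shown are placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Same token endpoint that get_token uses
TOKEN_ENDPOINT = "https://aip.baidubce.com/oauth/2.0/token"


def build_token_url(client_id, client_secret):
    """Build the token request URL with escaped query parameters."""
    query = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    })
    return TOKEN_ENDPOINT + "?" + query


if __name__ == "__main__":
    url = build_token_url("my-ak", "my-sk")  # placeholder keys
    print(url)
    print(parse_qs(urlsplit(url).query))
```

The resulting URL can be passed to urllib.request.Request in place of the concatenated host string.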
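V. Comments and suggestions
1. The overall recognition quality is good, but accuracy still has room for improvement and the handling of details could be more thorough. For example, in complex tables text sometimes runs across rows, and individual characters are lost or misrecognized.
2. Handwritten text inside tables is recognized poorly. I suggest adding support for recognizing handwritten input.
Author: wangwei8638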