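A while ago I read the post 爬虫攻防之前端策略简析 (a brief analysis of front-end anti-scraping strategies), which mentions that Maoyan Movies renders its daily box-office column in a custom font. The approach described there is to load the font with fontTools and manually label the coordinate points of a few digits. Then, on every refresh, when a new woff font arrives, the font is converted to XML with fontTools and the coordinate information is used to work out which Unicode value belongs to which digit. Finally, the "boxes" in the page source are replaced with the real digits.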
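To make the coordinate idea concrete, here is a minimal sketch of how I understand it, not the original blog's code. It assumes the outlines stay identical between refreshes and only the glyph names and code points change; if the site perturbs the outlines, exact matching will fail. All function names, the glyph name `uniE8C7`, and the file names are hypothetical. `outlines_from_font` expects an already opened fontTools `TTFont`.

```python
def signature(points):
    """Make a hashable signature from an iterable of (x, y) outline points."""
    return tuple((int(x), int(y)) for x, y in points)


def outlines_from_font(font):
    """Collect {glyph_name: signature} from an opened fontTools TTFont.

    Only simple glyphs with at least one contour are kept.
    """
    glyf = font["glyf"]
    return {
        name: signature(glyf[name].coordinates)
        for name in glyf.keys()
        if glyf[name].numberOfContours > 0
    }


def match_digits(labeled, new_outlines):
    """Map glyph names of a new font to digits by exact outline match.

    labeled: {digit: signature}, marked by hand once from a reference font.
    new_outlines: {glyph_name: signature} from the freshly downloaded font.
    """
    by_signature = {sig: digit for digit, sig in labeled.items()}
    return {
        name: by_signature[sig]
        for name, sig in new_outlines.items()
        if sig in by_signature
    }


def decode(text, cmap, glyph_to_digit):
    """Replace obfuscated characters with digits.

    cmap is {code point: glyph name}, e.g. from TTFont.getBestCmap().
    Characters without a matched glyph are left unchanged.
    """
    return "".join(glyph_to_digit.get(cmap.get(ord(ch)), ch) for ch in text)


# Usage sketch (needs fontTools; file and glyph names are hypothetical):
#   from fontTools.ttLib import TTFont
#   ref = outlines_from_font(TTFont("reference.woff"))
#   labeled = {"1": ref["uniE8C7"]}          # labeled by hand, one entry per digit
#   new_font = TTFont("new.woff")
#   mapping = match_digits(labeled, outlines_from_font(new_font))
#   print(decode(page_text, new_font.getBestCmap(), mapping))
```

The manual step is only the `labeled` dictionary; everything after that can run on each refresh.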
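The coordinate-labeling approach felt like too much trouble to me, so I tried rendering each glyph to an image and recognizing the digit with OCR instead. The code is as follows: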
from fontTools.ttLib import TTFont
from fontTools.pens.reportLabPen import ReportLabPen
from reportlab.graphics import renderPM
from reportlab.graphics.shapes import Group, Drawing, Path
from reportlab.lib import colors

path = 'E:/pycharm_workplace/black_list/d2080.woff'
font = TTFont(path)  # it would work just as well with fontTools.t1Lib.T1Font
glyf = font['glyf']
gs = font.getGlyphSet()  # build the glyph set once, outside the loop
for glyphName in glyf.keys():
    imageFile = "%s.png" % glyphName
    pen = ReportLabPen(gs, Path(fillColor=colors.black, strokeWidth=1))
    g = gs[glyphName]
    g.draw(pen)
    # Fall back to a fixed height when the font has no vertical metrics.
    w, h = g.width, g.height or 719
    # Everything is wrapped in a group to allow transformations.
    g = Group(pen.path)
    # g.translate(10, 200)
    # g.scale(0.3, 0.3)
    d = Drawing(w, h)
    d.add(g)
    image = renderPM.drawToPIL(d)
    image.show()
The images produced this way are extremely clean (one of them is the image below), and yet pytesseract cannot recognize them at all!
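One thing I did not try, but which may be worth checking at this point: by default Tesseract expects a page of text, while each rendered image holds a single large character. The sketch below is an untested assumption rather than a confirmed fix. It passes `--psm 10` (treat the image as a single character) and a digit whitelist, and adds a white margin around the glyph. The helper names and the margin value are made up for illustration.

```python
def single_char_config(whitelist="0123456789"):
    """Build a Tesseract config string for one isolated character."""
    # --psm 10 asks Tesseract to treat the image as a single character;
    # the whitelist restricts the candidate characters.
    return "--psm 10 -c tessedit_char_whitelist=" + whitelist


def recognize_glyph(image, margin=20):
    """Run OCR on one rendered glyph (a PIL image) and return the text.

    The margin of 20 pixels is an arbitrary guess, not a tuned value.
    """
    # Imported here so the config helper above works without these packages.
    import pytesseract
    from PIL import ImageOps

    # Convert to grayscale and add a white border around the glyph.
    padded = ImageOps.expand(image.convert("L"), border=margin, fill=255)
    return pytesseract.image_to_string(padded, config=single_char_config()).strip()
```

In the loop above this would be called as `recognize_glyph(image)` in place of `image.show()`.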
Maybe the images were simply too clean and too large, so I shrank them:
import pytesseract
from PIL import Image
from fontTools.ttLib import TTFont
from fontTools.pens.reportLabPen import ReportLabPen
from reportlab.graphics import renderPM
from reportlab.graphics.shapes import Group, Drawing, Path
from reportlab.lib import colors

path = 'E:/pycharm_workplace/black_list/ada5e56ac664e0088e72a725738b7c9d2080.woff'
font = TTFont(path)  # it would work just as well with fontTools.t1Lib.T1Font
glyf = font['glyf']
gs = font.getGlyphSet()  # build the glyph set once, outside the loop
for glyphName in glyf.keys():
    imageFile = "%s.png" % glyphName
    pen = ReportLabPen(gs, Path(fillColor=colors.black, strokeWidth=1))
    g = gs[glyphName]
    g.draw(pen)
    # Fall back to a fixed height when the font has no vertical metrics.
    w, h = g.width, g.height or 719
    # Everything is wrapped in a group to allow transformations.
    g = Group(pen.path)
    d = Drawing(w, h)
    d.add(g)
    image = renderPM.drawToPIL(d)
    # Shrink the glyph and paste it onto a small white canvas.
    little_image = image.resize((20, 35))
    fromImage = Image.new('RGBA', (20, 40), color=(255, 255, 255))
    fromImage.paste(little_image, (0, 2))
    print("::", pytesseract.image_to_string(fromImage))
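That is what the second attempt looks like, but the digits still are not recognized. It seems some training is needed before recognition will work.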
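Since the font only has to cover ten digits, "training" does not have to mean training Tesseract. As a swapped-in alternative, here is a minimal sketch of nearest-template matching: label the rendered bitmaps of one font by hand once, then classify a new glyph by the smallest number of differing pixels. It assumes every bitmap has been reduced to a 0/1 grid of the same size. The tiny 3x5 templates and all names are toy examples for illustration, not real glyph data.

```python
def pixel_distance(a, b):
    """Count differing pixels between two equally sized 0/1 grids."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same size")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(bitmap, templates):
    """Return the label of the template closest to `bitmap`."""
    return min(templates, key=lambda label: pixel_distance(bitmap, templates[label]))


def to_grid(rows):
    """Convert rows like '.#.' into lists of 0/1 pixels ('#' means ink)."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


# Toy templates standing in for hand-labeled glyph bitmaps.
TEMPLATES = {
    "0": to_grid(["###", "#.#", "#.#", "#.#", "###"]),
    "1": to_grid([".#.", "##.", ".#.", ".#.", "###"]),
    "7": to_grid(["###", "..#", ".#.", ".#.", ".#."]),
}

# A '7' with one flipped pixel is still closest to the '7' template.
noisy_seven = to_grid(["###", "..#", ".#.", "##.", ".#."])
print(classify(noisy_seven, TEMPLATES))
```

With the rendering code above, a grid could be built from each PIL image by resizing it to a fixed size, converting it to grayscale, and thresholding the pixel values.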