Cracking Douyin's font-based anti-scraping to get basic user data

Background

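The Douyin web page exposes a user's nickname, Douyin ID, signature, following count, follower count, likes received, post count, and liked count.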

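To prevent scraping, Douyin renders every numeric value with icon-font glyphs. If you fetch the page source directly, each digit shows up as a code containing letters (a character reference such as &#xe603;), not as a digit.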

This article shows how to work out which code corresponds to which digit.
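To make the problem concrete, here is a minimal Python sketch. It assumes each obfuscated digit arrives as an HTML numeric character reference of the form &#xe603; (the form the Java code later in this article handles). The sample string is illustrative and not copied from a real page.

```python
import html

# Illustrative fragment: assumed shape of one obfuscated digit in the page source.
raw = "&#xe603;"

ch = html.unescape(raw)      # decode the numeric character reference
code_point = ord(ch)

print(hex(code_point))       # 0xe603
print(ch.isdigit())          # False

# U+E000..U+F8FF is the Unicode Private Use Area: the code point has no
# standard meaning, so only the page's own font decides which glyph is drawn.
print(0xE000 <= code_point <= 0xF8FF)   # True
```

Because the code point carries no meaning of its own, the only way to recover the digit is to inspect the font that the page loads.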

Analyzing the font file

1. Open the Douyin web page. The browser requests a woff font file; copy its URL and download the font directly.


2. Use fontTools, a Python package, to inspect the font's code-point mapping

Install fontTools:

pip install fontTools

Use fontTools to dump the font file to XML:

from fontTools.ttLib import TTFont
font = TTFont(r'/Users/linchen/Downloads/iconfont_9eb9a50.woff')
font.saveXML('/Users/linchen/Downloads/font.xml')

The resulting XML file (excerpt; only the GlyphOrder and cmap tables are needed):

<?xml version="1.0" encoding="UTF-8"?>
<ttFont sfntVersion="\x00\x01\x00\x00" ttLibVersion="4.1">
 
  <GlyphOrder>
    <!-- The 'id' attribute is only for humans; it is ignored when parsed. -->
    <GlyphID id="0" name="glyph00000"/>
    <GlyphID id="1" name="x"/>
    <GlyphID id="2" name="num_"/>
    <GlyphID id="3" name="num_1"/>
    <GlyphID id="4" name="num_2"/>
    <GlyphID id="5" name="num_3"/>
    <GlyphID id="6" name="num_4"/>
    <GlyphID id="7" name="num_5"/>
    <GlyphID id="8" name="num_6"/>
    <GlyphID id="9" name="num_7"/>
    <GlyphID id="10" name="num_8"/>
    <GlyphID id="11" name="num_9"/>
  </GlyphOrder>
 
  ············
   
  <cmap>
    <tableVersion version="0"/>
    <cmap_format_4 platformID="0" platEncID="3" language="0">
      <map code="0x78" name="x"/><!-- LATIN SMALL LETTER X -->
      <map code="0xe602" name="num_"/><!-- ???? -->
      <map code="0xe603" name="num_1"/><!-- ???? -->
      <map code="0xe604" name="num_2"/><!-- ???? -->
      <map code="0xe605" name="num_3"/><!-- ???? -->
      <map code="0xe606" name="num_4"/><!-- ???? -->
      <map code="0xe607" name="num_5"/><!-- ???? -->
      <map code="0xe608" name="num_6"/><!-- ???? -->
      <map code="0xe609" name="num_7"/><!-- ???? -->
      <map code="0xe60a" name="num_8"/><!-- ???? -->
      <map code="0xe60b" name="num_9"/><!-- ???? -->
      <map code="0xe60c" name="num_4"/><!-- ???? -->
      <map code="0xe60d" name="num_1"/><!-- ???? -->
      <map code="0xe60e" name="num_"/><!-- ???? -->
      <map code="0xe60f" name="num_5"/><!-- ???? -->
      <map code="0xe610" name="num_3"/><!-- ???? -->
      <map code="0xe611" name="num_2"/><!-- ???? -->
      <map code="0xe612" name="num_6"/><!-- ???? -->
      <map code="0xe613" name="num_8"/><!-- ???? -->
      <map code="0xe614" name="num_9"/><!-- ???? -->
      <map code="0xe615" name="num_7"/><!-- ???? -->
      <map code="0xe616" name="num_1"/><!-- ???? -->
      <map code="0xe617" name="num_3"/><!-- ???? -->
      <map code="0xe618" name="num_"/><!-- ???? -->
      <map code="0xe619" name="num_4"/><!-- ???? -->
      <map code="0xe61a" name="num_2"/><!-- ???? -->
      <map code="0xe61b" name="num_5"/><!-- ???? -->
      <map code="0xe61c" name="num_8"/><!-- ???? -->
      <map code="0xe61d" name="num_9"/><!-- ???? -->
      <map code="0xe61e" name="num_7"/><!-- ???? -->
      <map code="0xe61f" name="num_6"/><!-- ???? -->
    </cmap_format_4>
     
    ············
 
  </cmap>
 
  ············
 
</ttFont>
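Reading the cmap entries by eye is error-prone, so they can also be pulled out with a script. The following is a minimal sketch using only the standard library. The embedded XML is a trimmed copy of the table above, and read_cmap is a helper name made up for this example.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the cmap table from the dump above (first few entries only).
CMAP_XML = """
<cmap>
  <cmap_format_4 platformID="0" platEncID="3" language="0">
    <map code="0x78" name="x"/>
    <map code="0xe602" name="num_"/>
    <map code="0xe603" name="num_1"/>
    <map code="0xe604" name="num_2"/>
  </cmap_format_4>
</cmap>
"""

def read_cmap(xml_text):
    """Return {code: glyph_name} for every <map> element in the XML text."""
    root = ET.fromstring(xml_text.strip())
    return {m.get("code"): m.get("name") for m in root.iter("map")}

code_to_glyph = read_cmap(CMAP_XML)
print(code_to_glyph["0xe602"])   # num_
```

For the full file written by saveXML, ET.parse(path).getroot() can replace ET.fromstring. The excerpt printed in this article will not parse as-is, because the dotted lines mark omitted content. A real dump may also contain several cmap subtables, in which case later entries with the same code overwrite earlier ones in this sketch.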

3. Find the glyph-to-digit mapping

Open an online font editor (for example https://font.qqe2.com/) and upload the font to see which digit each glyph actually draws.

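4. The final mapping

Combining the two relations (code point to glyph name from the cmap table, and glyph name to digit from the font editor) gives the final mapping: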

private static Map<String, String> analyCode = new HashMap<>(0);
static {
    analyCode.put("0xe602", "1");
    analyCode.put("0xe603", "0");
    analyCode.put("0xe604", "3");
    analyCode.put("0xe605", "2");
    analyCode.put("0xe606", "4");
    analyCode.put("0xe607", "5");
    analyCode.put("0xe608", "6");
    analyCode.put("0xe609", "9");
    analyCode.put("0xe60a", "7");
    analyCode.put("0xe60b", "8");
    analyCode.put("0xe60c", "4");
    analyCode.put("0xe60d", "0");
    analyCode.put("0xe60e", "1");
    analyCode.put("0xe60f", "5");
    analyCode.put("0xe610", "2");
    analyCode.put("0xe611", "3");
    analyCode.put("0xe612", "6");
    analyCode.put("0xe613", "7");
    analyCode.put("0xe614", "8");
    analyCode.put("0xe615", "9");
    analyCode.put("0xe616", "0");
    analyCode.put("0xe617", "2");
    analyCode.put("0xe618", "1");
    analyCode.put("0xe619", "4");
    analyCode.put("0xe61a", "3");
    analyCode.put("0xe61b", "5");
    analyCode.put("0xe61c", "7");
    analyCode.put("0xe61d", "8");
    analyCode.put("0xe61e", "9");
    analyCode.put("0xe61f", "6");
}
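The table above is the composition of two lookups, which can be sketched as follows. The glyph-to-digit dictionary is inferred from the article's final table (for example, 0xe602 maps to glyph num_ and to digit 1, so num_ draws a 1). It is not read from the font itself. build_code_to_digit is a name made up for this example, and only a few cmap rows are included.

```python
# Glyph name -> digit, as one would read it off visually in the font editor.
# (Inferred here from the article's final table, not from the font itself.)
GLYPH_TO_DIGIT = {
    "num_": "1", "num_1": "0", "num_2": "3", "num_3": "2", "num_4": "4",
    "num_5": "5", "num_6": "6", "num_7": "9", "num_8": "7", "num_9": "8",
}

# Code point -> glyph name: a few rows from the cmap table.
CODE_TO_GLYPH = {
    "0x78": "x",
    "0xe602": "num_", "0xe603": "num_1", "0xe609": "num_7",
    "0xe60d": "num_1", "0xe61c": "num_8",
}

def build_code_to_digit(code_to_glyph, glyph_to_digit):
    """Compose the two lookups, skipping glyphs that are not digits (e.g. 'x')."""
    return {
        code: glyph_to_digit[glyph]
        for code, glyph in code_to_glyph.items()
        if glyph in glyph_to_digit
    }

code_to_digit = build_code_to_digit(CODE_TO_GLYPH, GLYPH_TO_DIGIT)
print(code_to_digit)
# {'0xe602': '1', '0xe603': '0', '0xe609': '9', '0xe60d': '0', '0xe61c': '7'}
```

Several code points share a glyph (0xe603 and 0xe60d both point at num_1), which is why the final table has 30 entries for only 10 digits.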

Worked example

Code:
/**
 * Mapping from code point to digit
 */
private static Map<String, String> analyCode = new HashMap<>(0);
static {
    analyCode.put("0xe602", "1");
    analyCode.put("0xe603", "0");
    analyCode.put("0xe604", "3");
    analyCode.put("0xe605", "2");
    analyCode.put("0xe606", "4");
    analyCode.put("0xe607", "5");
    analyCode.put("0xe608", "6");
    analyCode.put("0xe609", "9");
    analyCode.put("0xe60a", "7");
    analyCode.put("0xe60b", "8");
    analyCode.put("0xe60c", "4");
    analyCode.put("0xe60d", "0");
    analyCode.put("0xe60e", "1");
    analyCode.put("0xe60f", "5");
    analyCode.put("0xe610", "2");
    analyCode.put("0xe611", "3");
    analyCode.put("0xe612", "6");
    analyCode.put("0xe613", "7");
    analyCode.put("0xe614", "8");
    analyCode.put("0xe615", "9");
    analyCode.put("0xe616", "0");
    analyCode.put("0xe617", "2");
    analyCode.put("0xe618", "1");
    analyCode.put("0xe619", "4");
    analyCode.put("0xe61a", "3");
    analyCode.put("0xe61b", "5");
    analyCode.put("0xe61c", "7");
    analyCode.put("0xe61d", "8");
    analyCode.put("0xe61e", "9");
    analyCode.put("0xe61f", "6");
}
 
/**
 * Regular expressions
 */
private static final Pattern PATTERN_NICKNAME = Pattern.compile("<p class=\"nickname\">(.*?)<");
private static final Pattern PATTERN_SIGNATURE = Pattern.compile("<p class=\"signature\">([\\S\\s]*?)<");
 
private static final Pattern PATTERN_ID = Pattern.compile("<p class=\"shortid\">抖音ID:(.*?)<");
private static final Pattern PATTERN_ID_BLOCK = Pattern.compile("<p class=\"shortid\">([\\S\\s]*?)</p>");
private static final Pattern PATTERN_ICON_FONT = Pattern.compile("<i class=\"icon iconfont \"> (.*?) </i>");
 
private static final Pattern PATTERN_FOCUS_BLOCK = Pattern.compile("<span class=\"focus block\"><span class=\"num\">(.*?)</span>");
private static final Pattern PATTERN_FANS_BLOCK = Pattern.compile("<span class=\"follower block\"><span class=\"num\">(.*?)</span>");
private static final Pattern PATTERN_LIKE_NUM_BLOCK = Pattern.compile("<span class=\"liked-num block\"><span class=\"num\">(.*?)</span>");
private static final Pattern PATTERN_FOLLOW_NUM = Pattern.compile("<i class=\"icon iconfont follow-num\"> (.*?) </i>|\\.|w ");
 
private static final Pattern PATTERN_POST_BLOCK = Pattern.compile("<div class=\"user-tab active tab get-list\" data-type=\"post\">作品<span class=\"num\">(.*?)</span>");
private static final Pattern PATTERN_LIKE_BLOCK = Pattern.compile("<div class=\"like-tab tab get-list\" data-type=\"like\">喜欢<span class=\"num\">(.*?)</span>");
private static final Pattern PATTERN_TAB_NUM = Pattern.compile("<i class=\"icon iconfont tab-num\"> (.*?) </i>");
 
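/**
 * Extract a basic profile field with a regular expression.
 *
 * @param homepageHtml HTML source of the page
 * @param pattern      regular expression for the field
 * @return the trimmed first capture group, or an empty string if nothing matches
 * @author LKET
 * @date 2019/11/21 3:51 PM
 */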
private static String getUserInfo(String homepageHtml, Pattern pattern) {
    String info = "";
    Matcher matcher = pattern.matcher(homepageHtml);
    if (matcher.find()) {
        info = matcher.group(1).trim();
    }
    return info;
}
 
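/**
 * Extract an obfuscated number and convert it to its real value.
 *
 * @param homepageHtml HTML source of the page
 * @param blockPattern regular expression for the enclosing element
 * @param numPattern   regular expression for the individual digit elements
 * @return java.lang.StringBuilder
 * @author LKET
 * @date 2019/11/21 3:51 PM
 */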
private static StringBuilder getTrueNum(String homepageHtml, Pattern blockPattern, Pattern numPattern) {
    StringBuilder trueNum = new StringBuilder();
    Matcher matcherBlock = blockPattern.matcher(homepageHtml);
    if (matcherBlock.find()) {
        Matcher matcherNumList = numPattern.matcher(matcherBlock.group(1));
        while (matcherNumList.find()) {
            // If the match contains an <i> tag, convert its entity to a digit; otherwise it is a literal "." or "w"
            if (matcherNumList.group(0).contains("<i")) {
                String code = matcherNumList.group(1).replace("&#", "0").replace(";", "");
                String number = analyCode.get(code);
                trueNum.append(number);
            } else {
                trueNum.append(matcherNumList.group(0));
            }
        }
    }
    return trueNum;
}
 
/**
 * Fetch basic data for a Douyin user.
 */
public static void main(String[] args) {
    try {
        // Request the user profile page and get its HTML (doGet is an HTTP GET helper that is not shown in this article)
        String homepageHtml = doGet("https://www.iesdouyin.com/share/user/76725372134?utm_campaign=client_share&app=aweme&utm_medium=ios&tt_from=copy&utm_source=copy");
        System.out.println(homepageHtml);
        // Print the data (a Douyin ID is either alphabetic or numeric; the two cases are handled separately)
        String nickname = getUserInfo(homepageHtml, PATTERN_NICKNAME);
        System.out.println("Nickname: " + nickname);
        String id = getUserInfo(homepageHtml, PATTERN_ID);
        if (id.isEmpty()) {
            id = getTrueNum(homepageHtml, PATTERN_ID_BLOCK, PATTERN_ICON_FONT).toString();
        }
        System.out.println("Douyin ID: " + id);
        String signature = getUserInfo(homepageHtml, PATTERN_SIGNATURE);
        System.out.println("Signature: " + signature);
        StringBuilder focusNum = getTrueNum(homepageHtml, PATTERN_FOCUS_BLOCK, PATTERN_FOLLOW_NUM);
        System.out.println("Following: " + focusNum);
        StringBuilder fansNum = getTrueNum(homepageHtml, PATTERN_FANS_BLOCK, PATTERN_FOLLOW_NUM);
        System.out.println("Followers: " + fansNum);
        StringBuilder likeNumNum = getTrueNum(homepageHtml, PATTERN_LIKE_NUM_BLOCK, PATTERN_FOLLOW_NUM);
        System.out.println("Likes received: " + likeNumNum);
        StringBuilder postNum = getTrueNum(homepageHtml, PATTERN_POST_BLOCK, PATTERN_TAB_NUM);
        System.out.println("Posts: " + postNum);
        StringBuilder likeNum = getTrueNum(homepageHtml, PATTERN_LIKE_BLOCK, PATTERN_TAB_NUM);
        System.out.println("Liked: " + likeNum);
    } catch (Exception e) {
        System.out.println(e);
    }
}

Output for two different users (w means 万, i.e. units of 10,000):

Nickname: 罗志祥
Douyin ID: ShowLoGNF
Signature: 抖音沙雕谁最强 就是本人罗志祥
Following: 1
Followers: 3743.3w
Likes received: 31061.5w
Posts: 239
Liked: 656

Nickname: 仙女酵母
Douyin ID: 1602606308
Signature: 三界接线员等你来打call
wb:@仙女酵母
个人wx:xnjmxiu8034
Following: 22
Followers: 1416.6w
Likes received: 15264.5w
Posts: 227
Liked: 154
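For readers who prefer Python, the decoding step performed by getTrueNum can be sketched as below. This is a loose re-expression under assumptions, not a port. The HTML sample is hand-written to match the shape that PATTERN_FOLLOW_NUM expects, the table holds only the codes the sample needs, and decode_count is a name made up for this example.

```python
import re

# Subset of the article's code -> digit table, enough for the sample below.
CODE_TO_DIGIT = {"0xe602": "1", "0xe603": "0", "0xe604": "3", "0xe606": "4"}

# Either an icon-font <i> element (group 1 = hex part of the entity)
# or a literal "." / "w" that the page leaves unobfuscated.
TOKEN = re.compile(
    r'<i class="icon iconfont follow-num">\s*&#x([0-9a-fA-F]+);\s*</i>|[.w]'
)

def decode_count(fragment):
    """Turn an obfuscated count fragment into a readable string like '14.3w'."""
    out = []
    for m in TOKEN.finditer(fragment):
        if m.group(1) is not None:
            code = "0x" + m.group(1).lower()          # "e602" -> "0xe602"
            out.append(CODE_TO_DIGIT.get(code, "?"))  # "?" marks an unknown code
        else:
            out.append(m.group(0))                    # keep "." or "w" as-is
    return "".join(out)

# Hand-written sample in the assumed page format.
sample = (
    '<i class="icon iconfont follow-num"> &#xe602; </i>'
    '<i class="icon iconfont follow-num"> &#xe606; </i>'
    '.'
    '<i class="icon iconfont follow-num"> &#xe604; </i>'
    'w'
)
print(decode_count(sample))   # 14.3w
```

One deliberate difference from the Java version: an unmapped code becomes "?" here, whereas the Java code appends the text "null" because Map.get returns null for a missing key. Either way, an unknown code is a sign that the site has shipped a new font and the table needs rebuilding.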

Reposted from blog.csdn.net/LLKET/article/details/104940934