A Guide to Parsing allitems.csv, the CVE Vulnerability Database File

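Background

CVE stands for "Common Vulnerabilities and Exposures". The official site periodically discloses the latest vulnerability information. Security standards in the situational-awareness industry explicitly require products to support data from the mainstream vulnerability databases, and http://cve.mitre.org/ is one of them.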

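This article walks through parsing the CVE vulnerability file allitems.csv in detail. The difficulty is that the file cannot be parsed by simply splitting on commas: some fields are quoted and contain commas themselves, and those irregular records have to be parsed separately with regular expressions.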

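allitems.csv format

After downloading the file from the official site, you can see that its first ten lines are file description and do not need to be parsed: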

"CVE Version 20061101",,,,,
"Date: 20200814",,,,,
"Name","Status","Description","References","Phase","Votes","Comments"
"Candidates must be reviewed and accepted by the CVE Editorial Board",,,,,,
"before they can be added to the official CVE list.  Therefore, these",,,,,,
"candidates may be modified or even rejected in the future.  They are",,,,,,
"provided for use by individuals who have a need for an early",,,,,,
"numbering scheme for items that have not been fully reviewed by",,,,,,
"the Editorial Board.",,,,,,

The third line is the header row, with 7 columns in total:

"Name","Status","Description","References","Phase","Votes","Comments"

Parsing the file means taking each data row and mapping its values onto these header columns.
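Skipping the descriptive preamble can be sketched as follows. This is a minimal illustration that assumes the ten-line header described above; the class name `HeaderSkipper` and the method `readDataLines` are hypothetical names, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class HeaderSkipper {
    // Assumption taken from the text: the first ten lines are file description.
    static final int HEADER_LINES = 10;

    /** Returns every non-empty line that follows the header block. */
    public static List<String> readDataLines(Reader source) {
        BufferedReader reader = new BufferedReader(source);
        return reader.lines()
                .skip(HEADER_LINES)                 // drop the preamble
                .filter(line -> !line.isEmpty())    // ignore blank lines
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Placeholder input: ten dummy header lines followed by one made-up record.
        StringBuilder sample = new StringBuilder();
        for (int i = 1; i <= HEADER_LINES; i++) {
            sample.append("header line ").append(i).append('\n');
        }
        sample.append("CVE-0000-0001,Entry\n");
        System.out.println(readDataLines(new StringReader(sample.toString())));
    }
}
```

For the real file, a reader obtained from `java.nio.file.Files.newBufferedReader` could be passed in instead of the `StringReader`.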

Handling regular records

Split the line on commas. If the result has exactly 7 columns, strip the double quotes and the values are ready to use.
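The regular case can be sketched like this. The class `RegularRecord`, the method `splitRegular`, and the sample record are made up for illustration. One detail worth noting: `String.split(",")` silently drops trailing empty strings, so the sketch passes a limit of `-1` to keep empty columns at the end of a line.

```java
import java.util.Arrays;

class RegularRecord {
    static final int COLUMN_COUNT = 7;

    /** Returns the 7 unquoted values, or null when a plain split does not yield 7 columns. */
    public static String[] splitRegular(String line) {
        // limit -1 keeps trailing empty columns, which split(",") would discard
        String[] parts = line.split(",", -1);
        if (parts.length != COLUMN_COUNT) {
            return null; // irregular record: needs the regex-based handling
        }
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace("\"", ""); // strip the double quotes
        }
        return parts;
    }

    public static void main(String[] args) {
        // Made-up sample record with no commas inside the quoted fields
        String line = "CVE-0000-0001,Candidate,\"Sample description\",\"REF:one\","
                + "Assigned (20200101),\"None\",\"\"";
        System.out.println(Arrays.toString(splitRegular(line)));
    }
}
```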

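Handling double-quoted text

Parsing starts at line 11. Splitting directly on commas yields more than 7 columns for many rows. Why?

Because columns 3, 4, 6, and 7 are wrapped in double quotes, and some of them contain the delimiter "," inside the quoted text, so these rows need a second parsing pass.

For example, the 7 columns of this irregular record actually break down as follows: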

CVE-1999-0008,
Entry,
"Buffer overflow in NIS+, in Sun's rpc.nisd program.",
"CERT:CA-98.06.nisd   |   ISS:June10,1998   |   SUN:00170   |   URL:http://sunsolve.sun.com/pub-cgi/retrieve.pl?doctype=coll&doc=secbull/170   |   XF:nisd-bo-check",
,
"",
""

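Because the third column contains a comma after NIS+, a direct split scrambles the data, so a second pass is required. The approach is to use a regular expression to capture the characters between ," and ", which yields the actual text content.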

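import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

    // Regex for double-quoted content: the text between a leading ," and a trailing ",
    // Note: the lookahead requires a comma after the closing quote, so a quoted field
    // at the very end of a line is not matched by this pattern.
    private static final String regEx_DoubleQuote = "(?<=,\").*?(?=\",)";

    /**
     * Extracts the parts of a CVE line that a plain comma split cannot handle,
     * i.e. fields of the form ,"...", by matching them with a regular expression.
     * @param content one line of allitems.csv
     * @return the quoted field contents, in order of appearance
     */
    public static List<String> findByDoubleQuote(String content) {
        Pattern pattern = Pattern.compile(regEx_DoubleQuote);
        Matcher matcher = pattern.matcher(content);
        List<String> result = new ArrayList<>();
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }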
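To see what the pattern above captures, here is a small check against the CVE-1999-0008 record. It also tries a hypothetical variant, `QUOTED_OR_EOL`, which additionally accepts a closing quote at the end of the line; that variant is my assumption and not part of the original code. The names `DoubleQuoteDemo` and `findAll` are likewise illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubleQuoteDemo {
    // Same pattern as in the article: text between ," and ",
    static final Pattern QUOTED = Pattern.compile("(?<=,\").*?(?=\",)");
    // Hypothetical variant: the closing quote may also be followed by end of line
    static final Pattern QUOTED_OR_EOL = Pattern.compile("(?<=,\").*?(?=\"(?:,|$))");

    /** Collects every match of the given pattern, in order of appearance. */
    public static List<String> findAll(Pattern pattern, String content) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "CVE-1999-0008,Entry,"
                + "\"Buffer overflow in NIS+, in Sun's rpc.nisd program.\","
                + "\"CERT:CA-98.06.nisd   |   ISS:June10,1998   |   SUN:00170   |   "
                + "URL:http://sunsolve.sun.com/pub-cgi/retrieve.pl?doctype=coll&doc=secbull/170"
                + "   |   XF:nisd-bo-check\","
                + ",\"\",\"\"";
        // Description, References and the empty Votes field are found
        System.out.println(findAll(QUOTED, line).size());        // prints 3
        // The variant also picks up the empty Comments field at the end of the line
        System.out.println(findAll(QUOTED_OR_EOL, line).size()); // prints 4
    }
}
```

The difference only matters for the last column: with the original pattern the Comments value has to be recovered some other way, for example by taking whatever follows the last match.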

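Parsing the stage (Phase) column

The stage column (named Phase in the header row) comes right after References, and its value is not wrapped in double quotes: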

CVE-1999-0078,
Candidate,
"pcnfsd (aka rpc.pcnfsd) allows local users to change file permissions, or execute arbitrary commands through arguments in the RPC call.",
"CERT:CA-96.08.pcnfsd   |   XF:rpc-pcnfsd",
Modified (19990621),
"   ACCEPT(5) Collins, Frech, Landfield, Northcutt, Shostack  |     NOOP(1) Baker  |     RECAST(1) Christey",
"Christey> This candidate should be SPLIT, since there are two separate  |    software flaws.  One is a symlink race and the other is a  |    shell metacharacter problem.  |    Christey> The permissions part of this vulnerability appears to  |    overlap with CVE-1999-0353  |    Christey> SGI:20020802-01-I"

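So it has to be parsed separately. The idea is the same: the value sits between ", and ," and can be extracted with a regular expression. The parsing code is as follows: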

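    // Requires java.util.ArrayList, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern
    // Matches e.g. ",Assigned (20160407)," to extract the stage value
    private static final String regEx_CommaQuote = "(?<=\",).*?(?=,\")";

    /**
     * The stage column is a plain CSV column, but the columns before and after it
     * are wrapped in double quotes, for example:
     * ,"",Assigned (20160407),"None (candidate not yet proposed)",""
     * so it is parsed separately.
     * @param content one line of allitems.csv
     * @return the stage value, or an empty string if nothing matches
     */
    public static String findByCommaQuote(String content) {
        Pattern pattern = Pattern.compile(regEx_CommaQuote);
        Matcher matcher = pattern.matcher(content);
        List<String> result = new ArrayList<>();
        while (matcher.find()) {
            result.add(matcher.group());
        }
        if (result.isEmpty()) {
            return ""; // guard: result.get(0) would throw on an empty list
        }

        // Case: ,"",Assigned (20160407),"None (candidate not yet proposed)",""
        String stage = result.get(0);
        if (result.size() > 1) {
            // Two matches mean the preceding content is irregular;
            // take the second part: ,"xxxx","xxxx",Assigned (20160407),
            stage = result.get(1);
        }

        // Find the last comma; the stage value is whatever follows it
        int lastCommaIndex = stage.lastIndexOf(",");
        if (lastCommaIndex > -1) {
            return stage.substring(lastCommaIndex + 1);
        }

        return stage;
    }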
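A quick way to check the stage extraction is to run the same logic on the two samples quoted in this section. The sketch below is a compact restatement of that logic under hypothetical names (`StageDemo`, `findStage`), not the original method itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StageDemo {
    // Same pattern as in the article: text between ", and ,"
    static final Pattern COMMA_QUOTE = Pattern.compile("(?<=\",).*?(?=,\")");

    /** Returns the stage value, or an empty string when nothing matches. */
    public static String findStage(String content) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = COMMA_QUOTE.matcher(content);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        if (matches.isEmpty()) {
            return "";
        }
        // With two or more matches the second one is used, as in the article
        String stage = matches.size() > 1 ? matches.get(1) : matches.get(0);
        // Keep what follows the last comma (the whole string if there is none)
        return stage.substring(stage.lastIndexOf(',') + 1);
    }

    public static void main(String[] args) {
        String fragment = ",\"\",Assigned (20160407),\"None (candidate not yet proposed)\",\"\"";
        System.out.println(findStage(fragment)); // prints Assigned (20160407)

        String record = "CVE-1999-0078,Candidate,"
                + "\"pcnfsd (aka rpc.pcnfsd) allows local users to change file permissions, "
                + "or execute arbitrary commands through arguments in the RPC call.\","
                + "\"CERT:CA-96.08.pcnfsd   |   XF:rpc-pcnfsd\","
                + "Modified (19990621),"
                + "\"   ACCEPT(5) Collins, Frech, Landfield, Northcutt, Shostack  |     "
                + "NOOP(1) Baker  |     RECAST(1) Christey\","
                + "\"Christey> SGI:20020802-01-I\""; // Comments shortened for the demo
        System.out.println(findStage(record)); // prints Modified (19990621)
    }
}
```

In the second sample the regex match starts at the References field and runs up to the stage value, which is why the last-comma step is needed to isolate it.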

Complete workflow

The overall parsing flow is:

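Step 1: starting from line 11, split each line directly on commas.
    Check whether the split yields exactly 7 columns;
    if so, strip the double quotes from the text and parsing is done;
    otherwise continue.

Step 2: for lines with more than 7 columns, call findByDoubleQuote
    to extract the quoted data of columns 3, 4, 6, and 7.

Step 3: parse the stage column separately by calling findByCommaQuote.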
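The three steps can be sketched end to end as follows. All names here (`CveLineParser`, `parseLine`, `findAll`, `findStage`) are hypothetical, and one assumption is worth flagging: the quoted-field pattern in this sketch also accepts a closing quote at the end of the line, so that the seventh column is captured too. Treat it as an illustration of the flow, not as the original implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CveLineParser {
    static final int COLUMN_COUNT = 7;
    // Assumption: end-of-line alternative added so the last quoted column matches as well
    static final Pattern DOUBLE_QUOTE = Pattern.compile("(?<=,\").*?(?=\"(?:,|$))");
    static final Pattern COMMA_QUOTE = Pattern.compile("(?<=\",).*?(?=,\")");

    static List<String> findAll(Pattern pattern, String content) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    static String findStage(String content) {
        List<String> matches = findAll(COMMA_QUOTE, content);
        if (matches.isEmpty()) {
            return "";
        }
        String stage = matches.size() > 1 ? matches.get(1) : matches.get(0);
        return stage.substring(stage.lastIndexOf(',') + 1);
    }

    /** Returns the 7 column values, or null if the line is not a record. */
    public static String[] parseLine(String line) {
        // Step 1: plain split; limit -1 keeps trailing empty columns
        String[] parts = line.split(",", -1);
        if (parts.length < COLUMN_COUNT) {
            return null; // too few columns to be a record
        }
        if (parts.length == COLUMN_COUNT) {
            for (int i = 0; i < parts.length; i++) {
                parts[i] = parts[i].replace("\"", "");
            }
            return parts;
        }

        // Step 2: more than 7 pieces, so recover the quoted columns 3, 4, 6 and 7
        String[] record = new String[COLUMN_COUNT];
        Arrays.fill(record, "");
        record[0] = parts[0]; // Name
        record[1] = parts[1]; // Status
        List<String> quoted = findAll(DOUBLE_QUOTE, line);
        int[] quotedColumns = {2, 3, 5, 6}; // zero-based indexes of the quoted columns
        for (int i = 0; i < quotedColumns.length && i < quoted.size(); i++) {
            record[quotedColumns[i]] = quoted.get(i);
        }

        // Step 3: the unquoted stage (Phase) column
        record[4] = findStage(line);
        return record;
    }

    public static void main(String[] args) {
        String line = "CVE-1999-0008,Entry,"
                + "\"Buffer overflow in NIS+, in Sun's rpc.nisd program.\","
                + "\"CERT:CA-98.06.nisd   |   ISS:June10,1998   |   SUN:00170   |   "
                + "URL:http://sunsolve.sun.com/pub-cgi/retrieve.pl?doctype=coll&doc=secbull/170"
                + "   |   XF:nisd-bo-check\","
                + ",\"\",\"\"";
        System.out.println(Arrays.toString(parseLine(line)));
    }
}
```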

Takeaways

allitems.csv contains roughly 180,000 records in total, so it should be parsed with multiple threads; otherwise it is painfully slow!
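One low-effort way to spread the work across threads is a parallel stream, sketched below. This is my assumption about how the multithreading could be done, not the approach used in the original code; `ParallelParse` and `parseAll` are hypothetical names, and the per-line parser is passed in as a function so the sketch stays self-contained.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class ParallelParse {
    /**
     * Applies the given per-line parser to every line on the common fork-join pool.
     * The result list keeps the order of the input lines.
     */
    public static List<String[]> parseAll(List<String> lines,
                                          Function<String, String[]> parser) {
        return lines.parallelStream()
                .map(parser)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up input; a plain split stands in for the real line parser
        List<String> lines = Arrays.asList("CVE-0000-0001,Entry", "CVE-0000-0002,Candidate");
        List<String[]> records = parseAll(lines, line -> line.split(",", -1));
        for (String[] record : records) {
            System.out.println(Arrays.toString(record));
        }
    }
}
```

Each line is parsed independently, which is why the work splits cleanly across threads. Whether it actually pays off depends on how expensive the per-line parsing is, so it is worth timing both variants.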