Innovation Practicum (13): Implementing a Text Similarity Algorithm

Preface

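The recommender system has to suggest similar posts beneath every blog post, so it needs a way to measure text similarity, and I implemented that computation up front. I chose cosine similarity because it compares the direction of the two texts' vectors rather than their distance or length. The score depends only on the angle between the vectors, so it is unaffected by rotating the coordinate axes or by scaling a vector up or down.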
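For reference, this is the standard definition of cosine similarity for two term-frequency vectors a and b over the same n words (the symbols a, b and n are introduced here only for illustration):

```latex
\cos\theta
  = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}}\;\sqrt{\sum_{i=1}^{n} b_i^{2}}}
```

Multiplying a by any k > 0 multiplies both the numerator and the denominator by k, so the value does not change. That is why a long text and a short text with the same word proportions still score 1. Because word frequencies are never negative, the score lies between 0 and 1.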

Algorithm Steps

(1) Segment both texts into words.
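(2) List every word that appears in either segmented text.

(3) For each text, count how often each word in that combined list occurs.

(4) Using the bag-of-words model, write out the term-frequency vector of each text.

(5) Compute the cosine similarity of the two vectors.

(6) Result:
The larger the cosine similarity, the more similar the two texts.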
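The steps above can be sketched end to end on a tiny input. This is only an illustrative sketch, not the code used in the project: the class name BagOfWordsCosineSketch and its methods are made up for the example, and step (1) is assumed to be done already, so the texts arrive as lists of tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, written only to illustrate steps (2) to (5).
class BagOfWordsCosineSketch {

    // Steps (3) and (4): count how often each vocabulary word occurs in the tokens.
    static int[] toFrequencyVector(List<String> tokens, List<String> vocabulary) {
        int[] vector = new int[vocabulary.size()];
        for (String token : tokens) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                vector[index]++;
            }
        }
        return vector;
    }

    static double cosine(List<String> tokens1, List<String> tokens2) {
        // Step (2): union of all words from both texts, in first-seen order.
        Set<String> union = new LinkedHashSet<>(tokens1);
        union.addAll(tokens2);
        List<String> vocabulary = new ArrayList<>(union);

        int[] a = toFrequencyVector(tokens1, vocabulary);
        int[] b = toFrequencyVector(tokens2, vocabulary);

        // Step (5): dot product divided by the product of the vector lengths.
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // an empty text has no direction; treated here as dissimilar
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Vocabulary: [java, blog, tutorial]; vectors: [1, 1, 0] and [1, 0, 1].
        List<String> text1 = Arrays.asList("java", "blog");
        List<String> text2 = Arrays.asList("java", "tutorial");
        System.out.println(cosine(text1, text2)); // roughly 0.5
    }
}
```

The two sample texts share one of their two words, so the dot product is 1 and each vector has length sqrt(2), which gives a similarity of about 0.5.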

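Question:
Why not use TF-IDF weights as the vectors? Computing IDF requires document-frequency statistics over every blog post in the system, which in practice means building an inverted index first. Plain term-frequency vectors can be built from just the two texts being compared.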
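To make that dependency concrete, here is a small sketch of how IDF could be computed. It uses one common variant, ln(N / df), and a toy in-memory corpus; the class IdfSketch and its methods are hypothetical and are not part of the project. The point is that documentFrequency has to scan every document, while term frequency only needs the text at hand.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of why IDF needs the whole corpus.
class IdfSketch {

    // Document frequency: the number of documents in which each word occurs at least once.
    static Map<String, Integer> documentFrequency(List<List<String>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> document : corpus) {
            // The HashSet makes sure a word is counted once per document.
            for (String word : new HashSet<>(document)) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    // One common IDF variant: ln(N / df). Assumes the word occurs somewhere in the corpus.
    static double idf(String word, List<List<String>> corpus) {
        int df = documentFrequency(corpus).get(word);
        return Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("java", "blog"),
                Arrays.asList("java", "tutorial"),
                Arrays.asList("java", "cosine", "similarity"));
        System.out.println(idf("java", corpus));   // in all 3 documents, so ln(1) = 0
        System.out.println(idf("cosine", corpus)); // in 1 of 3 documents, so ln(3)
    }
}
```

Every new blog post changes N and possibly df, so the IDF values would have to be kept up to date for the whole system, which is the cost the plain frequency vectors avoid.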

Implementation

(1) Segment both texts; each resulting token is stored in a Word object
For segmentation I used HanLP, an open-source library on GitHub.

https://github.com/hankcs/HanLP

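import java.util.Objects;

public class Word implements Comparable<Word> {
    // The word itself
    private String name;
    // Part-of-speech tag
    private String pos;
    // Weight used when building the word vector (null until it has been tagged)
    private Float weight;
    // Term frequency
    private int frequency;

    public Word(String name) {
        this.name = name;
    }

    public Word(String name, String pos) {
        this.name = name;
        this.pos = pos;
    }

    // Accessors used by the similarity code
    public String getName() { return name; }
    public Float getWeight() { return weight; }
    public void setWeight(Float weight) { this.weight = weight; }

    // Assumed here: two words are equal when their names match, so that a
    // HashSet<Word> keeps a single entry per distinct word.
    @Override
    public boolean equals(Object o) {
        return o instanceof Word && Objects.equals(name, ((Word) o).name);
    }

    @Override
    public int hashCode() { return Objects.hashCode(name); }

    @Override
    public int compareTo(Word other) { return name.compareTo(other.name); }
}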


(2) List all the words from the two segmented texts

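    // Needs java.util.HashMap, java.util.List, java.util.Map and
    // java.util.concurrent.atomic.AtomicInteger.

    /**
     * Tags every word with its frequency in its own text, stored as the weight.
     * Does nothing if the words already carry weights.
     *
     * @param words1 words of the first text
     * @param words2 words of the second text
     */
    protected static void taggingWeightByFrequency(List<Word> words1, List<Word> words2) {
        // Check for emptiness first: get(0) on an empty list would throw
        boolean tagged1 = !words1.isEmpty() && words1.get(0).getWeight() != null;
        boolean tagged2 = !words2.isEmpty() && words2.get(0).getWeight() != null;
        if (tagged1 || tagged2) {
            return;
        }
        // AtomicInteger is Java's atomic counter class
        Map<String, AtomicInteger> frequency1 = getFrequency(words1);
        Map<String, AtomicInteger> frequency2 = getFrequency(words2);
        // Tag the weights
        words1.parallelStream().forEach(word -> word.setWeight(frequency1.get(word.getName()).floatValue()));
        words2.parallelStream().forEach(word -> word.setWeight(frequency2.get(word.getName()).floatValue()));
    }

    /**
     * Counts term frequencies.
     *
     * @param words list of words
     * @return map from word name to its number of occurrences
     */
    private static Map<String, AtomicInteger> getFrequency(List<Word> words) {
        Map<String, AtomicInteger> freq = new HashMap<>();
        for (Word word : words) {
            // If the key is absent, insert a new counter starting at 0;
            // in either case, increment the counter for this word
            freq.computeIfAbsent(word.getName(), k -> new AtomicInteger()).incrementAndGet();
        }
        return freq;
    }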
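Since getFrequency fills its map in an ordinary single-threaded loop, the same counting can also be written without AtomicInteger. The sketch below is one possible alternative on plain strings, not the project's code; the class FrequencySketch and the method countWords are names invented for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical alternative to getFrequency, using Map.merge on plain tokens.
class FrequencySketch {

    static Map<String, Integer> countWords(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : tokens) {
            // Insert 1 for a new token, otherwise add 1 to the existing count.
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = countWords(Arrays.asList("blog", "java", "blog"));
        System.out.println(freq.get("blog")); // 2
        System.out.println(freq.get("java")); // 1
    }
}
```

The atomic counters only matter if several threads update the same map entry, which the loop in getFrequency does not do.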

(3) For each text, compute the term frequencies over all the words from the two segmented texts

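    /**
     * Builds a lookup map from word name to weight for fast searching.
     *
     * @param words list of words that have already been tagged with weights
     * @return map from word name to weight (empty if words is null)
     */
    protected static Map<String, Float> getFastSearchMap(List<Word> words) {
        // ConcurrentHashMap is thread-safe, which the parallel stream below requires
        Map<String, Float> weightMap = new ConcurrentHashMap<>();
        if (words == null) {
            return weightMap;
        }
        words.parallelStream().forEach(i -> {
            if (i.getWeight() != null) {
                weightMap.put(i.getName(), i.getWeight());
            } else {
                // LOGGER is assumed to be a logger field of the enclosing class
                LOGGER.error("no word weight info:" + i.getName());
            }
        });
        return weightMap;
    }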

(4) Using the bag-of-words model, build the term-frequency vectors and compute their cosine similarity

    // Needs java.math.BigDecimal, java.math.RoundingMode, java.util.HashSet,
    // java.util.List, java.util.Map and java.util.Set.
    // AtomicFloat is not a JDK class; it is assumed to be a helper available in the project.
    public static double getCosineSimilarity(List<Word> words1, List<Word> words2) {
        // Tag each word's weight with its term frequency
        taggingWeightByFrequency(words1, words2);
        // Weight maps: look up a weight by word name
        Map<String, Float> weightMap1 = getFastSearchMap(words1);
        Map<String, Float> weightMap2 = getFastSearchMap(words2);
        // Union of both word lists. This relies on Word.equals/hashCode comparing
        // by name, so that each distinct word is visited exactly once below.
        Set<Word> words = new HashSet<>();
        words.addAll(words1);
        words.addAll(words2);
        // Accumulators that stay correct when updated from the parallel stream
        // a.b
        AtomicFloat ab = new AtomicFloat();
        // |a| squared
        AtomicFloat aa = new AtomicFloat();
        // |b| squared
        AtomicFloat bb = new AtomicFloat();
        // Accumulate one dimension per distinct word
        words.parallelStream().forEach(word -> {
            Float x1 = weightMap1.get(word.getName());
            Float x2 = weightMap2.get(word.getName());
            if (x1 != null && x2 != null) {
                // x1 * x2 contributes to the dot product
                ab.addAndGet(x1 * x2);
            }
            if (x1 != null) {
                // (x1)^2 contributes to |a| squared
                aa.addAndGet(x1 * x1);
            }
            if (x2 != null) {
                // (x2)^2 contributes to |b| squared
                bb.addAndGet(x2 * x2);
            }
        });
        // |a|
        double aaa = Math.sqrt(aa.doubleValue());
        // |b|
        double bbb = Math.sqrt(bb.doubleValue());
        // An empty text gives a zero vector; return 0 instead of dividing by zero
        if (aaa == 0 || bbb == 0) {
            return 0;
        }
        // BigDecimal is used to round the quotient to 9 decimal places
        BigDecimal aabb = BigDecimal.valueOf(aaa).multiply(BigDecimal.valueOf(bbb));
        return BigDecimal.valueOf(ab.get()).divide(aabb, 9, RoundingMode.HALF_UP).doubleValue();
    }
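For readers who want to try the idea without the Word and AtomicFloat classes, the following is a rough JDK-only sketch of the same computation. It is not the project's code: it works on plain token lists, swaps in java.util.concurrent.atomic.DoubleAdder as the thread-safe accumulator, and the class ParallelCosineSketch and its method names are invented for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.DoubleAdder;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical JDK-only variant of the cosine computation.
class ParallelCosineSketch {

    // Term frequencies of one text: word -> number of occurrences.
    static Map<String, Long> frequencies(List<String> tokens) {
        return tokens.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    static double cosine(List<String> tokens1, List<String> tokens2) {
        Map<String, Long> f1 = frequencies(tokens1);
        Map<String, Long> f2 = frequencies(tokens2);
        Set<String> vocabulary = new HashSet<>(f1.keySet());
        vocabulary.addAll(f2.keySet());

        // DoubleAdder can be updated safely from several threads at once.
        DoubleAdder ab = new DoubleAdder(); // a.b
        DoubleAdder aa = new DoubleAdder(); // |a| squared
        DoubleAdder bb = new DoubleAdder(); // |b| squared
        vocabulary.parallelStream().forEach(word -> {
            long x1 = f1.getOrDefault(word, 0L); // a missing word counts as frequency 0
            long x2 = f2.getOrDefault(word, 0L);
            ab.add(x1 * x2);
            aa.add(x1 * x1);
            bb.add(x2 * x2);
        });
        double denominator = Math.sqrt(aa.sum()) * Math.sqrt(bb.sum());
        return denominator == 0 ? 0.0 : ab.sum() / denominator;
    }

    public static void main(String[] args) {
        // Frequency vectors over [java, blog]: [1, 2] and [2, 1].
        List<String> text1 = Arrays.asList("java", "blog", "blog");
        List<String> text2 = Arrays.asList("java", "java", "blog");
        System.out.println(cosine(text1, text2)); // roughly 0.8
    }
}
```

For the sample input the dot product is 4 and both vectors have squared length 5, so the similarity is about 4/5. With vocabularies this small the parallel stream brings no benefit; it is kept only to mirror the structure of getCosineSimilarity.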

Results

[Image: computation result]