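Designing a Custom Dictionary for TF-IDF and Computing IDF

What is TF-IDF?

If you've found your way to this personal log, you probably already understand TF-IDF. Here's a quick recap anyway, to pad the word count.
tf: how often a word appears in the current text. The more often it appears in that text, the more important it is.
idf: derived from how many of the n texts contain the word. It is the inverse of that count, so the fewer texts a word appears in, the more distinctive and important it is.
tf*idf gives the weight of a word, which seems fairly sensible to me, haha.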
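
To make the recap concrete, here is a minimal sketch of the weighting on a toy corpus. It assumes the common form idf = log(n / df) and raw counts for tf; the function name tf_idf and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document for already-tokenized docs."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each word in this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

docs = [["apple", "apple", "pie"], ["apple", "juice"], ["orange", "juice"]]
weights = tf_idf(docs)
```

In the first document, "pie" appears once but in only one of the three texts, so it ends up with a higher weight than "apple", which appears twice but is also in another text.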

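Flowchart of the approach

I haven't fully worked this out yet, so I'm writing my ideas down for everyone to critique. Below is a simple flowchart of the text processing.

Text preprocessing in detail

This part preprocesses the Chinese text: removing spaces, word segmentation, removing stop words, keeping custom keywords, and removing punctuation. The experts out there probably also handle spelling correction, synonyms, and so on.

Removing spaces

Remove the spaces in the text.
input: contents, a list of text strings
output: the list of texts with spaces removed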

def remove_blank_space(contents):
    # Strip ASCII spaces from every text in the list
    return [s.replace(' ', '') for s in contents]
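
The function above only removes the ASCII space. If the texts may also contain tabs, newlines, or full-width spaces, a regex-based variant is one option. This is a sketch, and remove_all_whitespace is a name made up for illustration, not part of the original pipeline.

```python
import re

def remove_all_whitespace(contents):
    # For str patterns, \s matches spaces, tabs, newlines and Unicode
    # whitespace such as the full-width space U+3000
    return [re.sub(r"\s+", "", s) for s in contents]

cleaned = remove_all_whitespace(["你 好\t世界", "全角\u3000空格\n"])
```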

Word segmentation

Segment the text with jieba. The input here is the text list produced by the previous step.

import jieba

def cut_words(contents):
    # jieba.lcut already returns a list of tokens for each text
    return [jieba.lcut(s) for s in contents]

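Removing stop words

For stop words I used Baidu's stop word list plus some of my own. I also added the notion of keywords, because I was worried about filtering out words I actually need.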

def drop_stopwords(contents):
    # Load the stop words, my own stop words, and the keywords to keep
    # (one word per line; sets make the membership checks fast)
    with open('./data/stop_word_cn.txt', 'r', encoding='utf-8') as f:
        stop_words = set(f.read().split('\n'))
    with open('./data/stop_me.txt', 'r', encoding='utf-8') as f:
        stop_me_words = set(f.read().split('\n'))
    with open('./data/key_words.txt', 'r', encoding='utf-8') as f:
        key_words = set(f.read().split('\n'))
    # Result to return
    contents_new = []
    # Go through the data
    for line in contents:
        line_clean = []
        for word in line:
            # Drop stop words unless they are protected keywords
            if (word in stop_words or word in stop_me_words) and word not in key_words:
                continue
            # Keep only Chinese tokens, which also filters out punctuation
            if is_chinese_words(word):
                line_clean.append(word)
        contents_new.append(line_clean)
    return contents_new

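The is_chinese_words function used above checks whether a token is Chinese: tokens that are Chinese are kept, the rest are dropped. Its main job is to filter out punctuation.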
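
The post does not show is_chinese_words itself, so the following is only a guess at what such a check could look like. It assumes "Chinese" means every character falls in the CJK Unified Ideographs block (U+4E00 to U+9FFF), which rejects both ASCII and full-width punctuation.

```python
def is_chinese_words(word):
    # Assumption: a token counts as Chinese only if it is non-empty and
    # every character lies in the range U+4E00 to U+9FFF
    return bool(word) and all('\u4e00' <= ch <= '\u9fff' for ch in word)

checks = [is_chinese_words(w) for w in ["词典", "，", "tf", ""]]
```

A stricter or looser rule (for example, allowing mixed tokens such as "tf值") would change which words survive, so the range check is worth adapting to the data.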

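Building the dictionary and computing IDF

How I build the dictionary

I looked through some existing write-ups on building a dictionary, but none of them felt right for me. So I went with the crude method below.

Collect all the words from the segmentation results.
Then compute their IDF, export it, and filter the list by hand based on the IDF values.

Computing the IDF and the word list

Collect the words that make up the dictionary and count the number of texts each word appears in. The input is the list of preprocessed texts from above; the output is the dictionary and the per-word text counts.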

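def deal_contents(contents):
    # Number of texts each word appears in (used for the idf)
    word_counts = {}
    # The dictionary, in order of first appearance
    words = []
    for content in contents:
        # Words already counted for the current text
        idf_flag = set()
        for word in content:
            if word in idf_flag:
                continue
            idf_flag.add(word)
            if word not in word_counts:
                # First time the word appears anywhere
                words.append(word)
                word_counts[word] = 1
            else:
                # First time the word appears in this text
                word_counts[word] += 1
    return words, word_counts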
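
As a cross-check on the counts, the same per-text counting can be sketched with collections.Counter: turning each text into a set guarantees a word is counted at most once per text. The name count_document_frequency is made up for illustration.

```python
from collections import Counter

def count_document_frequency(contents):
    word_counts = Counter()
    for content in contents:
        # set() makes each word count at most once per text
        word_counts.update(set(content))
    return word_counts

df = count_document_frequency([["a", "b", "a"], ["a", "c"], ["a", "a"]])
```

Here "a" occurs five times in total but in only three texts, so its count is 3. No count can ever exceed the number of texts, which is an easy sanity check to run on real data.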

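After the processing above, one more pass with the formula idf = log(size / count) gives the idf values. Inputs: size, the total number of texts; words, the dictionary list; word_counts, a dict mapping each word to the number of texts it appears in. The output is a pandas DataFrame, which is handy for my later data processing and easy to export to CSV.

import math
import pandas as pd

def calc_idf(size, words, word_counts):
    # idf = log(total number of texts / number of texts containing the word)
    idf = [[word, math.log(size / word_counts[word])] for word in words]
    return pd.DataFrame(idf, columns=['word', 'idf'])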

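Once I have the results, I plan to go through the words by hand in descending order of idf. It's time-consuming, but I think it works fairly well.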
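
For the manual pass, it helps to have the list already sorted before opening it. Below is a small sketch using only the standard library; export_for_review, the sample rows, and the in-memory buffer are all illustrative stand-ins.

```python
import csv
import io

def export_for_review(idf_rows, out):
    # Sort by idf, largest first; ties are broken alphabetically so the
    # output is stable between runs
    ranked = sorted(idf_rows, key=lambda row: (-row[1], row[0]))
    writer = csv.writer(out)
    writer.writerow(["word", "idf"])
    writer.writerows(ranked)
    return ranked

buffer = io.StringIO()  # stand-in for a real file opened with newline=""
ranked = export_for_review([["的", 1.0], ["词典", 3.0], ["分词", 1.5]], buffer)
```

With the pandas DataFrame, sort_values('idf', ascending=False) followed by to_csv should give the same ordering.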
