Preprocessing Chinese text data (news comments) from MongoDB: code in Python and Java

Chinese text data preprocessing

Exporting MongoDB data to a txt file
Reading a file line by line into an array
Batch-editing text (adding a suffix, etc.)

Exporting MongoDB data to a txt file

#python
# coding=utf-8
from pymongo import MongoClient

# Connect to the MongoDB server
client = MongoClient('localhost', 27017)

# Select the database; "news" is the database name
db = client.news

# Select the collection (roughly what a relational database calls a table)
collection = db.news_comment2_600w

# Export the "text" field of every matching comment, one comment per line
with open("comment.txt", 'w', encoding='UTF-8') as f:
    query = {"url_hash": "aad54fce101da05eb1688ef0389a8e84559f4fdf"}
    for txt in collection.find(query, {"text": 1}):
        if 'text' in txt and txt['text']:
            f.write(txt['text'] + "\n")  # write one comment per line
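
The export step can be tried without a running MongoDB instance by separating the write logic from the query. The following is only a minimal sketch under that assumption: `write_texts` is a hypothetical helper, and the list of dicts stands in for the cursor that `collection.find(...)` would return.

```python
import os
import tempfile


def write_texts(docs, path):
    """Write the non-empty 'text' field of each document to path, one per line.

    `docs` is any iterable of dicts, for example a pymongo cursor.
    Returns the number of lines written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            text = doc.get("text")  # missing field -> None, which is skipped
            if text:
                f.write(text + "\n")
                count += 1
    return count


# Stand-in data (assumption); with pymongo this would be
# collection.find(query, {"text": 1}).
sample_docs = [{"text": "第一条评论"}, {"text": ""}, {"_id": 3}, {"text": "第二条评论"}]
sample_path = os.path.join(tempfile.mkdtemp(), "comment.txt")
written = write_texts(sample_docs, sample_path)
```

Documents with a missing or empty `text` field are skipped, which mirrors the `if 'text' in txt and txt['text']` check in the export code.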

Reading a file line by line into an array

#python
class StrToArr:
    @staticmethod
    def cn_str_to_arr(path, temp):
        with open(path, "r", encoding='GBK') as ad_file:
            for i in ad_file:
                temp.append(i)
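
Note that iterating over a file in Python keeps the trailing line break on every line, while Java's `BufferedReader.readLine()` drops it. If the two versions should return the same list, the line break can be stripped on the Python side. The following is a minimal sketch of that variant; `read_lines` is a hypothetical name, and the temporary file is sample data created only for the demonstration.

```python
import os
import tempfile


def read_lines(path, encoding="GBK"):
    """Return the lines of a text file as a list, without line terminators."""
    with open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\r\n") for line in f]


# Sample GBK-encoded file (assumption, for illustration only).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_gbk.txt")
with open(demo_path, "w", encoding="GBK") as f:
    f.write("第一行\n第二行\n")

lines = read_lines(demo_path)
```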

#java
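import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Reads a GBK-encoded file and returns its lines (without line terminators).
private List<String> segLines(File file) throws IOException {
    List<String> temp = new ArrayList<>();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader bf = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), "GBK"))) {
        String str;
        while ((str = bf.readLine()) != null) {
            temp.add(str);
        }
    }
    return temp;
}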

Deduplicating data, removing non-Chinese text (emoji, English, etc.), and filtering out useless information

#python
import re

# Lines containing any of these keywords are treated as noise and dropped
NOISE = re.compile(r'造谣|网易|没死|媒体|小编|小便')


def pretreatment():
    seen = set()   # cleaned lines already kept, used for deduplication
    temp = []      # cleaned lines in their original order
    with open("comment.txt", "r", encoding='UTF-8') as pre_file:
        for i in pre_file:
            if NOISE.search(i):
                continue
            # Keep only Chinese characters; this also removes spaces,
            # line breaks, emoji, Latin letters, and digits
            cleaned = re.sub(r'[^\u4E00-\u9FA5]+', '', i)
            # Deduplicate on the cleaned text and skip lines left empty
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                temp.append(cleaned)
    with open("uni_data.txt", "w", encoding='UTF-8') as uni_file:
        uni_file.writelines(line + "\n" for line in temp)


if __name__ == '__main__':
    pretreatment()
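
To see what the filtering, cleaning, and deduplication steps do to individual comments, here is a minimal in-memory sketch. `clean_comments` and the sample comments are illustrative assumptions; the keyword list is copied from the pattern above, and this version removes every character outside the range `\u4E00-\u9FA5`.

```python
import re

# Keywords copied from the filter pattern above.
NOISE = re.compile(r"造谣|网易|没死|媒体|小编|小便")
# Any run of characters outside the basic CJK ideograph range.
NOT_CHINESE = re.compile(r"[^\u4E00-\u9FA5]+")


def clean_comments(lines):
    """Drop noisy lines, strip non-Chinese characters, and deduplicate.

    The order of first appearance is preserved.
    """
    seen = set()
    result = []
    for line in lines:
        if NOISE.search(line):
            continue  # line mentions a filtered keyword
        cleaned = NOT_CHINESE.sub("", line)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


# Hypothetical sample comments, for illustration only.
sample = ["这个 新闻 不错😀\n", "这个新闻不错\n", "小编又在乱写\n", "good 赞！\n", "123\n"]
cleaned = clean_comments(sample)
```

Deduplicating after cleaning means that two comments differing only in spaces, emoji, or punctuation count as the same comment.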

Batch-editing text (adding a suffix, etc.)

#python
def add_ad():
    temp = []
    # "***.txt" stands for the input file name
    with open("***.txt", "r", encoding='GBK') as ad_file:
        for i in ad_file:
            # Strip the line break, then append two spaces and the tag "ad"
            temp.append(i.rstrip("\n") + "  ad\n")
    with open("ad_word.txt", "w", encoding="GBK") as f:
        f.writelines(temp)


if __name__ == "__main__":
    add_ad()
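
The same suffixing step can be written as a small function on a list of strings, which makes it easy to reuse with other tags. This is a minimal sketch, not the original code: `add_suffix`, its parameters, and the sample words are assumptions, and unlike `add_ad` it skips blank lines.

```python
def add_suffix(lines, suffix="ad", sep="  "):
    """Append sep + suffix to every non-blank line; line terminators are dropped."""
    return [line.rstrip("\r\n") + sep + suffix for line in lines if line.strip()]


# Hypothetical sample words; the last one has no trailing line break.
sample_words = ["优惠\n", "促销\n", "\n", "打折"]
tagged = add_suffix(sample_words)
```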

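Personally, I think Java is certainly powerful, but Python is also very capable at natural language processing, and it is more convenient when it comes to reading files. This is especially true for Excel spreadsheets: Python can use each column or each row as an index and locate and join data directly in the document, whereas Java is comparatively cumbersome.
If you are interested in code for sentiment analysis of Chinese text, and in a "super toolkit", see the next article. I plan to share the results of half a month of tinkering; they are not spectacular, but they are practical and easy to build on.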
