Parallel Word Segmentation with jieba

Other 2018-04-28 14:54:37

The source file is tab-separated and has four columns: id, title, content, and label.

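import os

import jieba
import pandas as pd
import yaml
from joblib import Parallel, delayed

# Load settings; config['train_cut'] is the path of the segmented output file.
with open('config.yaml', 'r', encoding='utf-8') as f:
    config = yaml.safe_load(f)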


def read_df(trainfile):
    data = pd.read_csv(trainfile, sep='\t', header=None, nrows=60000,
                       encoding='utf-8', names=['id', 'title', 'content', 'label'])
    return data


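def word_cut(row):
    # row is one (id, title, content, label) record taken from df.values.
    # id and label may not be strings, so convert them before joining.
    line = '\t'.join([str(row[0]), ' '.join(jieba.cut(row[1])),
                      ' '.join(jieba.cut(row[2])), str(row[3])])
    # Every worker process appends to the same file, so emit the record in one write.
    with open(config['train_cut'], 'a', encoding='utf-8') as f:
        f.write(line + '\n')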


def applyParallel(content, func, n_thread):
    with Parallel(n_jobs=n_thread) as parallel:
        parallel(delayed(func)(c) for c in content)


def main():
    overwrite = True
    if overwrite:
        if os.path.exists(config['train_cut']):
            os.remove(config['train_cut'])

    trainfile = 'data/train_fusai.tsv'
    df = read_df(trainfile)
    content = df.values
    applyParallel(content, word_cut, 22)


if __name__ == '__main__':
    main()
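
In the script above, every worker process appends to the same output file, so the order of the output lines is not guaranteed to match the input. One possible variation, shown here only as a sketch and not as the original author's method, is to have the workers return the segmented lines and let the parent process do all the writing. The names format_row, jieba_row, and cut_rows_parallel are hypothetical helpers introduced for this illustration. The tokenizer is passed in as a parameter so that the formatting logic can be checked without jieba installed.

```python
def format_row(row, cut):
    """Build one output line from an (id, title, content, label) row.

    `cut` is any callable that maps a string to an iterable of tokens,
    for example jieba.cut.
    """
    doc_id, title, content, label = row
    return '\t'.join([
        str(doc_id),
        ' '.join(cut(str(title))),
        ' '.join(cut(str(content))),
        str(label),
    ])


def jieba_row(row):
    # Imported lazily so that format_row stays usable without jieba.
    import jieba
    return format_row(row, jieba.cut)


def cut_rows_parallel(rows, out_path, n_jobs=4):
    """Segment rows in worker processes, then write them from the parent only."""
    from joblib import Parallel, delayed
    # Parallel returns the results in the same order as the input rows.
    lines = Parallel(n_jobs=n_jobs)(delayed(jieba_row)(r) for r in rows)
    with open(out_path, 'w', encoding='utf-8') as f:
        f.writelines(line + '\n' for line in lines)
```

With the script's own variables, this could be called as cut_rows_parallel(df.values, config['train_cut'], n_jobs=22). The trade-off is that all segmented lines are held in memory before being written, which is assumed to be acceptable for the 60000 rows read here.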


Reposted from www.cnblogs.com/zle1992/p/8967644.html
