Writing a Chinese WordCount for Spark in Python with the jieba word segmentation library

For environment setup, see: Setting up a Python 3 environment (pyspark) for Spark 2.3 on Windows 10

The IDE used here is PyCharm.

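Put the following code in WordCount.py. It is a Chinese version of WordCount, the classic distributed program: it uses the Chinese word segmentation library jieba to split each line into words, removes stop words, and then counts the remaining words.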

from pyspark.context import SparkContext
import jieba
sc = SparkContext("local", "WordCount")   # initialize a local SparkContext
data = sc.textFile(r"D:\WordCount.txt")   # the input file is UTF-8 encoded
# Load the Chinese stop-word list (one word per line) into a set for fast lookups
with open(r'd:\中文停用词库.txt', 'r', encoding='utf-8') as f:
    stop = {line.rstrip('\n') for line in f}
# Extra tokens to drop: punctuation, a few very common words, numbers and Roman numerals
stop.update([',','的','我','他','','。',' ','\n','?',';',':','-','(',')','!','1909','1920','325','B612','II','III','IV','V','VI','—','‘','’','“','”','…','、'])
# Segment each line, drop stop words, count each word, then sort by count in descending order
data = data.flatMap(lambda line: jieba.cut(line, cut_all=False)).filter(lambda w: w not in stop).\
    map(lambda w: (w, 1)).reduceByKey(lambda w0, w1: w0 + w1).sortBy(lambda x: x[1], ascending=False)
print(data.take(100))   # print the 100 most frequent words
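
To check the logic of this pipeline without starting Spark, the same steps can be mirrored in plain Python. This is only a minimal sketch, not part of the original script: `count_words` and its `tokenize` parameter are hypothetical names. In practice `tokenize` would be `jieba.cut`; the demo call uses whitespace splitting as a stand-in so that the snippet runs with the standard library alone.

```python
from collections import Counter

def count_words(lines, tokenize, stop_words):
    """Mirror flatMap -> filter -> map/reduceByKey -> sortBy on a local iterable."""
    stop = set(stop_words)              # set membership tests are O(1)
    counts = Counter()
    for line in lines:
        for word in tokenize(line):     # flatMap: one line becomes many tokens
            if word not in stop:        # filter: drop stop words
                counts[word] += 1       # map + reduceByKey: add 1 per occurrence
    # sortBy(count, ascending=False): most_common() sorts by count, descending
    return counts.most_common()

# Stand-in tokenizer (assumption): str.split instead of jieba.cut
sample = ["the fox and the rose", "the fox"]
result = count_words(sample, str.split, stop_words=["and"])
print(result)
```

Swapping `str.split` for `jieba.cut` should give the same kind of (word, count) list that `data.take(100)` returns, as long as the input fits in memory.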

The output is:

C:\Anaconda3.5.2.0\python.exe D:/Project/WordCount.py
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/D:/spark/jars/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2100-01-01 10:00:00 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2100-01-01 10:00:00 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
[Stage 0:>                                                          (0 + 1) / 1]Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\13307\AppData\Local\Temp\jieba.cache
Loading model cost 0.787 seconds.
Prefix dict has been built succesfully.
[('小王子', 419), ('说', 317), ('一个', 211), ('没有', 200), ('说道', 134), ('星星', 118), ('星球', 104), ('会', 98), ('回答', 91), ('地方', 80), ('国王', 78), ('画', 74), ('狐狸', 72), ('知道', 68), ('中', 67), ('花', 64), ('羊', 62), ('一只', 61), ('道', 57), ('非常', 56), ('看到', 53), ('命令', 52), ('有点', 50), ('这是', 48), ('不会', 48), ('朋友', 47), ('沙漠', 46), ('走', 46), ('地理学家', 46), ('.', 45), ('时', 43), ('想', 42), ('事', 42), ('感到', 42), ('行星', 42), ('问题', 41), ('可能', 40), ('真', 40), ('重要', 39), ('猴面包树', 38), ('&#', 38), ('39', 38), (';', 38), ('时间', 37), ('象', 36), ('问', 36), ('笑', 36), ('地球', 36), ('里', 35), ('爱', 34), ('花儿', 34), ('这种', 32), ('喜欢', 32), ('做', 32), ('蛇', 32), ('驯服', 32), ('一点', 31), (':', 31), ('看着', 30), ('一种', 30), ('发现', 30), ('一定', 30), ('一颗', 30), ('\u3000', 30), ('你好', 30), ('点灯', 30), ('探察', 30), ('大人', 29), ('家', 29), ('东西', 28), ('看见', 28), ('好象', 28), ('这位', 28), ('提出', 28), ('问道', 28), ('应该', 28), ('吃', 28), ('一天', 28), ('请', 27), ('住', 27), ('起来', 27), ('现在', 27), ('奇怪', 26), ('从来', 26), ('已经', 26), ('明白', 26), ('朵花', 26), ('路灯', 26), ('寻找', 26), ('十分', 24), ('小家伙', 24), ('是从', 24), ('地说', 24), ('年', 24), ('自言自语', 24), ('虚荣', 24), ('生活', 22), ('严肃', 22), ('工作', 22), ('想要', 22)]

Process finished with exit code 0

Stripped of the log messages, the final result is:

[('小王子', 419), ('说', 317), ('一个', 211), ('没有', 200), ('说道', 134), ('星星', 118), ('星球', 104), ('会', 98), ('回答', 91), ('地方', 80), ('国王', 78), ('画', 74), ('狐狸', 72), ('知道', 68), ('中', 67), ('花', 64), ('羊', 62), ('一只', 61), ('道', 57), ('非常', 56), ('看到', 53), ('命令', 52), ('有点', 50), ('这是', 48), ('不会', 48), ('朋友', 47), ('沙漠', 46), ('走', 46), ('地理学家', 46), ('.', 45), ('时', 43), ('想', 42), ('事', 42), ('感到', 42), ('行星', 42), ('问题', 41), ('可能', 40), ('真', 40), ('重要', 39), ('猴面包树', 38), ('&#', 38), ('39', 38), (';', 38), ('时间', 37), ('象', 36), ('问', 36), ('笑', 36), ('地球', 36), ('里', 35), ('爱', 34), ('花儿', 34), ('这种', 32), ('喜欢', 32), ('做', 32), ('蛇', 32), ('驯服', 32), ('一点', 31), (':', 31), ('看着', 30), ('一种', 30), ('发现', 30), ('一定', 30), ('一颗', 30), ('\u3000', 30), ('你好', 30), ('点灯', 30), ('探察', 30), ('大人', 29), ('家', 29), ('东西', 28), ('看见', 28), ('好象', 28), ('这位', 28), ('提出', 28), ('问道', 28), ('应该', 28), ('吃', 28), ('一天', 28), ('请', 27), ('住', 27), ('起来', 27), ('现在', 27), ('奇怪', 26), ('从来', 26), ('已经', 26), ('明白', 26), ('朵花', 26), ('路灯', 26), ('寻找', 26), ('十分', 24), ('小家伙', 24), ('是从', 24), ('地说', 24), ('年', 24), ('自言自语', 24), ('虚荣', 24), ('生活', 22), ('严肃', 22), ('工作', 22), ('想要', 22)]
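
The tokens '&#', '39' and ';' each appear 38 times in this result, which suggests that the input file contains the HTML entity `&#39;` (an escaped apostrophe) that jieba split into three pieces; '\u3000' is the ideographic space. One possible cleanup, sketched here under that assumption, is to unescape each line before segmentation and to drop whitespace-only tokens afterwards. `clean_line` and `is_real_token` are hypothetical helper names, not part of the original script.

```python
import html

def clean_line(line):
    """Turn HTML entities such as &#39; back into the characters they encode."""
    return html.unescape(line)

def is_real_token(token):
    """Reject empty strings and whitespace-only tokens (including U+3000)."""
    return bool(token.strip())

print(clean_line("it&#39;s"))       # the entity becomes an apostrophe
print(is_real_token("\u3000"))      # ideographic space is rejected
print(is_real_token("小王子"))       # a normal word is kept
```

In the Spark pipeline this would roughly correspond to calling `data.map(clean_line)` before `flatMap` and adding `.filter(is_real_token)` after it; whether that is needed depends on what the input file actually contains.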


Reposted from blog.csdn.net/shiheyingzhe/article/details/80718811