1.1. Create a test file
$ cd ~/ipynotebook/
$ mkdir data
$ cd data/
$ vim word.txt
$ tail word.txt
hadoop spark hive
hive java python
spark perl hadoop
python RDD spark
RDD
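Before writing the Spark job, the expected word counts can be checked locally. This is a plain-Python sketch (not Spark) that counts the whitespace-separated words in the five sample lines shown above:

```python
from collections import Counter

# The sample lines from word.txt shown above
lines = [
    "hadoop spark hive",
    "hive java python",
    "spark perl hadoop",
    "python RDD spark",
    "RDD",
]

# Split each line on spaces and tally every word
counts = Counter(word for line in lines for word in line.split(" "))
print(counts)  # spark appears 3 times; hadoop, hive, python, RDD twice
```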
1.2. Write the Spark wordcount program
$ vim wordcount.py
#!/usr/bin/env python
from pyspark import SparkContext, SparkConf

# Run locally with the given application name
conf = SparkConf().setMaster("local").setAppName("pyspark WordCount")
sc = SparkContext(conf=conf)

# Load the input file as an RDD of lines
textFile = sc.textFile("data/word.txt")
# Split each line into words
stringRDD = textFile.flatMap(lambda line: line.split(" "))
# Pair each word with 1, then sum the counts per word
countsRDD = stringRDD.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
# Write the results to data/output/
countsRDD.saveAsTextFile("data/output")
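The flatMap → map → reduceByKey chain can be illustrated without a Spark cluster. This plain-Python sketch mimics what each stage produces on two illustrative input lines (the variable names are hypothetical, not part of the Spark API):

```python
from collections import defaultdict

lines = ["hadoop spark hive", "spark perl hadoop"]  # illustrative input lines

# flatMap: split every line into one flat list of words
words = [word for line in lines for word in line.split(" ")]

# map: pair each word with the count 1
pairs = [(word, 1) for word in words]

# reduceByKey: merge the 1s for each distinct word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'hadoop': 2, 'spark': 2, 'hive': 1, 'perl': 1}
```

The real job differs only in that Spark partitions the data and runs these stages in parallel.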
$ spark-submit wordcount.py
$ cd ~/ipynotebook/data/
$ tree
.
├── output
│   ├── part-00000
│   └── _SUCCESS
└── word.txt
1 directory, 3 files
$ tail output/part-00000
('hadoop', 2)
('spark', 3)
('hive', 2)
('java', 1)
('python', 2)
('perl', 1)
('RDD', 2)
('', 1)
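The ('', 1) entry means word.txt contains one empty token: split(" ") yields an empty string for a trailing space or a blank line, which the earlier listing does not make visible. A quick check in plain Python (the trailing space here is an assumed example, not taken from the file):

```python
# A trailing space produces an empty token when splitting on " "
tokens = "python RDD spark ".split(" ")
print(tokens)  # ['python', 'RDD', 'spark', '']

# A blank line splits to a single empty token as well
print("".split(" "))  # ['']
```

In the Spark job, such empty tokens can be dropped by adding a standard filter transformation between the flatMap and map steps, e.g. `.filter(lambda word: word != "")`.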