1. Open PyCharm
2. Unzip Hadoop under Windows; make sure the path contains no Chinese characters
3. Unzip Spark under Windows; again, make sure the path contains no Chinese characters
4. Configure the environment variables for Hadoop and Spark in PyCharm
4.1 Create a new project
4.2 Create a new Python file (testspark.py) in the project
4.3 Add the Hadoop environment variable in PyCharm:
HADOOP_HOME (pointing to the Hadoop unzip directory)
4.4 Put the winutils.exe plug-in under hadoop/bin
4.5 Add the Spark environment variables in PyCharm:
SPARK_HOME and PYTHONPATH
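If you prefer to set these variables in code rather than in PyCharm's run configuration, a minimal sketch is below. The paths are placeholders; substitute wherever you actually unzipped Hadoop and Spark (with no Chinese characters in the path):

```python
import os

# Placeholder paths -- replace with your actual unzip locations
os.environ["HADOOP_HOME"] = r"D:\hadoop"
os.environ["SPARK_HOME"] = r"D:\spark"

# PYTHONPATH should point at Spark's bundled Python bindings
spark_python = os.path.join(os.environ["SPARK_HOME"], "python")
os.environ["PYTHONPATH"] = spark_python

print(os.environ["HADOOP_HOME"])
print(os.environ["PYTHONPATH"])
```

Setting them in PyCharm's run configuration has the same effect; findspark.init() (used in step 6.1) reads SPARK_HOME to locate Spark.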
5. Install the plugin
6. Test
6.1 Put the following code into the testspark.py file created in step 4.2:
import findspark
findspark.init()  # locate Spark via SPARK_HOME before importing pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Word count over Spark's bundled README
spark.sparkContext.textFile("file:///D:/ruanjian/spark/spark-2.4.6-bin-hadoop2.7/README.md") \
    .flatMap(lambda x: x.split(' ')) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .foreach(print)

spark.stop()
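To see what the transformation chain does, here is the same word-count logic in plain Python (a small illustrative sketch with made-up input lines, no Spark required):

```python
from collections import Counter

lines = ["hello spark", "hello world"]

# flatMap: split each line into words and flatten into one list
words = [w for line in lines for w in line.split(' ')]

# map + reduceByKey: pair each word with 1, then sum the counts per word
counts = Counter(words)

print(dict(counts))  # → {'hello': 2, 'spark': 1, 'world': 1}
```

Spark does the same thing, but distributes the splitting and counting across partitions.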
Caution: the code above will not run until the packages in step 6.2 are installed.
6.2 Install the pyspark and findspark packages
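A quick way to check that both packages are importable before running the test (is_installed is a hypothetical helper, built only on the standard library):

```python
import importlib.util

def is_installed(pkg):
    # True if Python can find the package, without actually importing it
    return importlib.util.find_spec(pkg) is not None

for pkg in ("pyspark", "findspark"):
    status = "OK" if is_installed(pkg) else "missing - run: pip install " + pkg
    print(pkg, status)
```

If either package reports missing, install it with pip and re-run the check.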
6.3 Run testspark.py to test