Spark code
Preparation before writing code
Spark Maven dependency
groupId = org.apache.spark
artifactId = spark-core_2.11
version = 2.3.3
Hadoop Maven dependency
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
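As a sketch, the two dependencies above could be declared in a Maven pom.xml like this (the Hadoop version placeholder is kept as-is; fill in the version that matches your HDFS cluster):

```xml
<dependencies>
  <!-- Spark core, built against Scala 2.11 -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.3</version>
  </dependency>
  <!-- Hadoop client matching your HDFS version -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version><your-hdfs-version></version>
  </dependency>
</dependencies>
```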
A simple code example
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Test {
  def main(args: Array[String]): Unit = {
    // "LogAnalysis" and "local[*]" are example values; set your own app name and master URL
    val conf = new SparkConf().setAppName("LogAnalysis").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("hdfs://...")
    val errors = lines.filter(_.startsWith("ERROR")) // each record can be referred to with "_"
    errors.persist() // cache the filtered RDD so it is not recomputed for each count
    val mysqlErrors = errors.filter(_.contains("MySQL")).count()
    val httpErrors = errors.filter(_.contains("Http")).count()
    sc.stop()
  }
}
The code above takes the records that start with "ERROR" and counts how many of them contain "MySQL" and how many contain "Http". Each action triggers a job, so the two count calls here produce two jobs.
Common functions
parallelize(local collection): parallelizes a local collection, turning it into an RDD
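A minimal sketch of parallelize, assuming a SparkContext named sc has already been created as in the example above:

```scala
// turn a local Scala collection into an RDD with 2 partitions
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)
// transformations and actions then run across the partitions
val total = nums.map(_ * 2).sum()  // 30.0
```

The second argument (the number of partitions) is optional; if omitted, Spark picks a default based on the cluster configuration.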