Scala for big data and the Spark ecosystem: a must-have for getting into Spark and a foundation for a “high salary”

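Source: http://spark.apache.org/docs/latest/programming-guide.html (I did not bother translating the later parts; those notes are in English, and I will translate them when I review.)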

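Summary: Every Spark application consists of a driver program that runs the main function and executes various parallel operations on a cluster. The RDD is the core of Spark. Besides RDDs, Spark's second abstraction is shared variables that can be used in parallel operations, of which there are two kinds: broadcast variables and accumulators.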

Spark's shells: bin/spark-shell (Scala); bin/pyspark (Python).

0.Linking with Spark => Initializing Spark => programming => submit

First create a SparkContext object, which tells Spark how to access a cluster. Before creating the SparkContext, you first build a SparkConf object that contains information about your application, as follows.

val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)

appName is the name your application shows on the cluster UI. master is a Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode.
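A minimal sketch of the initialization step as a standalone program. The object name InitSketch, the helper makeContext and the master string "local[2]" are illustrative assumptions, not taken from the guide; on a real cluster the master would normally be supplied by spark-submit rather than hard-coded.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InitSketch {
  // Hypothetical helper: builds the SparkConf first, then the SparkContext from it.
  def makeContext(appName: String, master: String): SparkContext = {
    val conf = new SparkConf().setAppName(appName).setMaster(master)
    new SparkContext(conf)
  }

  def main(args: Array[String]): Unit = {
    // "InitSketch" and "local[2]" are placeholder values for a local run.
    val sc = makeContext("InitSketch", "local[2]")
    try {
      println(s"app = ${sc.appName}, master = ${sc.master}")
    } finally {
      sc.stop() // always release the context when done
    }
  }
}
```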

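PS: In the Spark shell, a special SparkContext has already been created for you in the variable sc. Making your own SparkContext there will not work; use the --master argument to choose which master the context connects to.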

Once you package your application into a JAR (for Java/Scala) or a set of .py or .zip files (for Python), the bin/spark-submit script lets you submit it to any supported cluster manager.

Spark is friendly to unit testing with any popular unit test framework. Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test framework’s tearDown method, as Spark does not support two contexts running concurrently in the same program.
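The stop-in-finally advice can be sketched without any particular test framework. The helper withLocalContext below is a hypothetical name introduced for illustration; a real test suite would more likely put the same logic in its setUp/tearDown hooks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalTestSketch {
  // Runs `body` against a fresh local SparkContext and always stops it afterwards.
  def withLocalContext[T](body: SparkContext => T): T = {
    val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
    try body(sc) finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val total = withLocalContext { sc => sc.parallelize(1 to 100).reduce(_ + _) }
    assert(total == 5050)
    // A second context can be created here only because the first one was stopped.
    val n = withLocalContext { sc => sc.parallelize(Seq("a", "b")).count() }
    assert(n == 2L)
  }
}
```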

1.RDD

1）Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
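As a small illustration of the second parameter, the sketch below compares the automatically chosen partition count with an explicitly requested one. The object name and the local[2] master are assumptions made for the example; the automatic count depends on the environment, so only the explicit one is predictable.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  // Returns (automatically chosen partition count, explicitly requested partition count).
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    val data = Array(1, 2, 3, 4, 5)
    val auto = sc.parallelize(data)       // Spark picks the number of partitions
    val manual = sc.parallelize(data, 10) // ten partitions requested explicitly
    (auto.partitions.length, manual.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionSketch").setMaster("local[2]"))
    try {
      val (auto, manual) = partitionCounts(sc)
      println(s"automatic = $auto, manual = $manual")
    } finally sc.stop()
  }
}
```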

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08

SparkContext’s textFile method takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc. URI) and reads it as a collection of lines. Once created, distFile can be acted on by dataset operations.

2）RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes. Storage levels include MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, MEMORY_AND_DISK, DISK_ONLY...
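The interplay of lazy transformations, actions and persist can be sketched as follows. This uses an in-memory collection instead of a file so that it stays self-contained; the object and method names are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LazyPersistSketch {
  // Returns (sum of line lengths, longest line length).
  def lengthStats(sc: SparkContext, lines: Seq[String]): (Int, Int) = {
    val rdd = sc.parallelize(lines)
    val lengths = rdd.map(_.length)            // transformation: only recorded, not run
    lengths.persist(StorageLevel.MEMORY_ONLY)  // marks the RDD for caching; still nothing runs
    val total = lengths.reduce(_ + _)          // first action: computes the lengths and caches them
    val longest = lengths.reduce((a, b) => math.max(a, b)) // second action can reuse the cached data
    (total, longest)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyPersistSketch").setMaster("local[2]"))
    try println(lengthStats(sc, Seq("a", "bb", "ccc"))) finally sc.stop()
  }
}
```

Without the persist call the second reduce would simply recompute the map step, which is harmless here but can be costly for long lineages.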

Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this:
- Anonymous function syntax, which can be used for short pieces of code.
- Static methods in a global singleton object. For example, you can define object MyFunctions and then pass MyFunctions.func1, as follows:

object MyFunctions {
  def func1(s: String): String = { ... }
}

myRdd.map(MyFunctions.func1)
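Both styles side by side, as a rough sketch. The guide leaves the body of func1 unspecified, so the object TextFunctions and its normalize method below are hypothetical stand-ins chosen only to make the example runnable.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical singleton object holding a function to ship to the cluster.
object TextFunctions {
  def normalize(s: String): String = s.trim.toUpperCase
}

object PassingFunctionsSketch {
  // Returns (results via the singleton's method, results via an anonymous function).
  def run(sc: SparkContext, words: Seq[String]): (Seq[String], Seq[Int]) = {
    val rdd = sc.parallelize(words)
    val viaObject = rdd.map(TextFunctions.normalize).collect().toSeq // method in a global singleton
    val viaLambda = rdd.map(s => s.trim.length).collect().toSeq      // anonymous function syntax
    (viaObject, viaLambda)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PassingFunctionsSketch").setMaster("local[2]"))
    try println(run(sc, Seq(" spark ", "rdd"))) finally sc.stop()
  }
}
```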

While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key. In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing (a, b)), as long as you import org.apache.spark.SparkContext._ in your program to enable Spark’s implicit conversions. The key-value pair operations are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples if you import the conversions.

val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
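The same counting pattern as a self-contained sketch, using an in-memory collection in place of data.txt and collecting the result back to the driver. The object name and sample data are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicit pair-RDD conversions, needed on older Spark releases

object LineCountSketch {
  // Counts how many times each distinct line occurs.
  def countLines(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    val pairs = sc.parallelize(lines).map(s => (s, 1))
    val counts = pairs.reduceByKey((a, b) => a + b)
    counts.collect().toMap // action: bring the (small) result back to the driver
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LineCountSketch").setMaster("local[2]"))
    try println(countLines(sc, Seq("a", "b", "a"))) finally sc.stop()
  }
}
```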

Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it is grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

Operations which can cause a shuffle include repartition operations, ‘ByKey operations (except for counting), and join operations. To organize data for the shuffle, Spark generates sets of tasks: map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce.
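Two shuffle-causing operations in a small sketch: a repartition and a ‘ByKey aggregation. The object name, sample pairs and partition counts are illustrative assumptions; in local mode no data actually crosses machines, but the same map and reduce stages are still generated.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicit pair-RDD conversions, needed on older Spark releases

object ShuffleSketch {
  // Returns (partition count after repartition, per-key sums).
  def run(sc: SparkContext): (Int, Map[String, Int]) = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
    val repartitioned = pairs.repartition(4) // redistributes the records into 4 partitions
    val summed = pairs.reduceByKey(_ + _)    // all values of one key end up in the same partition
    (repartitioned.partitions.length, summed.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ShuffleSketch").setMaster("local[2]"))
    try println(run(sc)) finally sc.stop()
  }
}
```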

Common transformations (see the docs for details):

map, filter, flatMap, sample, union, intersection, distinct, groupByKey, reduceByKey, aggregateByKey, sortByKey, join, cartesian

Common actions:

reduce, collect, count, first, take, takeSample, takeOrdered, saveAsTextFile, countByKey, foreach
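A few of the transformations and actions listed above, exercised on tiny in-memory datasets. This is only a sampler under assumed inputs, not a complete tour; the object name is made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OpsSketch {
  // Returns (evens, words, unique values, element count, two smallest values).
  def run(sc: SparkContext): (Seq[Int], Seq[String], Seq[Int], Long, Seq[Int]) = {
    val nums = sc.parallelize(1 to 10)
    val evens = nums.filter(_ % 2 == 0).collect().toSeq                          // filter + collect
    val words = sc.parallelize(Seq("a b", "c")).flatMap(_.split(" ")).collect().toSeq // flatMap
    // distinct does not guarantee an order, so sort on the driver before comparing
    val unique = sc.parallelize(Seq(1, 1, 2, 3, 3)).distinct().collect().sorted.toSeq
    val count = nums.count()                                                     // count action
    val smallest = sc.parallelize(Seq(5, 3, 9, 1)).takeOrdered(2).toSeq          // takeOrdered action
    (evens, words, unique, count, smallest)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OpsSketch").setMaster("local[2]"))
    try println(run(sc)) finally sc.stop()
  }
}
```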

2.Shared Variables

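Normally, when a function passed to a Spark operation runs on a remote cluster node, it works on separate copies of the variables it uses, and updates to those copies are not propagated back to the driver program. Supporting general read-write shared variables across tasks would be inefficient, so Spark provides two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.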

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. A broadcast variable is created from a variable v by calling SparkContext.broadcast(v), and its value is read through the value method. After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once.

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
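The shell session above only reads the value on the driver. A sketch of the more typical use, reading a broadcast lookup table inside a function that runs on the cluster, might look like this; the table contents and the names are invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  // Translates numeric keys to names through a broadcast lookup table.
  def translate(sc: SparkContext, keys: Seq[Int]): Seq[String] = {
    val table = Map(1 -> "one", 2 -> "two")  // hypothetical lookup data
    val lookup = sc.broadcast(table)         // shipped to each node at most once
    sc.parallelize(keys)
      .map(k => lookup.value.getOrElse(k, "?")) // tasks read lookup.value, not `table`
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastSketch").setMaster("local[2]"))
    try println(translate(sc, Seq(1, 2, 3))) finally sc.stop()
  }
}
```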

Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums.

scala> val accum = sc.accumulator(0, "My Accumulator")
accum: spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value
res2: Int = 10
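The sc.accumulator call shown above is deprecated since Spark 2.0. The sketch below assumes Spark 2.0 or later and uses the typed sc.longAccumulator instead; on the older release these notes were written against, the shell session above is the form to use. The object name is illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorSketch {
  // Sums the values by adding to an accumulator from inside tasks.
  // Assumes Spark 2.0+, where longAccumulator is available.
  def sumWithAccumulator(sc: SparkContext, values: Seq[Int]): Long = {
    val accum = sc.longAccumulator("My Accumulator")
    sc.parallelize(values).foreach(x => accum.add(x)) // tasks can only add to it
    accum.sum                                          // only the driver reads the result
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AccumulatorSketch").setMaster("local[2]"))
    try println(sumWithAccumulator(sc, Seq(1, 2, 3, 4))) finally sc.stop()
  }
}
```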

Next, read some examples. Highly recommended: http://dongxicheng.org/framework-on-yarn/spark-scala-writing-application/

And a complete logistic regression (LR) example from the docs:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)

// Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

// Get evaluation metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val precision = metrics.precision
println("Precision = " + precision)

// Save and load model
model.save(sc, "myModelPath")
val sameModel = LogisticRegressionModel.load(sc, "myModelPath")

