Spark Shared Variables

Spark's shared variables are described in the programming guide: http://spark.apache.org/docs/1.6.3/programming-guide.html#shared-variables

  Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.  

I. Broadcast Variables

1. Description

   Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

  Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

2. Java implementation

package com.lyl.it;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastTest {
	
	public static void main(String[] args) {
		SparkConf conf = new SparkConf().setAppName("Broadcast").setMaster("local");
		JavaSparkContext sc = new JavaSparkContext(conf);
		
		// Wrap the read-only factor in a broadcast variable; executors read a locally cached copy
		// instead of receiving a new copy of the value with every task.
		final int f = 3;
		final Broadcast<Integer> broadCastFactor = sc.broadcast(f);
		
		List<Integer> list = Arrays.asList(1,2,3,4,5);
		JavaRDD<Integer> listRDD = sc.parallelize(list);
		JavaRDD<Integer> result = listRDD.map(new Function<Integer, Integer>() {

			private static final long serialVersionUID = 1L;

			@Override
			public Integer call(Integer num) throws Exception {
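				// Capturing f directly (the commented-out line below) would ship a copy of it inside every task's closure;
				// reading broadCastFactor.value() instead uses the copy cached on the executor.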
//				return num * f;
				return num * broadCastFactor.value();
			}
		});
		
		// foreach is an action: it triggers the job and prints each product on the executor.
		result.foreach(new VoidFunction<Integer>() {
		
			private static final long serialVersionUID = 1L;

			@Override
			public void call(Integer num) throws Exception {
				System.out.println(num);
				
			}
		});
		
		sc.close();
		
	}

}

The output is as follows:

18/07/25 10:19:38 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
18/07/25 10:19:38 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2170 bytes)
18/07/25 10:19:39 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
3
6
9
12
15
18/07/25 10:19:39 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 915 bytes result sent to driver
18/07/25 10:19:39 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 544 ms on localhost (1/1)
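
The example above uses the broadcast only within a single stage. As the quoted description notes, explicitly created broadcast variables mainly pay off when tasks in multiple stages need the same data. Below is a minimal sketch of that case (the class and variable names are hypothetical, not part of the original post): reduceByKey introduces a shuffle that splits the job into two stages, and tasks in both stages read the same cached broadcast value.

package com.lyl.it;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.broadcast.Broadcast;

import scala.Tuple2;

public class BroadcastTwoStagesTest {

	public static void main(String[] args) {
		SparkConf conf = new SparkConf().setAppName("BroadcastTwoStages").setMaster("local");
		JavaSparkContext sc = new JavaSparkContext(conf);

		final Broadcast<Integer> factor = sc.broadcast(3);

		JavaPairRDD<Integer, Integer> scaled = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
				.mapToPair(new PairFunction<Integer, Integer, Integer>() {

					private static final long serialVersionUID = 1L;

					@Override
					public Tuple2<Integer, Integer> call(Integer num) throws Exception {
						// Stage 1: tasks read the broadcast value cached on the executor.
						return new Tuple2<Integer, Integer>(num % 2, num * factor.value());
					}
				});

		List<Tuple2<Integer, Integer>> result = scaled
				.reduceByKey(new Function2<Integer, Integer, Integer>() {

					private static final long serialVersionUID = 1L;

					@Override
					public Integer call(Integer a, Integer b) throws Exception {
						return a + b;
					}
				})
				.mapValues(new Function<Integer, Integer>() {

					private static final long serialVersionUID = 1L;

					@Override
					public Integer call(Integer sum) throws Exception {
						// Stage 2 (after the shuffle): the same broadcast is read again
						// without being shipped to the tasks a second time.
						return sum * factor.value();
					}
				})
				.collect();

		System.out.println(result);

		sc.close();
	}

}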

II. Accumulators

1. Description

  Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark’s UI. This can be useful for understanding the progress of running stages (NOTE: this is not yet supported in Python).

  An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value. Only the driver program can read the accumulator’s value, using its value method.

2. Java implementation

package com.lyl.it;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class AccumulatorValueTest {
	
	public static void main(String[] args) {
      SparkConf conf = new SparkConf().setAppName("AccumulatorValue").setMaster("local");
      JavaSparkContext sc = new JavaSparkContext(conf);
      
      // Create an Integer accumulator starting at 0; giving it a name makes it visible in the Spark UI.
      final Accumulator<Integer> sum = sc.accumulator(0,"Our Accumulator");
      
      List<Integer> list = Arrays.asList(1,2,3,4,5);
      
      JavaRDD<Integer> listRDD = sc.parallelize(list);
      listRDD.foreach(new VoidFunction<Integer>() {
		
		private static final long serialVersionUID = 1L;

		@Override
		public void call(Integer num) throws Exception {
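			// Tasks can only add to the accumulator; per the description above, only the driver
			// can read its value, which is why the read below stays commented out.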
			sum.add(num);
//			System.out.println(sum.value());
		}
	});
      
      // Read the merged total on the driver once the foreach action has finished.
      System.out.println(sum.value());
      
     // Keep the application alive so the named accumulator can be inspected in the
     // Spark web UI (http://localhost:4040 by default).
     try {
		Thread.sleep(60 * 1000 * 1000);
	} catch (InterruptedException e) {
		e.printStackTrace();
	}
     
      sc.close();
		
	}

}

The output is as follows:

18/07/25 09:49:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2170 bytes)
18/07/25 09:49:37 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/07/25 09:49:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 975 bytes result sent to driver
18/07/25 09:49:37 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 675 ms on localhost (1/1)
18/07/25 09:49:37 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
18/07/25 09:49:37 INFO DAGScheduler: ResultStage 0 (foreach at AccumulatorValueTest.java:23) finished in 1.038 s
18/07/25 09:49:37 INFO DAGScheduler: Job 0 finished: foreach at AccumulatorValueTest.java:23, took 3.212206 s
15
18/07/25 10:12:55 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:56469 in memory (size: 1335.0 B, free: 1121.6 MB)
18/07/25 10:12:55 INFO ContextCleaner: Cleaned accumulator 2

The named accumulator and its value can also be seen in the Spark UI.
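
As the description notes, accumulators can also serve as counters (as in MapReduce). The following sketch (hypothetical class name, same Spark 1.6 Java API as the example above) counts the even elements of the RDD:

package com.lyl.it;

import java.util.Arrays;

import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class AccumulatorCounterTest {

	public static void main(String[] args) {
		SparkConf conf = new SparkConf().setAppName("AccumulatorCounter").setMaster("local");
		JavaSparkContext sc = new JavaSparkContext(conf);

		// Counter-style accumulator: starts at 0 and counts matching elements.
		final Accumulator<Integer> evenCount = sc.accumulator(0, "Even Count");

		sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).foreach(new VoidFunction<Integer>() {

			private static final long serialVersionUID = 1L;

			@Override
			public void call(Integer num) throws Exception {
				if (num % 2 == 0) {
					// Each task only adds to the counter; the driver reads the merged total.
					evenCount.add(1);
				}
			}
		});

		// Prints 2 (the elements 2 and 4 are even).
		System.out.println(evenCount.value());

		sc.close();
	}

}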

Reposted from blog.csdn.net/afafawfaf/article/details/81196716