Spark Basic Methods


Transformations

map(func)

Returns a new dataset formed by passing each element of the source dataset through the function func.

filter(func)

Returns a new dataset formed by selecting those elements of the source dataset on which func returns true.

flatMap(func)

Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a Seq rather than a single element).
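
A minimal interpreter session illustrating map, filter and flatMap (it assumes the Spark shell, where sc is the SparkContext, and uses a small made-up dataset; expected results are shown as comments):

scala> val nums = sc.parallelize(Array(1, 2, 3, 4))

scala> nums.map(x => x * 2).collect()               // => Array(2, 4, 6, 8)

scala> nums.filter(x => x % 2 == 0).collect()       // => Array(2, 4)

scala> nums.flatMap(x => Seq(x, x * 10)).collect()  // => Array(1, 10, 2, 20, 3, 30, 4, 40)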

sample(withReplacement, frac, seed)

Samples a fraction frac of the data, with or without replacement, using the given random seed seed.

union(otherDataset)

Returns a new dataset containing the union of the elements of the source dataset and the argument.
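
A short sketch of sample and union under the same assumptions (Spark shell, toy data):

scala> val a = sc.parallelize(1 to 10)

scala> a.sample(false, 0.5, 42).collect()   // roughly half of the elements, chosen by the given seed

scala> val b = sc.parallelize(Array(11, 12))

scala> a.union(b).count()                   // => 12 (union does not remove duplicates)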

groupByKey([numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. Note: by default, 8 parallel tasks are used for the grouping; you can pass an optional numTasks argument to use a different number of tasks.

(Combined with filter, groupByKey can implement functionality similar to the Reduce phase in Hadoop.)

reduceByKey(func, [numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs in which the values for each key are aggregated using the given reduce function func. As with groupByKey, the number of tasks is configurable through an optional second argument.
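
The following sketch (Spark shell, made-up pairs) contrasts groupByKey and reduceByKey on the same (K, V) dataset:

scala> val pairs = sc.parallelize(Array(("a", 1), ("a", 2), ("b", 3)))

scala> pairs.groupByKey().collect()                  // groups values per key, e.g. (a -> [1, 2]), (b -> [3])

scala> pairs.reduceByKey((x, y) => x + y).collect()  // => Array((a,3), (b,3)) (key order may vary)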

join(otherDataset, [numTasks])

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
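
A minimal join example (Spark shell, hypothetical user/score data); note that keys present in only one of the datasets are dropped:

scala> val users  = sc.parallelize(Array((1, "alice"), (2, "bob")))

scala> val scores = sc.parallelize(Array((1, 90), (2, 80), (3, 70)))

scala> users.join(scores).collect()   // => Array((1,(alice,90)), (2,(bob,80))); key 3 has no match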

groupWith(otherDataset, [numTasks])

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples. In other frameworks this operation is called CoGroup.

cartesian(otherDataset)

Cartesian product. When called on datasets of types T and U, returns a dataset of (T, U) pairs, i.e. the Cartesian product of all pairs of elements.

sortByKey([ascendingOrder])

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs sorted by the key K. Ascending or descending order is determined by the boolean ascendingOrder argument.

(Similar to the intermediate sort-by-key stage of Hadoop Map-Reduce.)
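
A small sortByKey sketch (Spark shell, toy pairs):

scala> val kv = sc.parallelize(Array(("b", 2), ("a", 1), ("c", 3)))

scala> kv.sortByKey().collect()        // => Array((a,1), (b,2), (c,3))

scala> kv.sortByKey(false).collect()   // => Array((c,3), (b,2), (a,1))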

Actions

reduce(func)

Aggregates all the elements of the dataset using the function func, which takes two arguments and returns one value. The function must be associative so that it can be computed correctly in parallel.
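
For example, summing a dataset with an associative function (Spark shell, toy data):

scala> sc.parallelize(Array(1, 2, 3, 4)).reduce((a, b) => a + b)   // => 10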

collect()

Returns all the elements of the dataset as an array to the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data; calling collect on an entire large RDD may cause the driver program to run out of memory (OOM).

count()

Returns the number of elements in the dataset.

take(n)

Returns an array with the first n elements of the dataset. Note that this is currently not executed in parallel on multiple nodes; instead, the driver program computes all the elements on the single machine where it runs.

(This increases memory pressure on the gateway/driver machine, so use it with caution.)

first()

Returns the first element of the dataset (similar to take(1)).
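
The session below (Spark shell, toy data) exercises count, first, take and collect together:

scala> val data = sc.parallelize(1 to 100)

scala> data.count()                         // => 100

scala> data.first()                         // => 1

scala> data.take(3)                         // => Array(1, 2, 3)

scala> data.filter(x => x > 95).collect()   // => Array(96, 97, 98, 99, 100)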

saveAsTextFile(path)

Writes the elements of the dataset as a text file (or set of text files) to a given directory in the local file system, HDFS, or any other Hadoop-supported file system. Spark calls the toString method of each element to convert it into a line of text in the file.

saveAsSequenceFile(path)

Writes the elements of the dataset as a Hadoop SequenceFile to a given directory in the local file system, HDFS, or any other Hadoop-supported file system. The elements of the RDD must consist of key-value pairs that implement Hadoop's Writable interface, or that can be implicitly converted to Writable (Spark includes such conversions for basic types, e.g. Int, Double, String, etc.).
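
A sketch of both save operations; the output paths are hypothetical, and the target directories must not already exist (Hadoop's output format refuses to overwrite):

scala> sc.parallelize(Array("spark", "hadoop", "hdfs")).saveAsTextFile("/tmp/words_out")   // one line of text per element

scala> sc.parallelize(Array((1, "a"), (2, "b"))).saveAsSequenceFile("/tmp/pairs_out")      // key-value pairs convertible to Writable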

foreach(func)

Runs the function func on each element of the dataset. This is usually done to update an accumulator variable or to interact with external storage systems.

Cache

Calling an RDD's cache() method keeps its result in memory after the first time it is computed. Different partitions of the dataset are stored on the cluster nodes that computed them, which makes subsequent operations on the dataset faster. The cache is fault-tolerant: if any partition of the RDD is lost, it is recomputed using the transformations that originally created it (only the lost partitions are recomputed, not the whole dataset).
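
A minimal caching sketch (the HDFS path is hypothetical): the filtered RDD is computed once, kept in memory, and reused by later actions:

scala> val errors = sc.textFile("hdfs://namenode:9000/logs").filter(line => line.contains("ERROR"))

scala> errors.cache()    // mark the RDD to be kept in memory after it is first computed

scala> errors.count()    // first action: reads the file, filters it, and materializes the cache

scala> errors.count()    // later actions reuse the cached partitions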

Shared Variables

Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a cluster node, it works on separate copies of all the variables used in the function, so tasks do not affect each other. These variables are copied to each machine, and no updates to the variables on the remote machines are propagated back to the driver program. However, Spark provides two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

Broadcast variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine, rather than shipping a copy of it with each task. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

A broadcast variable is created from a variable v by calling the SparkContext.broadcast(v) method. The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The interpreter session below shows how to use it:

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))

broadcastVar: spark.Broadcast[Array[Int]] = spark.Broadcast(b5c40191-a864-4c7d-b9bf-d87e1a4e787c)

scala> broadcastVar.value

res0: Array[Int] = Array(1, 2, 3)

After a broadcast variable has been created, it should be used instead of the value v in any functions run on the cluster, so that v does not need to be shipped to the nodes again. In addition, the object v must not be modified after it is broadcast; it is read-only, which guarantees that all nodes receive exactly the same value of the variable.

Accumulators

Accumulators are variables that can only be "added" to through an associative operation, and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of type Int and Double, and programmers can add support for new types.

An accumulator is created by calling the SparkContext.accumulator(v) method. Tasks running on the cluster can then add to it using the += operator; however, they cannot read its value. Only the driver program can read the accumulator's value, using its value method.

The interpreter session below shows how an accumulator can be used to add up the elements of an array:

scala> val accum = sc.accumulator(0)

accum: spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value

res2: Int = 10

Origin blog.csdn.net/ws_developer/article/details/51556735