Basic RDD operations

Definition

RDD stands for Resilient Distributed Dataset. An RDD is a collection of elements distributed across a cluster, comparable to familiar collections such as List, Array, Set, and Map. In Spark, all work on data amounts to creating RDDs, transforming existing RDDs, and calling actions on RDDs to compute results. Behind the scenes, Spark automatically distributes the data in each RDD across the cluster and parallelizes the operations performed on it.

Users can create an RDD in two ways: by reading an external data set, or by distributing a collection of objects (such as a list or set) from the driver program.
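For example, in the pyspark shell (where sc is the SparkContext the shell provides), a minimal sketch of both approaches:

>>> # distribute a local collection from the driver program
>>> nums = sc.parallelize([1, 2, 3, 4])
>>> # read an external data set
>>> lines = sc.textFile("README.md")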

RDDs are read-only: once created, they cannot be modified.

An RDD can always be obtained again by recomputation.

For example, reading the data of the file README.md:

>>> lines = sc.textFile("README.md")
>>> pythonLines = lines.filter(lambda line: "Python" in line)

Spark computes RDDs lazily, that is, they are only evaluated the first time they are used in an action. The advantage shows in the example above: we define an RDD from a text file and then filter for the lines containing "Python". If Spark read and stored all the lines of the file as soon as we ran lines = sc.textFile(...), it would consume a lot of storage space, only for us to immediately filter most of the data out. Instead, once Spark knows the complete chain of transformations, it can compute only the data actually needed for the result. In fact, for the action first(), Spark only needs to scan the file until it finds the first matching line, rather than reading the entire file.

By default, Spark's RDDs are recomputed every time you run an action on them. If you want to reuse the same RDD in multiple actions, you can call RDD.persist() to let Spark cache the RDD.
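Continuing the example above, a short sketch of what persist() changes, in the pyspark shell:

>>> pythonLines.persist()  # ask Spark to cache this RDD once it is computed
>>> pythonLines.first()    # action: scans the file only until the first matching line
>>> pythonLines.count()    # later actions can reuse the cached data instead of re-reading the file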

Transformation operations

Each transformation produces a new RDD. Transformed RDDs are evaluated lazily: they are only computed when they are used in an action.

Transformation  Description
map             Passes each element of the data set through a function to form a new distributed data set
filter          Selects the elements of the data set for which the function returns true to form a new data set
flatMap         Similar to map, but each input item can be mapped to 0 or more output items
mapPartitions   Similar to map, but runs separately on each partition of the RDD
union           Returns a new data set that is the union of the source data set and the argument
distinct        Returns a new data set containing the distinct elements of the source data set
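As a small illustration of the last two rows, a sketch with made-up data in the pyspark shell (the order of distinct's output may vary):

>>> a = sc.parallelize([1, 2, 3])
>>> b = sc.parallelize([3, 4])
>>> a.union(b).collect()             # [1, 2, 3, 3, 4]; union keeps duplicates
>>> a.union(b).distinct().collect()  # the same members with duplicates removed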

Description

map:

The first time I saw the map operation, I was confused for a long time, because a Map in Java is a collection of k, v key-value pairs and I could not see the connection. It took a while to realize that this is not the same thing as Java's Map: here, map takes an RDD as input and produces an RDD as output, and the number of elements does not change.
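A quick sketch in the pyspark shell; note that the output has exactly as many members as the input:

>>> sc.parallelize([1, 2, 3]).map(lambda x: x * x).collect()
[1, 4, 9]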

flatMap:

flatMap is an RDD transformation that takes a function as input, applies that function to every member of the current RDD, and returns a new RDD. For each input element, the function returns a collection, and the members of that collection are flattened into the result, so one input can correspond to multiple outputs. For example, splitting "# Apache Spark" on spaces returns an array of 3 members, "#", "Apache", and "Spark"; those 3 members then become direct members of the new RDD.
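The same example as a sketch in the pyspark shell:

>>> sc.parallelize(["# Apache Spark"]).flatMap(lambda line: line.split(" ")).collect()
['#', 'Apache', 'Spark']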

Assuming there are N elements and M partitions, map will be called N times (once per element), while mapPartitions will be called M times (once per partition), with the function processing an entire partition at a time.
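A sketch of the difference in the pyspark shell, using a made-up per-partition function: with 4 elements in 2 partitions, the function below runs only twice, once per partition:

>>> def sum_partition(iterator):
...     # receives an iterator over one whole partition
...     yield sum(iterator)
...
>>> sc.parallelize([1, 2, 3, 4], 2).mapPartitions(sum_partition).collect()
[3, 7]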

Action operations

Actions are performed after the data has been transformed; the output is no longer an RDD but a result returned to the driver program.

Action                    Description
reduce                    Aggregates the members of the RDD with a reduce function; the result is a single value
collect                   Reads the entire RDD into the driver program as an Array; the RDD should generally not be too large
count                     Returns the number of members in the RDD
first                     Returns the first member of the RDD
take(n)                   Returns the first n members
saveAsTextFile(path)      Converts the RDD to text and saves it to path; there may be multiple output files; path can be a concrete path or an HDFS address
saveAsSequenceFile(path)  Similar to saveAsTextFile, but saves in SequenceFile format
countByKey                Only applicable to (k, v) RDDs; counts occurrences per key and returns (k, int) pairs
foreach(func)             Executes the callback func for each member of the RDD with no return value; often used to update an accumulator or write to an external storage system, so pay attention to variable scope
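A few of these actions as a sketch in the pyspark shell, with made-up data:

>>> nums = sc.parallelize([1, 2, 3, 4])
>>> nums.reduce(lambda a, b: a + b)   # 10
>>> nums.count()                      # 4
>>> nums.take(2)                      # [1, 2]
>>> pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 1)])
>>> pairs.countByKey()                # counts per key: {'a': 2, 'b': 1}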

Description:

collect:

collect can be used to retrieve the data of an entire RDD. If your program has filtered the RDD down to a small size and you want to process the data locally, you can use it. Use collect() only when the entire data set fits in the memory of a single machine; it therefore cannot be used on large-scale data sets.
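The pattern the text describes, as a sketch reusing the lines RDD from the earlier example: filter first, then bring the small result back to the driver:

>>> small = lines.filter(lambda line: "Python" in line)
>>> for line in small.collect():   # safe only if the result fits in driver memory
...     print(line)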

Origin: blog.csdn.net/samz5906/article/details/83547344