Spark算子:RDD创建的方式

创建RDD大体分为两类方式:(1)通过集合创建;(2)通过外部存储创建。

1、通过集合方式

(1)parallelize:def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]

参数seq指Seq集合,numSlices指分区数。

scala> var rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at :21
 
scala> rdd.collect
res3: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
 
scala> rdd.partitions.size
res4: Int = 15
 
//设置RDD为3个分区
scala> var rdd2 = sc.parallelize(1 to 10,3)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at :21
 
scala> rdd2.collect
res5: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
 
scala> rdd2.partitions.size
res6: Int = 3

(2)makeRDD

1)def makeRDD[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]
2)def makeRDD[T](seq: Seq[(T, Seq[String])])(implicit arg0: ClassTag[T]): RDD[T]

方法1)与parallelize方法一样,方法2)可以指定定每一个分区的preferredLocations。

scala> var collect = Seq((1 to 10,Seq("slave007.lxw1234.com","slave002.lxw1234.com")),
(11 to 15,Seq("slave013.lxw1234.com","slave015.lxw1234.com")))
collect: Seq[(scala.collection.immutable.Range.Inclusive, Seq[String])] = List((Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
List(slave007.lxw1234.com, slave002.lxw1234.com)), (Range(11, 12, 13, 14, 15),List(slave013.lxw1234.com, slave015.lxw1234.com)))
 
scala> var rdd = sc.makeRDD(collect)
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[6] at makeRDD at :23
 
scala> rdd.partitions.size
res33: Int = 2
 
scala> rdd.preferredLocations(rdd.partitions(0))
res34: Seq[String] = List(slave007.lxw1234.com, slave002.lxw1234.com)
 
scala> rdd.preferredLocations(rdd.partitions(1))
res35: Seq[String] = List(slave013.lxw1234.com, slave015.lxw1234.com)

2、通过外部存储创建

(1)HDFS上文件格式:textFile

//从hdfs文件创建
scala> var rdd = sc.textFile("hdfs:///tmp/lxw1234/1.txt")
rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at textFile at :21
 
scala> rdd.count
res48: Long = 4
 
//从本地文件创建
scala> var rdd = sc.textFile("file:///etc/hadoop/conf/core-site.xml")
rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[28] at textFile at :21
 
scala> rdd.count
res49: Long = 97 

(2)HDFS上文件格式:hadoopFile、sequenceFile、objectFile和newAPIHadoopFile

(3)Hadoop接口API创建:hadoopRDD、newAPIHadoopRDD

示例:从HBase创建RDD

scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}
 
scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 
scala> import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HBaseAdmin
 
scala> val conf = HBaseConfiguration.create()
scala> conf.set(TableInputFormat.INPUT_TABLE,"lxw1234")
scala> var hbaseRDD = sc.newAPIHadoopRDD(
conf,classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat],classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],classOf[org.apache.hadoop.hbase.client.Result])
 
scala> hbaseRDD.count
res52: Long = 1
 

猜你喜欢

转载自blog.csdn.net/csmnjk/article/details/82797518