版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/hochoy/article/details/80599234
源码:
/** * Return a sampled subset of this RDD. * * @param withReplacement can elements be sampled multiple times (replaced when sampled out) * @param fraction expected size of the sample as a fraction of this RDD's size * without replacement: probability that each element is chosen; fraction must be [0, 1] * with replacement: expected number of times each element is chosen; fraction must be >= 0 * @param seed seed for the random number generator */ def sample( withReplacement: Boolean, fraction: Double, seed: Long = Utils.random.nextLong): RDD[T] = withScope { require(fraction >= 0.0, "Negative fraction value: " + fraction) if (withReplacement) { new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed) } else { new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed) } }
参数说明:
withReplacement :取样后是否将元素放回。
true:元素放回 ,返回的子集会有重复,可以被多次抽样;
false:元素不放回 ,返回的子集没有重复
fraction:期望样本的大小作为RDD大小的一部分,
seed:随机数生成器的种子
示例:
object sample{ def main(args: Array[String]) { val conf = new SparkConf().setMaster("local[2]").setAppName(this.getClass.getName) val sc = new SparkContext(conf) val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9)) rdd1.sample(false,0.6).collect().mkString(",").map(print)// 输出 2,3,6,8,9 rdd1.sample(true,0.6).collect().mkString(".").map(print) // 输出 2.4.4.6 } }