Spark (5) - WordCount Execution Process

With the file read/write process covered, we can read a file and run a simple "hello spark" program.

WordCount Execution Process

val lines = sc.textFile("D:/resources/README.md")
val words = lines.flatMap(_.split(" ")).filter(word => word != "")
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach(wordNum => println(wordNum._1 + ":" + wordNum._2))

textFile() builds a HadoopRDD (a subclass of RDD) to read the file and strips the Hadoop keys, so the resulting RDD holds only text lines, not (key, value) pairs.

flatMap() on that RDD returns a MapPartitionsRDD (a subclass of RDD).
RDD's map() also returns a MapPartitionsRDD, and reduceByKey() is not defined on RDD at all.
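
These return types are easy to check in the spark-shell: toDebugString prints the RDD lineage, which also shows the HadoopRDD that actually reads the file. A quick sketch (output abbreviated; the partition count is illustrative):

val lines = sc.textFile("D:/resources/README.md")
println(lines.toDebugString)
// (2) D:/resources/README.md MapPartitionsRDD[1] at textFile ...
//  |  D:/resources/README.md HadoopRDD[0] at textFile ...   <- reads the file as (key, value) records

val words = lines.flatMap(_.split(" ")).filter(word => word != "")
println(words.getClass.getSimpleName)   // MapPartitionsRDD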

reduceByKey()

This relies on Scala's implicit conversions.
An implicit conversion from type S to type T is defined by an implicit value of function type S => T, or by an implicit method that can be converted to such a value.

Scala 2.10 introduced a feature called implicit classes.
An implicit class is a class marked with the implicit keyword; within the enclosing scope, its primary constructor can be used as an implicit conversion.
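
A minimal sketch of the implicit-class form (the names below are invented for illustration):

object ImplicitClassDemo extends App {
  // The primary constructor Int => RichCounter is registered as an implicit conversion.
  implicit class RichCounter(n: Int) {
    def times(body: => Unit): Unit = (1 to n).foreach(_ => body)
  }

  // Int has no `times` method; the compiler rewrites the call to new RichCounter(3).times(...)
  3.times(println("hello implicit class"))
}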

Implicit conversions let an object call methods that are not defined on its own class.

// Example from https://www.cnblogs.com/MOBIN/p/5351900.html (comments translated)
class SwingType {
  def wantLearned(sw: String) = println("The rabbit has learned " + sw)
}

object swimming {
  // Implicit view: converts an AminalType into a SwingType.
  implicit def learningType(s: AminalType) = new SwingType
}

class AminalType

object AminalType extends App {
  import swimming._

  val rabbit = new AminalType
  // The compiler sees that rabbit has no wantLearned method, so it searches the enclosing scope
  // for an implicit view that would make the call compile. Having found the implicit learningType
  // method, it converts rabbit into an object that does have the method (a SwingType)
  // and then calls wantLearned on it.
  rabbit.wantLearned("breaststroke") // prints: The rabbit has learned breaststroke
}

When reduceByKey() is called on an RDD of pairs, the compiler searches the enclosing scope for an implicit conversion.
It finds the conversion method in the RDD companion object and uses it to turn the RDD into a PairRDDFunctions, an object that does have reduceByKey().
The call then goes to PairRDDFunctions.reduceByKey().
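
In recent Spark versions that implicit lives in the RDD companion object and looks roughly like this:

  // object RDD: the implicit view the compiler picks up for an RDD[(K, V)].
  implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
    new PairRDDFunctions(rdd)
  }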

  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }
  // ......
  // reduceByKey is ultimately implemented by combineByKeyWithClassTag:
  def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }
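
The intermediate overload that reduceByKey(func) reaches is tiny: for WordCount's reduceByKey(_ + _), the value itself is the initial combiner and _ + _ serves as both mergeValue and mergeCombiners. Roughly, from the same PairRDDFunctions source:

  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    // createCombiner = identity, mergeValue = mergeCombiners = the user's reduce function
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }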

foreach()

RDD's foreach() is an action: it calls SparkContext's runJob(), which in turn updates the progressBar and calls the RDD's doCheckpoint(). (In the WordCount snippet above the action actually triggered is collect(); the foreach there runs on the driver over the collected Array, but collect() goes through the same runJob() path.)

  /**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: T => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }
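
For comparison, collect(), which the WordCount snippet actually invokes, is also a thin wrapper around sc.runJob() (quoted roughly from the RDD source):

  /**
   * Return an array that contains all of the elements in this RDD.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }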

runJob()

SparkContext's runJob() calls the runJob() method of the DAGScheduler created during SparkContext initialization.
The parameter func: (TaskContext, Iterator[T]) => U represents the function passed in by actions such as foreach(): it wraps the operation to perform on each record of the RDD and is applied to each partition's iterator.

  /**
   * Run a function on a given set of partitions in an RDD and pass the results to the given
   * handler function. This is the main entry point for all actions in Spark.
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }
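
The full signature above is reached through a chain of simpler overloads; the one used by foreach() takes only an Iterator[T] => U and wraps it into the (TaskContext, Iterator[T]) => U form (abbreviated from the SparkContext source; details vary slightly by version):

  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    // Run on every partition of the RDD.
    runJob(rdd, func, 0 until rdd.partitions.length)
  }

  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: Iterator[T] => U,
      partitions: Seq[Int]): Array[U] = {
    val cleanedFunc = clean(func)
    // Wrap the iterator-only function so it matches (TaskContext, Iterator[T]) => U.
    runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
  }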

DAGScheduler.runJob()

DAGScheduler.runJob() runs an action job on the given RDD and passes every result back to the resultHandler function. Its main work is to call submitJob() and then block on the returned JobWaiter until the job completes.
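
Abbreviated from the DAGScheduler source (logging and error propagation omitted; details vary by version):

  def runJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): Unit = {
    // Submit the job and get back a JobWaiter that tracks its completion.
    val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
    // Block until the JobWaiter reports success or failure.
    ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
    // ... log the outcome and rethrow the failure, if any ...
  }

submitJob() validates the requested partitions, allocates a jobId, and posts a JobSubmitted event to the scheduler's event loop: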

  def submitJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): JobWaiter[U] = {
    // Check to make sure we are not launching a task on a partition that does not exist.
    val maxPartitions = rdd.partitions.length
    partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        "Attempting to access a non-existent partition: " + p + ". " +
          "Total number of partitions: " + maxPartitions)
    }

    val jobId = nextJobId.getAndIncrement()
    if (partitions.size == 0) {
      // Return immediately if the job is running 0 tasks
      return new JobWaiter[U](this, jobId, 0, resultHandler)
    }

    assert(partitions.size > 0)
    val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
    val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
    eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, callSite, waiter,
      SerializationUtils.clone(properties)))
    waiter
  }

Reposted from blog.csdn.net/rover2002/article/details/106076701