Implementing a custom partitioner in Spark

Hello everyone:

In Spark development, data sometimes needs to be written out separately according to a certain field, which calls for Spark's custom partitioner feature.

First, the test data. Put it in the file "C:\test\url1.log"; its contents are as follows:

20170721101954	http://sport.sina.cn/sport/race/nba.shtml
20170721101954	http://sport.sina.cn/sport/watch.shtml
20170721101954	http://car.sina.cn/car/fps.shtml
20170721101954	http://sport.sina.cn/sport/watch.shtml

Explanation: only four records are used to demonstrate the effect. The goal is to group them by the host name of the visited web page.
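For reference, the host extraction used later relies on java.net.URL. A minimal standalone sketch (the object name HostDemo is hypothetical, just for illustration):

import java.net.URL

object HostDemo {
  def main(args: Array[String]): Unit = {
    // getHost returns only the host part of the URL
    val host = new URL("http://sport.sina.cn/sport/race/nba.shtml").getHost
    println(host) // prints: sport.sina.cn
  }
}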

The custom partitioner is implemented in Spark with the following code:

package day04

import java.net.URL
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import scala.collection.mutable

/**
  * Purpose: demonstrate a custom partitioner defined in application code.
  */
class NewPartition(fornum: Array[String]) extends Partitioner {
  val partmap = new mutable.HashMap[String, Int]() // host -> partition number
  var count = 0 // the next partition number to assign

  // Loop over the hosts so that each host gets its own partition number
  for (i <- fornum) {
    partmap += (i -> count)
    count += 1
  }

  // One partition per host; Spark calls this to determine the partition count
  override def numPartitions: Int = fornum.length

  // Spark calls this to obtain the partition number for each key
  override def getPartition(key: Any): Int = {
    partmap.getOrElse(key.toString, 0)
  }
}


object Partition {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UrlCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("c://test//url1.log")
    // Each input line: 20170721101954	http://sport.sina.cn/sport/race/nba.shtml
    val text = lines.map(line => {
      val f = line.split("\t")
      (f(1), 1) // key by URL, pair each occurrence with a count of 1
    })
    val text1 = text.reduceByKey(_ + _) // count the occurrences of each URL
    // println(text1.collect.toBuffer)
    // http://sport.sina.cn/sport/race/nba.shtml   1
    val text2 = text1.map(t => {
      val url = t._1
      val host = new URL(url).getHost // extract the host name from the URL
      (host, (url, t._2)) // re-key each record by host
    })

    // Collect the distinct hosts; they determine the number of partitions
    val fornum = text2.map(_._1).distinct().collect()
    // println(fornum)
    val np = new NewPartition(fornum)
    // partitionBy applies the custom partitioner, then the result is saved
    text2.partitionBy(np).saveAsTextFile("c://test//output2")

    sc.stop() // shut down the SparkContext
  }
}
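Before using it in a job, the partitioner can also be sanity-checked on its own. A minimal sketch (for example in the Scala REPL), assuming the NewPartition class above is compiled; the numbers shown assume the hosts are passed in exactly this order:

val np = new NewPartition(Array("sport.sina.cn", "car.sina.cn"))
println(np.numPartitions)                 // 2
println(np.getPartition("sport.sina.cn")) // 0
println(np.getPartition("car.sina.cn"))   // 1
println(np.getPartition("unknown.host"))  // 0 -- getOrElse falls back to partition 0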

Takeaways:

1. Extract the host name from each URL.
2. text2 is the RDD that is actually partitioned; the earlier splitting and aggregation exist only to produce it.
3. The last three steps of main are the core of the program: obtain the number of partitions, construct the partitioner, and save the results after partitioning. Future jobs only need to build on this three-step template.

Running the program generates an output2 folder in the test directory of the C drive.

The folder contains two files, part-00000 and part-00001. Since the sample data contains only two host names (car.sina.cn and sport.sina.cn), this matches the requirement of one partition per host.

The contents of part-00000:

(car.sina.cn,(http://car.sina.cn/car/fps.shtml,1))

The contents of part-00001:

(sport.sina.cn,(http://sport.sina.cn/sport/watch.shtml,2))
(sport.sina.cn,(http://sport.sina.cn/sport/race/nba.shtml,1))

The data shows one visit to car.sina.cn and three visits to sport.sina.cn (two to the watch page and one to the nba page). This is consistent with the sample data and meets expectations.
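The same check can be done without opening the output files. Below is a sketch using Spark's mapPartitionsWithIndex, assuming the text2 RDD and the np partitioner from the code above; note that which host lands in which partition depends on the order in which distinct().collect() returned the hosts:

// Tag every record with the index of the partition it landed in
text2.partitionBy(np)
  .mapPartitionsWithIndex((idx, iter) => iter.map(rec => (idx, rec)))
  .collect()
  .foreach(println)
// e.g. (0,(car.sina.cn,(http://car.sina.cn/car/fps.shtml,1)))
//      (1,(sport.sina.cn,(http://sport.sina.cn/sport/watch.shtml,2)))
//      (1,(sport.sina.cn,(http://sport.sina.cn/sport/race/nba.shtml,1)))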

Note: the partitioner class has a fixed shape. Reuse it as a template and call it directly rather than writing it from scratch each time.
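The same template generalizes beyond host names to any fixed list of keys. A hypothetical sketch (KeyListPartitioner is not part of Spark, just an illustration of the pattern):

import org.apache.spark.Partitioner

// Hypothetical generic template: one partition per expected key
class KeyListPartitioner[K](keys: Array[K]) extends Partitioner {
  private val index = keys.zipWithIndex.toMap // key -> partition number
  override def numPartitions: Int = keys.length
  override def getPartition(key: Any): Int =
    index.getOrElse(key.asInstanceOf[K], 0) // unknown keys fall back to partition 0
}

It would be constructed the same way as above, e.g. new KeyListPartitioner(fornum).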

Source: blog.csdn.net/zhaoxiangchong/article/details/78409235