Implementing PageRank with Spark aggregateMessages()

家行hang, 2018/12/14
Note:
This is the second major assignment for Prof. Tao Dajiang's course.
The author is Wu Jiahang (吴家行), Class 1605, School of Software, Beijing Jiaotong University.
For reference only. Do not copy it to hand in as your own assignment; anyone found doing so will be held responsible.
The original aggregateMessages-based PageRank code is copyrighted by Wu Jiahang; cite the source when quoting it, and any infringement will be pursued.

Requirements

Use aggregateMessages to implement PageRank, test the algorithm with relation.txt, develop the code, and write a report.
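As a quick orientation, aggregateMessages takes two functions: a sendMsg function that runs on every edge triplet and may send a message to the source and/or destination vertex, and a mergeMsg function that combines all messages arriving at a vertex. The per-edge PageRank contribution rank(src)/outDegree(src) maps onto this directly. The minimal sketch below assumes org.apache.spark.graphx._ is imported and uses a hypothetical graph g whose vertex attribute is the pair (outDegree, rank); it is not the assignment code itself.

// Sketch only: g is assumed to be a Graph[(Int, Double), Int], vertex attribute = (outDegree, rank).
val contrib: VertexRDD[Double] = g.aggregateMessages[Double](
  triplet => triplet.sendToDst(triplet.srcAttr._2 / triplet.srcAttr._1),  // sendMsg: rank / outDegree
  _ + _                                                                   // mergeMsg: sum the incoming contributions
)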

Basics of Spark covered in class

  • Put the “.txt” file on HDFS
hadoop fs -put ./sample.txt /txt
sample.txt:

|- weiboID -|- userID -|- ... -|- time -|
1,1,1,1,2007-01-01 01:11:11
2,2,1,2,2007-01-01 01:11:11
3,1,2,3,2007-01-01 01:11:11
4,2,1,2,2006-01-01 01:11:11
  • Keep only the lines that split into more than 4 fields, then count the number of weibo per user
val weibo = sc.textFile("/txt/sample.txt").map(_.split(",")).filter(_.length>4).map(a=>(a(1),1)).reduceByKey(_+_)

weibo.collect()
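For the four sample lines above (each splits into five comma-separated fields, so all pass the filter), weibo.collect() should return Array(("1",2), ("2",2)): users 1 and 2 each posted two weibo (element order may vary).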
  • Count the number of weibo each user sent per day
val weibo = sc.textFile("/txt/sample.txt").map(_.split(",")).filter(_.length>4).map(a=>((a(1),a(4).split(" ")(0)),1)).reduceByKey(_+_)

weibo.collect()
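On the same sample this yields Array((("1","2007-01-01"),2), (("2","2007-01-01"),1), (("2","2006-01-01"),1)) (element order may vary).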
  • Graph

    relation.txt
    
    |- Vertex_1 -|- Vertex_2 -|
    10457	104594
    10457	104792
    10457	1010513714
    ...
    
    • Import the package we need
    import org.apache.spark._
    import org.apache.spark.graphx._
    // To make some of the examples work we will also need RDD
    import org.apache.spark.rdd.RDD
    
    • Load the file
    var graph=GraphLoader.edgeListFile(sc,"/txt/relation.txt")
    
    • Attributes
    graph.vertices
    /*
    e.g. (10457,1) -- duplicates are eliminated, so each vertex appears only once.
    */
    graph.edges
    /*
    e.g. Edge(10457,104594,1) -- duplicates are eliminated, so each edge appears only once.
    */
    graph.triplets
    /*
    e.g. ((10457,1),(104594,1),1) -- duplicates are eliminated, so each triplet appears only once.
    */
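    To peek at a few concrete values of these three views, the standard RDD take action can be used (the printed examples assume the relation.txt graph loaded above):
    
    graph.vertices.take(3).foreach(println)   // e.g. (10457,1)
    graph.edges.take(3).foreach(println)      // e.g. Edge(10457,104594,1)
    graph.triplets.take(3).foreach(println)   // e.g. ((10457,1),(104594,1),1)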
    
    • OutDegree
    val outDegree = graph.aggregateMessages[Int](triplet=>triplet.sendToSrc(1),_+_)
    
    • InDegree
    val inDegree = graph.aggregateMessages[Int](triplet=>triplet.sendToDst(1),_+_)
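    
    As a quick cross-check (not required by the assignment), GraphX's built-in graph.outDegrees and graph.inDegrees views should agree with the two aggregateMessages results above, since vertices with degree 0 are absent from both:
    
    // Compare the hand-built out-degrees with the built-in view; the count should be 0.
    val mismatches = outDegree.join(graph.outDegrees).filter{ case (_, (a, b)) => a != b }.count()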
    
    • PageRank
    var ranks = graph.pageRank(0.1).vertices
    var k = ranks.takeOrdered(50)(Ordering[Double].on[(org.apache.spark.graphx.VertexId,Double)](_._2).reverse)
    

Running Result:

k: Array[(org.apache.spark.graphx.VertexId, Double)] = Array((1192329374,804.1055791064388), (1266321801,701.5269156596019), (1223762662,530.5372567002639), (1182391231,519.1673220758365), (1644395354,492.81861478726347), (1821898647,480.1640166858465), (1197161814,462.7333519524177), (1713926427,443.3658094952714), (1618051664,418.08405831303855), (2115302210,400.2226361296474), (1780417033,352.01631152874336), (1193491727,314.8578049829823), (1182389073,314.3784097885132), (1097201945,302.5312560603966), (1813080181,296.5630797924638), (1362607654,293.2201508072429), (1653689003,289.410495024051), (1638782947,282.46914436089656), (1275017594,276.3518204746084), (1615743184,267.6963991532803), (1718455577,261.4263168930853), (1644572034,259.85896343919865), (1660209951,258.880411151857...
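For reference, graph.pageRank(0.1) iterates the standard damped update rank(v) = 0.15 + 0.85 * sum over in-neighbours u of rank(u)/outDegree(u), stopping once per-vertex changes fall below the tolerance 0.1; this is the same (unnormalized) rule that the hand-written versions below implement. GraphX also offers a fixed-iteration variant, which can be useful for comparison (a sketch, not used in this report; 20 iterations is an arbitrary choice):

    var staticRanks = graph.staticPageRank(20).vertices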
  • Some setup notes to record

    • Start hdfs
    cd /usr/local/hadoop
    sbin/start-dfs.sh
    
    • Start spark
    cd /usr/local/spark
    sbin/start-all.sh 
    

    Hint: To test whether it runs successfully:

    jps # if Master and Worker appear, congratulations! Your Spark has started.
    

    Hint: To enter the spark shell:

    cd /usr/local/spark
    bin/spark-shell
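    
    Hint: instead of pasting the code below line by line, a saved script can be loaded from inside the shell with the REPL's :load command (assuming the file is saved as PageRank.scala in the directory where spark-shell was started):
    
    scala> :load PageRank.scala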
    

Code

  • PageRank.scala (does not satisfy the requirement: it does not use the “aggregateMessages” method)
import org.apache.spark._
import org.apache.spark.graphx._
// To make some of the examples work we will also need RDD
import org.apache.spark.rdd.RDD

var graph=GraphLoader.edgeListFile(sc,"/txt/relation.txt")

// The following part can be replaced by the "aggregateMessages" method.

//groupByKey:(A,B),(A,C)->(A,(B,C)) 
val links = graph.triplets.map(e=>(e.toTuple._1._1, e.toTuple._2._1)).distinct().groupByKey().cache()

//mapValues:(A,1.00),(B,1.00)
var ranks = links.mapValues(v => 1.00)

//iterate:
for(i <- 1 to 1000){
    // links.join(ranks): (A, ((B,C), 1.0))
    // links.join(ranks).values: ((B,C), 1.0)
    val contrib = links.join(ranks).values.flatMap {
        case (urls, rank) => {
            val size = urls.size
            urls.map(url => (url, rank/size))
        }
    }
    ranks = contrib.reduceByKey(_+_).mapValues(0.15+0.85*_)
}

var k = ranks.takeOrdered(50)(Ordering[Double].on[(org.apache.spark.graphx.VertexId, Double)](_._2).reverse)

Running Result:

k: Array[(org.apache.spark.graphx.VertexId, Double)] = Array((1821898647,103.55775723841994), (1192329374,95.34359755549576), (1266321801,84.4653672592073), (1713926427,60.050158312515144), (1644395354,59.08140449614647), (1618051664,58.588584348150704), (1197161814,57.24569086918721), (1888410492,49.35202259030924), (1961470985,49.287601461033), (2664548142,49.18104117814592), (2115302210,49.09348413606172), (1677313954,47.447037012585525), (1750925964,45.34299202002895), (1182391231,44.843919608703324), (1642591402,44.36001257092747), (1813656737,44.146827649935055), (1865645065,43.313227149592485), (1704392962,42.14409582636746), (1653689003,41.20287637931315), (1223762662,40.96179721957556), (1799360594,40.26616062038427), (1671526850,40.150053826087344), (1792149095,39.895956821306...
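One practical note on the 1000-iteration loop above: because Spark is lazy, the final takeOrdered has to evaluate a lineage of 1000 chained joins, which can be slow and, on large graphs, can even overflow the driver stack. A common remedy (a sketch, not part of the submitted code; the checkpoint directory path is only an example) is to periodically materialize and checkpoint the intermediate ranks:

sc.setCheckpointDir("/tmp/spark-checkpoints")   // example path; any HDFS or local directory works

// Inside the for loop, e.g. every 100 iterations:
if (i % 100 == 0) {
    ranks.cache()
    ranks.checkpoint()
    ranks.count()   // force evaluation so the checkpoint is written and the lineage truncated
}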
  • PageRank.scala (satisfies the requirement: it uses the “aggregateMessages” method)
    Note: the original aggregateMessages-based PageRank implementation below is copyrighted by Wu Jiahang; cite the source when quoting it, and any infringement will be pursued.
import org.apache.spark._
import org.apache.spark.graphx._
// To make some of the examples work we will also need RDD
import org.apache.spark.rdd.RDD
import scala.util.control.Breaks._

var graph=GraphLoader.edgeListFile(sc,"/txt/relation.txt")

// The following implements PageRank with the "aggregateMessages" method.

var newedges = graph.edges
var vertices = graph.vertices
// Count out-edges per vertex; vertices with out-degree 0 do not appear in this VertexRDD.
var outDegree = graph.aggregateMessages[Int](triplet=>triplet.sendToSrc(1),_+_)
// Initial rank 1.0 for every vertex.
var rank = vertices.mapValues(v => 1.00)
var newvertices = outDegree.join(rank)
var newgraph = Graph(newvertices, newedges)
// Triplets now look like ((10457,(211,1.0)),(104594,(1,1.0)),1).
// Vertices missing from newvertices (those with out-degree 0) get a null attribute,
// so replace null with (0,1.0) to keep them in the graph.
var newvertices2_out = newgraph.vertices.mapValues(x=>{
    if(x==null){
        (0,1.0)
    }else{
        x
    }
})
newgraph = Graph(newvertices2_out, newedges)


// First pass: each vertex sends rank/outDegree to its out-neighbours.
var rank = newgraph.aggregateMessages[Double](triplet=>{
    triplet.sendToDst(triplet.srcAttr._2*(1.0/triplet.srcAttr._1))  // 1.0 avoids integer division by the Int out-degree
},_+_)

rank = rank.mapValues(0.15+0.85*_)  // only vertices that received at least one message (in-degree > 0) are here

// Vertices with no incoming edges received no message; keep their rank at 1.0.
var zero_indegree = vertices.mapValues(v=>1.00).minus(rank)

var r: VertexRDD[Double] = VertexRDD(zero_indegree.union(rank))
rank = r

var result=""
var i = 0
breakable {
    while(true){
        i=i+1
        var rank_previous = rank
        
        var newvertices = outDegree.join(rank)
        var newgraph = Graph(newvertices, newedges)
        // Re-attach the (outDegree, rank) attributes; vertices with out-degree 0 become (0,1.0).
        var newvertices2_out = newgraph.vertices.mapValues(x=>{
            if(x==null){
                (0,1.0)
            }else{
                x
            }
        })
        newgraph = Graph(newvertices2_out, newedges)

        rank = newgraph.aggregateMessages[Double](triplet=>{
            if(triplet.dstAttr!=null){
                triplet.sendToDst(triplet.srcAttr._2*(1.0/triplet.srcAttr._1))  // 1.0 avoids integer division
            }
        },_+_)
        rank = rank.mapValues(0.15+0.85*_)
        var zero_indegree = vertices.mapValues(v=>1.00).minus(rank)
        var r1: VertexRDD[Double] = VertexRDD(zero_indegree.union(rank))
        rank = r1
        // Termination check: count the vertices whose rank still changed by more than 0.1.
        var diff = rank_previous.join(rank).mapValues{ case (prev, curr) => math.abs(prev - curr) }
        var notConverged = diff.filter(_._2>0.1).count
        println("Iteration "+i+": vertices not yet converged: "+notConverged)
        result+=("Iteration "+i+": vertices not yet converged: "+notConverged+"\n")
        if(notConverged==0){
            break
        }
    }
}
result+=("===》共进行"+i+"次迭代")
println("结果出来啦!恭喜你,你的电脑没挂,以下是你的结果:\n"+result)
var k = rank.takeOrdered(10)(Ordering[Double].on[(org.apache.spark.graphx.VertexId, Double)](_._2).reverse)

Running Result:

scala> println("结果出来啦!恭喜你,你的电脑没挂,以下是你的结果:\n"+result)
结果出来啦!恭喜你,你的电脑没挂,以下是你的结果:
第1次迭代:不收敛的个数:17471
第2次迭代:不收敛的个数:210
第3次迭代:不收敛的个数:37
第4次迭代:不收敛的个数:13
第5次迭代:不收敛的个数:13
第6次迭代:不收敛的个数:13
第7次迭代:不收敛的个数:13
第8次迭代:不收敛的个数:13
第9次迭代:不收敛的个数:13
第10次迭代:不收敛的个数:13
第11次迭代:不收敛的个数:13
第12次迭代:不收敛的个数:13
第13次迭代:不收敛的个数:13
第14次迭代:不收敛的个数:5
第15次迭代:不收敛的个数:5
第16次迭代:不收敛的个数:5
第17次迭代:不收敛的个数:5
第18次迭代:不收敛的个数:2
第19次迭代:不收敛的个数:2
第20次迭代:不收敛的个数:0
===》共进行20次迭代

scala> var k = rank.takeOrdered(10)(Ordering[Double].on[(org.apache.spark.graphx.VertexId, Double)](_._2).reverse)
18/12/22 16:25:39 WARN ShippableVertexPartitionOps: Minus operations on two VertexPartitions with different indexes is slow.
18/12/22 16:25:39 WARN ShippableVertexPartitionOps: Minus operations on two VertexPartitions with different indexes is slow.
k: Array[(org.apache.spark.graphx.VertexId, Double)] = Array((1192329374,3023.30675), (1266321801,2821.644250000001), (1223762662,2484.553375), (1182391231,2325.348375), (1644395354,2062.25), (1821898647,1981.2343750000005), (1197161814,1805.8474999999999), (1713926427,1797.2625), (2115302210,1628.07), (1193491727,1569.5475))
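A quick sanity check that can be run once both implementations have finished (a sketch, not part of the graded output): the absolute rank values differ between the three versions because none of them normalizes the ranks, but the set of highly-ranked vertex IDs should largely coincide, as the top entries shown above already suggest.

// Compare the top-10 vertex IDs of the built-in pageRank and the aggregateMessages version.
val byRank = Ordering.by[(VertexId, Double), Double](_._2)
val topBuiltin = graph.pageRank(0.1).vertices.top(10)(byRank).map(_._1).toSet
val topCustom = rank.top(10)(byRank).map(_._1).toSet
println("Top-10 IDs in common: " + (topBuiltin intersect topCustom).size + " of 10")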


Reposted from blog.csdn.net/weixin_41580638/article/details/85010624