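Blacklist filtering works by left-joining the DStream built from the access log with an RDD built from the blacklist, then dropping every record whose IP matched a blacklist entry.
Step-by-step walkthrough:
Access log entries: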
20190102,192.168.10.101
20190102,192.168.10.102
20190102,192.168.10.103
Convert the access log into a DStream of (ip, log line) pairs:
==> (192.168.10.101: 20190102,192.168.10.101)(192.168.10.102: 20190102,192.168.10.102)(192.168.10.103: 20190102,192.168.10.103)
Blacklist (assume the following IPs are known to be blacklisted):
192.168.10.101
192.168.10.102
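Convert the blacklist into an RDD of (ip, true) pairs:
==> (192.168.10.101: true)(192.168.10.102: true)
Left-join each batch of the DStream with the RDD (x marks a record that is dropped):
(192.168.10.101: [<20190102,192.168.10.101>, <true>]) x
(192.168.10.102: [<20190102,192.168.10.102>, <true>]) x
(192.168.10.103: [<20190102,192.168.10.103>, <None, treated as false>]) ==> kept; output the first element of the value tuple, i.e. the original log line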
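The same steps can be tried without a cluster. The following is a minimal sketch that mimics the left join with plain Scala collections, using the sample data from the walkthrough. The names `LeftJoinSketch`, `leftJoin`, and `filterBlacklisted` are made up for illustration and are not part of Spark; the only point is to show why a missing right-hand side (`None`) means the record is kept.

```scala
object LeftJoinSketch {
  // Sample data taken from the walkthrough above.
  val logLines: Seq[String] = Seq(
    "20190102,192.168.10.101",
    "20190102,192.168.10.102",
    "20190102,192.168.10.103"
  )

  // Blacklist keyed by IP, mirroring the (ip, true) pairs of the blacklist RDD.
  val blacklist: Map[String, Boolean] =
    Seq("192.168.10.101", "192.168.10.102").map(ip => (ip, true)).toMap

  // Mimics a left outer join: every left record is kept, and the right side
  // is Some(flag) on a match or None when the key is absent.
  def leftJoin(
      left: Seq[(String, String)],
      right: Map[String, Boolean]
  ): Seq[(String, (String, Option[Boolean]))] =
    left.map { case (ip, line) => (ip, (line, right.get(ip))) }

  // Key each line by its IP, join, drop blacklisted records, return the lines.
  def filterBlacklisted(lines: Seq[String]): Seq[String] = {
    val keyed = lines.map(line => (line.split(",")(1), line))
    leftJoin(keyed, blacklist)
      .filter { case (_, (_, flag)) => !flag.getOrElse(false) }
      .map { case (_, (line, _)) => line }
  }

  def main(args: Array[String]): Unit =
    filterBlacklisted(logLines).foreach(println) // prints 20190102,192.168.10.103
}
```

Only the record for 192.168.10.103 survives, because it is the only key with no match on the blacklist side.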
The Scala implementation is as follows:
package com.fyy.spark.streaming
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * @Title: TransformApp
 * @ProjectName SparkStreamingProject
 * @Description: Blacklist filtering
 * @author fanyanyan
 */
object TransformApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("TransformApp").setMaster("local[*]")

    // A StreamingContext takes two arguments: the SparkConf and the batch interval.
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    /**
     * Build the blacklist RDD:
     * (192.168.10.101: true)(192.168.10.102: true)
     */
    val blacks = List("192.168.10.101", "192.168.10.102")
    val blacksRDD = ssc.sparkContext.parallelize(blacks).map(x => (x, true))

    /**
     * Read the log records as a DStream (a real production system would need
     * some extra processing here). Each line has the form "date,ip":
     * 20190102,192.168.10.101
     * 20190102,192.168.10.102
     * 20190102,192.168.10.103
     */
    val loglines = ssc.socketTextStream("01.server.bd", 6666)

    /**
     * 1) Key each log line by its IP to form (ip, line) pairs:
     * (192.168.10.101: 20190102,192.168.10.101)
     * (192.168.10.102: 20190102,192.168.10.102)
     * (192.168.10.103: 20190102,192.168.10.103)
     *
     * 2) Left-join each batch with the blacklist RDD (x = dropped):
     * (192.168.10.101: [<20190102,192.168.10.101>, <true>]) x
     * (192.168.10.102: [<20190102,192.168.10.102>, <true>]) x
     * (192.168.10.103: [<20190102,192.168.10.103>, <None>]) ==> kept, emit the log line
     */
    val logs = loglines.map(x => (x.split(",")(1), x)).transform { rdd =>
      rdd.leftOuterJoin(blacksRDD)
        // An IP that is not on the blacklist has None on the right side, treated as false.
        .filter { case (_, (_, flag)) => !flag.getOrElse(false) }
        // Keep only the original log line.
        .map { case (_, (line, _)) => line }
    }

    // Print the filtered result.
    logs.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
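
The listing notes that a real production system would need some extra processing on the incoming log lines. One obvious case is malformed input: `x.split(",")(1)` throws an `ArrayIndexOutOfBoundsException` for a line that contains no comma. Below is a minimal sketch of a defensive parser, assuming every valid line has exactly the form `date,ip`. `LogLineParser` and `parse` are hypothetical names introduced here, not part of the original program or of Spark.

```scala
object LogLineParser {
  // Hypothetical helper: returns Some((ip, line)) for a well-formed
  // "date,ip" line and None for anything else, instead of throwing.
  def parse(line: String): Option[(String, String)] =
    line.split(",").map(_.trim) match {
      case Array(date, ip) if date.nonEmpty && ip.nonEmpty => Some((ip, line))
      case _                                               => None
    }

  def main(args: Array[String]): Unit = {
    println(parse("20190102,192.168.10.101")) // well-formed line
    println(parse("not a log line"))          // no comma, rejected
  }
}
```

In the streaming job, this could probably replace the `map` step with something like `loglines.flatMap(line => LogLineParser.parse(line))`, so that malformed lines are skipped rather than failing the batch.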