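Spark Case Studies in Practice, Part 3
I. Simple Log Analysis
1. Given the log records below, extract the status at the start of each line, count how often each status occurs, and sort the counts from lowest to highest.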
INFO This is a message with content
INFO This is some other content
INFO Here are more messages
WARN This is a warning
ERROR Something bad happened
WARN More details on the bad thing
INFO back to normal messages
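As a reference point, the expected result for the sample above can be sketched with plain Scala collections, without Spark. The names LogCountSketch and countByLevel are hypothetical and chosen only for this illustration; the sketch assumes every line starts with its status followed by a space.

```scala
object LogCountSketch {
  // The seven sample log lines from the listing above.
  val sampleLines: Seq[String] = Seq(
    "INFO This is a message with content",
    "INFO This is some other content",
    "INFO Here are more messages",
    "WARN This is a warning",
    "ERROR Something bad happened",
    "WARN More details on the bad thing",
    "INFO back to normal messages"
  )

  // Hypothetical helper: count lines per status, sorted ascending by count.
  def countByLevel(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .map(line => line.takeWhile(_ != ' ')) // status = text before the first space
      .groupBy(identity)                     // status -> all occurrences of it
      .map { case (level, hits) => (level, hits.size) }
      .toSeq
      .sortBy(_._2)                          // lowest count first

  def main(args: Array[String]): Unit =
    countByLevel(sampleLines).foreach(println)
    // prints (ERROR,1), (WARN,2), (INFO,4), one per line
}
```

The Spark program performs the same steps on an RDD: map produces the (status, 1) pairs, reduceByKey sums them, and sortBy orders the result.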
2. The code is as follows:
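import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * 1. A simple log analysis program.
 * 2. Reads lines from a text file and counts how often each status appears in the log.
 */
object EasyLogAnalyze {
  // Number of blank lines seen. Updating this from inside map only works with
  // a local master, where driver and executor share one JVM; on a cluster,
  // use a Spark accumulator instead.
  var blankLines = 0

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("EasyLogAnalyze").setMaster("local")
    val sc = new SparkContext(conf)
    // Read the input file whose path is given in args(0)
    val text: RDD[String] = sc.textFile(args(0))
    text.foreach(println)
    /*
     1. line is the parameter of the anonymous function
     2. the body inside {} is the processing applied to each line
     */
    val res1: RDD[(String, Int)] = text.map(line => {
      var symbol: String = null
      if (line != "") {
        // Take the first token as the status. Fall back to the whole line when
        // it contains no space, because substring(0, -1) would throw.
        val end = line.indexOf(" ")
        symbol = if (end >= 0) line.substring(0, end) else line
      } else {
        blankLines += 1 // blank lines keep the null key
      }
      (symbol, 1) // pair the status with a count of 1
    })
    // Sum the counts per status
    val res2: RDD[(String, Int)] = res1.reduceByKey(_ + _)
    // Sort ascending by count, then collect to the driver so the print order is deterministic
    res2.sortBy(_._2).collect().foreach(println)
    sc.stop()
  }
}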
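The output is as follows (the null key collects the blank lines in the input file, which are not shown in the sample above):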
(ERROR,1)
(null,2)
(WARN,2)
(INFO,4)
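The null key appears because the program maps every blank line to a null status. One possible alternative, sketched here in plain Scala, is to return an Option so that blank lines are dropped instead of being counted under null. The names LevelParsingSketch and parseLevel are hypothetical, and this is not what the program above does.

```scala
object LevelParsingSketch {
  // Hypothetical helper: the first space-delimited token of a line,
  // or None when the line is empty or contains only whitespace.
  def parseLevel(line: String): Option[String] = {
    val trimmed = line.trim
    if (trimmed.isEmpty) None
    else Some(trimmed.takeWhile(_ != ' ')) // also safe when the line has no space
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("INFO ok", "", "WARN careful", "ERROR", "")
    // flatMap keeps the Some values and silently drops the None values
    val levels = lines.flatMap(line => parseLevel(line))
    println(levels.mkString(","))          // INFO,WARN,ERROR
    println(lines.count(_.trim.isEmpty))   // 2 blank lines
  }
}
```

With this approach the (null,2) row would no longer appear in the result; the blank-line count would have to be reported separately.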
Extracting the code that processes each line into a named function, and then passing that function to map as an argument, gives:
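import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * 1. A simple log analysis program.
 * 2. Reads lines from a text file and counts how often each status appears in the log.
 */
object EasyLogAnalyze {
  // Number of blank lines seen. Updating this from inside map only works with
  // a local master, where driver and executor share one JVM; on a cluster,
  // use a Spark accumulator instead.
  var blankLines = 0

  /*
   1. Defines a function that processes one line of text.
   2. An RDD[String] is a distributed collection of String values, not a String;
      map applies this function to each String element in turn.
   */
  def process(line: String): (String, Int) = {
    var symbol: String = null
    if (line != "") {
      // Take the first token as the status. Fall back to the whole line when
      // it contains no space, because substring(0, -1) would throw.
      val end = line.indexOf(" ")
      symbol = if (end >= 0) line.substring(0, end) else line
    } else {
      blankLines += 1 // blank lines keep the null key
    }
    (symbol, 1) // pair the status with a count of 1
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("EasyLogAnalyze").setMaster("local")
    val sc = new SparkContext(conf)
    // Read the input file whose path is given in args(0)
    val text: RDD[String] = sc.textFile(args(0))
    text.foreach(println)
    /*
     1. line is the parameter of the anonymous function
     2. the function body simply delegates to process
     */
    val res1: RDD[(String, Int)] = text.map((line: String) => process(line))
    // Sum the counts per status
    val res2: RDD[(String, Int)] = res1.reduceByKey(_ + _)
    // Sort ascending by count, then collect to the driver so the print order is deterministic
    res2.sortBy(_._2).collect().foreach(println)
    sc.stop()
  }
}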
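The explicit lambda passed to map can also be written by passing the method name directly, because Scala converts a method into a function value wherever a function is expected. The following is a minimal sketch on a plain List rather than an RDD; MapFunctionSketch and firstToken are hypothetical names, and firstToken omits the blank-line handling of process.

```scala
object MapFunctionSketch {
  // Same shape as process above: one line in, one (status, 1) pair out.
  def firstToken(line: String): (String, Int) =
    (line.takeWhile(_ != ' '), 1)

  def main(args: Array[String]): Unit = {
    val lines = List("INFO a", "WARN b", "INFO c")

    // Explicit anonymous function that delegates to the method
    val explicit = lines.map((line: String) => firstToken(line))
    // Method name passed directly; the compiler turns it into a function
    val shorthand = lines.map(firstToken)

    println(explicit == shorthand) // true
    println(shorthand.mkString(" ")) // (INFO,1) (WARN,1) (INFO,1)
  }
}
```

Since map on an RDD also takes a function from the element type to the result type, the call in the program above could likewise be shortened to text.map(process).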