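We start with a simple requirement:
We want to analyze a website's access log, count the number of requests coming from each client IP address, and then use GeoIP information to see how visitors are distributed across countries and regions. As an example, here are a few log records from my own site: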
    121.205.198.92 - - [21/Feb/2014:00:00:07 +0800] "GET /archives/417.html HTTP/1.1" 200 11465 "http://shiyanjun.cn/archives/417.html/" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
    121.205.198.92 - - [21/Feb/2014:00:00:11 +0800] "POST /wp-comments-post.php HTTP/1.1" 302 26 "http://shiyanjun.cn/archives/417.html/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
    121.205.198.92 - - [21/Feb/2014:00:00:12 +0800] "GET /archives/417.html/ HTTP/1.1" 301 26 "http://shiyanjun.cn/archives/417.html/" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
    121.205.198.92 - - [21/Feb/2014:00:00:12 +0800] "GET /archives/417.html HTTP/1.1" 200 11465 "http://shiyanjun.cn/archives/417.html" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
    121.205.241.229 - - [21/Feb/2014:00:00:13 +0800] "GET /archives/526.html HTTP/1.1" 200 12080 "http://shiyanjun.cn/archives/526.html/" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
    121.205.241.229 - - [21/Feb/2014:00:00:15 +0800] "POST /wp-comments-post.php HTTP/1.1" 302 26 "http://shiyanjun.cn/archives/526.html/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
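Implementing the Spark Application in Java
The statistics program we implement performs the following steps:
Read the log data file from HDFS
Extract the first field (the IP address) from each line
Count the number of occurrences of each IP address
Sort the IP addresses by occurrence count in descending order
Look up the country of each IP address with the GeoIP library
Print the results, one per line, in the format: [country code] IP address count
The Spark application itself, implemented in Java, looks like this: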
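Before bringing Spark into the picture, the core of these steps (extract the first field, count per IP, sort descending) can be sketched in plain Java on an in-memory list. This is only an illustration under stated assumptions: the class name IpCountSketch and its two helper methods are hypothetical names, the sample lines are abbreviated versions of the log records above, and HDFS input and the GeoIP lookup are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Spark-free sketch of the counting and sorting steps.
public class IpCountSketch {

    // The client IP is the first space-separated field of an access-log line.
    public static String extractIp(String logLine) {
        return logLine.split(" ")[0];
    }

    // Counts requests per IP and returns the entries sorted by count, highest first.
    public static List<Map.Entry<String, Integer>> countAndSort(List<String> logLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : logLines) {
            counts.merge(extractIp(line), 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return sorted;
    }

    public static void main(String[] args) {
        // Abbreviated sample records (referrer and user agent omitted).
        List<String> lines = Arrays.asList(
                "121.205.198.92 - - [21/Feb/2014:00:00:07 +0800] \"GET /archives/417.html HTTP/1.1\" 200 11465",
                "121.205.198.92 - - [21/Feb/2014:00:00:11 +0800] \"POST /wp-comments-post.php HTTP/1.1\" 302 26",
                "121.205.241.229 - - [21/Feb/2014:00:00:13 +0800] \"GET /archives/526.html HTTP/1.1\" 200 12080");
        for (Map.Entry<String, Integer> entry : countAndSort(lines)) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

In the Spark version the same three steps are expressed as flatMap, map plus reduceByKey, and a sort of the collected result on the driver.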
    package org.shirdrn.spark.job;

    import java.io.File;
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.regex.Pattern;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import org.shirdrn.spark.job.maxmind.Country;
    import org.shirdrn.spark.job.maxmind.LookupService;

    import scala.Serializable;
    import scala.Tuple2;

    public class IPAddressStats implements Serializable {

        private static final long serialVersionUID = 8533489548835413763L;
        private static final Log LOG = LogFactory.getLog(IPAddressStats.class);
        private static final Pattern SPACE = Pattern.compile(" ");
        private transient LookupService lookupService;
        private transient final String geoIPFile;

        public IPAddressStats(String geoIPFile) {
            this.geoIPFile = geoIPFile;
            try {
                // lookupService: get the country code from an IP address
                File file = new File(this.geoIPFile);
                LOG.info("GeoIP file: " + file.getAbsolutePath());
                lookupService = new AdvancedLookupService(file, LookupService.GEOIP_MEMORY_CACHE);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        @SuppressWarnings("serial")
        public void stat(String[] args) {
            JavaSparkContext ctx = new JavaSparkContext(args[0], "IPAddressStats",
                    System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(IPAddressStats.class));
            JavaRDD<String> lines = ctx.textFile(args[1], 1);

            // split each line and extract the IP address field
            JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
                @Override
                public Iterable<String> call(String s) {
                    // 121.205.198.92 - - [21/Feb/2014:00:00:07 +0800] "GET /archives/417.html HTTP/1.1" 200 11465 "http://shiyanjun.cn/archives/417.html/" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
                    // ip address
                    return Arrays.asList(SPACE.split(s)[0]);
                }
            });

            // map
            JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String, String, Integer>() {
                @Override
                public Tuple2<String, Integer> call(String s) {
                    return new Tuple2<String, Integer>(s, 1);
                }
            });

            // reduce
            JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
                @Override
                public Integer call(Integer i1, Integer i2) {
                    return i1 + i2;
                }
            });

            List<Tuple2<String, Integer>> output = counts.collect();

            // sort the statistics result by value, in descending order
            Collections.sort(output, new Comparator<Tuple2<String, Integer>>() {
                @Override
                public int compare(Tuple2<String, Integer> t1, Tuple2<String, Integer> t2) {
                    if (t1._2 < t2._2) {
                        return 1;
                    } else if (t1._2 > t2._2) {
                        return -1;
                    }
                    return 0;
                }
            });

            writeTo(args, output);
        }

        private void writeTo(String[] args, List<Tuple2<String, Integer>> output) {
            for (Tuple2<?, ?> tuple : output) {
                Country country = lookupService.getCountry((String) tuple._1);
                LOG.info("[" + country.getCode() + "] " + tuple._1 + "\t" + tuple._2);
            }
        }

        public static void main(String[] args) {
            // ./bin/run-my-java-example org.shirdrn.spark.job.IPAddressStats spark://m1:7077 hdfs://m1:9000/user/shirdrn/wwwlog20140222.log /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/java-examples/GeoIP_DATABASE.dat
            if (args.length < 3) {
                System.err.println("Usage: IPAddressStats <master> <inFile> <GeoIPFile>");
                System.err.println("    Example: org.shirdrn.spark.job.IPAddressStats spark://m1:7077 hdfs://m1:9000/user/shirdrn/wwwlog20140222.log /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/java-examples/GeoIP_DATABASE.dat");
                System.exit(1);
            }

            String geoIPFile = args[2];
            IPAddressStats stats = new IPAddressStats(geoIPFile);
            stats.stat(args);

            System.exit(0);
        }
    }
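The comments in the code describe the implementation logic. We use Maven to manage and build the Java program; the dependencies declared in my pom configuration are as follows: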
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>0.9.0-incubating</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.16</version>
        </dependency>
        <dependency>
            <groupId>dnsjava</groupId>
            <artifactId>dnsjava</artifactId>
            <version>2.1.1</version>
        </dependency>
        <dependency>
            <groupId>commons-net</groupId>
            <artifactId>commons-net</artifactId>
            <version>3.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>1.2.1</version>
        </dependency>
    </dependencies>
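Note that when the program runs on a Spark cluster, the job we write must be serializable. The anonymous function classes created inside stat() hold a reference to the enclosing IPAddressStats instance, so that instance is serialized together with the tasks. A field that does not need to be serialized, or cannot be, can simply be marked transient. In the code above, for example, the type of the lookupService field does not implement the serialization interface, so the field is declared transient to keep it out of serialization. Otherwise, an error similar to the following may occur: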
    14/03/10 22:34:06 INFO scheduler.DAGScheduler: Failed to run collect at IPAddressStats.java:76
    Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.shirdrn.spark.job.IPAddressStats
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:794)
         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:737)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:741)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:740)
         at scala.collection.immutable.List.foreach(List.scala:318)
         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:740)
         at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:569)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
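The effect of transient can be reproduced without Spark, using only standard Java serialization. The following is a minimal sketch, not the code of this application: GeoLookup is a hypothetical stand-in for a non-serializable helper such as LookupService, and BrokenJob and WorkingJob are made-up classes that differ only in whether the helper field is transient.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class TransientFieldSketch {

    // Hypothetical helper that does NOT implement Serializable.
    static class GeoLookup {
    }

    // Keeps the helper in an ordinary field, so serializing the job fails.
    static class BrokenJob implements Serializable {
        private static final long serialVersionUID = 1L;
        final GeoLookup lookup = new GeoLookup();
    }

    // Marks the helper transient, so serialization skips it.
    static class WorkingJob implements Serializable {
        private static final long serialVersionUID = 1L;
        final String geoFile;
        transient GeoLookup lookup = new GeoLookup();

        WorkingJob(String geoFile) {
            this.geoFile = geoFile;
        }
    }

    static byte[] serialize(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    // Returns false when Java serialization rejects the object graph.
    static boolean isSerializable(Object value) {
        try {
            serialize(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Serializes and deserializes a value, roughly what happens when a task is shipped to a worker.
    static <T> T roundTrip(T value) {
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(serialize(value)))) {
            @SuppressWarnings("unchecked")
            T copy = (T) in.readObject();
            return copy;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isSerializable(new BrokenJob()));   // prints "false"
        WorkingJob copy = roundTrip(new WorkingJob("GeoIP_DATABASE.dat"));
        System.out.println(copy.geoFile);                      // prints "GeoIP_DATABASE.dat"
        System.out.println(copy.lookup == null);               // prints "true"
    }
}
```

The last line is the caveat to keep in mind: a transient field comes back as null after deserialization. In IPAddressStats this is harmless because lookupService is only used in writeTo(), which runs on the driver after collect(); code that needs such a helper inside a task would have to re-create it there.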
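Running the Java Program on a Spark Cluster
As mentioned, I use Maven to manage and build the Java program. Once the code above is written, I package it with Maven's maven-assembly-plugin, configured as follows: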
    <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
            <archive>
                <manifest>
                    <mainClass>org.shirdrn.spark.job.UserAgentStats</mainClass>
                </manifest>
            </archive>
            <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
            <excludes>
                <exclude>*.properties</exclude>
                <exclude>*.xml</exclude>
            </excludes>
        </configuration>
        <executions>
            <execution>
                <id>make-assembly</id>
                <phase>package</phase>
                <goals>
                    <goal>single</goal>
                </goals>
            </execution>
        </executions>
    </plugin>
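This bundles all dependency libraries into the program package. Finally, copy the JAR file to a Linux machine (it does not have to be the Master node of the Spark cluster); it is enough that Spark's environment variables are configured correctly on that machine. After unpacking the Spark distribution you will find the script bin/run-example. We can adapt this script so that its paths point to our Java program package (change the variable EXAMPLES_DIR and the parts that depend on where our JAR file is stored) and then use it to run the program; the run command later in this post invokes the adapted script as bin/run-my-java-example. The script looks like this: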
    cygwin=false
    case "`uname`" in
        CYGWIN*) cygwin=true;;
    esac

    SCALA_VERSION=2.10

    # Figure out where the Scala framework is installed
    FWDIR="$(cd `dirname $0`/..; pwd)"

    # Export this as SPARK_HOME
    export SPARK_HOME="$FWDIR"

    # Load environment variables from conf/spark-env.sh, if it exists
    if [ -e "$FWDIR/conf/spark-env.sh" ] ; then
      . $FWDIR/conf/spark-env.sh
    fi

    if [ -z "$1" ]; then
      echo "Usage: run-example <example-class> [<args>]" >&2
      exit 1
    fi

    # Figure out the JAR file that our examples were packaged into. This includes a bit of a hack
    # to avoid the -sources and -doc packages that are built by publish-local.
    EXAMPLES_DIR="$FWDIR"/java-examples
    SPARK_EXAMPLES_JAR=""
    if [ -e "$EXAMPLES_DIR"/*.jar ]; then
      export SPARK_EXAMPLES_JAR=`ls "$EXAMPLES_DIR"/*.jar`
    fi
    if [[ -z $SPARK_EXAMPLES_JAR ]]; then
      echo "Failed to find Spark examples assembly in $FWDIR/examples/target" >&2
      echo "You need to build Spark with sbt/sbt assembly before running this program" >&2
      exit 1
    fi

    # Since the examples JAR ideally shouldn't include spark-core (that dependency should be
    # "provided"), also add our standard Spark classpath, built using compute-classpath.sh.
    CLASSPATH=`$FWDIR/bin/compute-classpath.sh`
    CLASSPATH="$SPARK_EXAMPLES_JAR:$CLASSPATH"

    if $cygwin; then
        CLASSPATH=`cygpath -wp $CLASSPATH`
        export SPARK_EXAMPLES_JAR=`cygpath -w $SPARK_EXAMPLES_JAR`
    fi

    # Find java binary
    if [ -n "${JAVA_HOME}" ]; then
      RUNNER="${JAVA_HOME}/bin/java"
    else
      if [ `command -v java` ]; then
        RUNNER="java"
      else
        echo "JAVA_HOME is not set" >&2
        exit 1
      fi
    fi

    # Set JAVA_OPTS to be able to load native libraries and to set heap size
    JAVA_OPTS="$SPARK_JAVA_OPTS"
    JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH"
    # Load extra JAVA_OPTS from conf/java-opts, if it exists
    if [ -e "$FWDIR/conf/java-opts" ] ; then
      JAVA_OPTS="$JAVA_OPTS `cat $FWDIR/conf/java-opts`"
    fi
    export JAVA_OPTS

    if [ "$SPARK_PRINT_LAUNCH_COMMAND" == "1" ]; then
      echo -n "Spark Command: "
      echo "$RUNNER" -cp "$CLASSPATH" $JAVA_OPTS "$@"
      echo "========================================"
      echo
    fi

    exec "$RUNNER" -cp "$CLASSPATH" $JAVA_OPTS "$@"
To run our Java program on Spark, execute the following commands:
    cd /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1
    ./bin/run-my-java-example org.shirdrn.spark.job.IPAddressStats spark://m1:7077 hdfs://m1:9000/user/shirdrn/wwwlog20140222.log /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/java-examples/GeoIP_DATABASE.dat
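The class org.shirdrn.spark.job.IPAddressStats takes three arguments:
Spark cluster master URL: in my case, spark://m1:7077
Input file path: application-specific; here I read the file hdfs://m1:9000/user/shirdrn/wwwlog20140222.log from HDFS
GeoIP database file: application-specific; an external file used to determine the country an IP address belongs to
If the program runs without errors, the console prints its run log, for example: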
    14/03/10 22:17:24 INFO job.IPAddressStats: GeoIP file: /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/java-examples/GeoIP_DATABASE.dat
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/java-examples/spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop1.0.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    14/03/10 22:17:25 INFO slf4j.Slf4jLogger: Slf4jLogger started
    14/03/10 22:17:25 INFO Remoting: Starting remoting
    14/03/10 22:17:25 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@m1:57379]
    14/03/10 22:17:25 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@m1:57379]
    14/03/10 22:17:25 INFO spark.SparkEnv: Registering BlockManagerMaster
    14/03/10 22:17:25 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140310221725-c1cb
    14/03/10 22:17:25 INFO storage.MemoryStore: MemoryStore started with capacity 143.8 MB.
    14/03/10 22:17:25 INFO network.ConnectionManager: Bound socket to port 45189 with id = ConnectionManagerId(m1,45189)
    14/03/10 22:17:25 INFO storage.BlockManagerMaster: Trying to register BlockManager
    14/03/10 22:17:25 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager m1:45189 with 143.8 MB RAM
    14/03/10 22:17:25 INFO storage.BlockManagerMaster: Registered BlockManager
    14/03/10 22:17:25 INFO spark.HttpServer: Starting HTTP Server
    14/03/10 22:17:25 INFO server.Server: jetty-7.x.y-SNAPSHOT
    14/03/10 22:17:25 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:49186
    14/03/10 22:17:25 INFO broadcast.HttpBroadcast: Broadcast server started at http://10.95.3.56:49186
    14/03/10 22:17:25 INFO spark.SparkEnv: Registering MapOutputTracker
    14/03/10 22:17:25 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-56c3e30d-a01b-4752-83d1-af1609ab2370
    14/03/10 22:17:25 INFO spark.HttpServer: Starting HTTP Server
    14/03/10 22:17:25 INFO server.Server: jetty-7.x.y-SNAPSHOT
    14/03/10 22:17:25 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:52073
    14/03/10 22:17:26 INFO server.Server: jetty-7.x.y-SNAPSHOT
    14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage/rdd,null}
    14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage,null}
    14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/stage,null}
    14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/pool,null}
    14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages,null}
    14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/environment,null}
    14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/executors,null}
    14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/metrics/json,null}
    14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/static,null}
    14/03/10 22:17:26 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/,null}
    14/03/10 22:17:26 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
    14/03/10 22:17:26 INFO ui.SparkUI: Started Spark Web UI at http://m1:4040
    14/03/10 22:17:26 INFO spark.SparkContext: Added JAR /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/java-examples/spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar at http://10.95.3.56:52073/jars/spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar with timestamp 1394515046396
    14/03/10 22:17:26 INFO client.AppClient$ClientActor: Connecting to master spark://m1:7077...
    14/03/10 22:17:26 INFO storage.MemoryStore: ensureFreeSpace(60341) called with curMem=0, maxMem=150837657
    14/03/10 22:17:26 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 58.9 KB, free 143.8 MB)
    14/03/10 22:17:26 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140310221726-0000
    14/03/10 22:17:27 INFO client.AppClient$ClientActor: Executor added: app-20140310221726-0000/0 on worker-20140310221648-s1-52544 (s1:52544) with 1 cores
    14/03/10 22:17:27 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140310221726-0000/0 on hostPort s1:52544 with 1 cores, 512.0 MB RAM
    14/03/10 22:17:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    14/03/10 22:17:27 WARN snappy.LoadSnappy: Snappy native library not loaded
    14/03/10 22:17:27 INFO client.AppClient$ClientActor: Executor updated: app-20140310221726-0000/0 is now RUNNING
    14/03/10 22:17:27 INFO mapred.FileInputFormat: Total input paths to process : 1
    14/03/10 22:17:27 INFO spark.SparkContext: Starting job: collect at IPAddressStats.java:77
    14/03/10 22:17:27 INFO scheduler.DAGScheduler: Registering RDD 4 (reduceByKey at IPAddressStats.java:70)
    14/03/10 22:17:27 INFO scheduler.DAGScheduler: Got job 0 (collect at IPAddressStats.java:77) with 1 output partitions (allowLocal=false)
    14/03/10 22:17:27 INFO scheduler.DAGScheduler: Final stage: Stage 0 (collect at IPAddressStats.java:77)
    14/03/10 22:17:27 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 1)
    14/03/10 22:17:27 INFO scheduler.DAGScheduler: Missing parents: List(Stage 1)
    14/03/10 22:17:27 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[4] at reduceByKey at IPAddressStats.java:70), which has no missing parents
    14/03/10 22:17:27 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[4] at reduceByKey at IPAddressStats.java:70)
    14/03/10 22:17:27 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
    14/03/10 22:17:28 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@s1:59233/user/Executor#-671170811] with ID 0
    14/03/10 22:17:28 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID 0 on executor 0: s1 (PROCESS_LOCAL)
    14/03/10 22:17:28 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as 2396 bytes in 5 ms
    14/03/10 22:17:29 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager s1:47282 with 297.0 MB RAM
    14/03/10 22:17:32 INFO scheduler.TaskSetManager: Finished TID 0 in 3376 ms on s1 (progress: 0/1)
    14/03/10 22:17:32 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(1, 0)
    14/03/10 22:17:32 INFO scheduler.DAGScheduler: Stage 1 (reduceByKey at IPAddressStats.java:70) finished in 4.420 s
    14/03/10 22:17:32 INFO scheduler.DAGScheduler: looking for newly runnable stages
    14/03/10 22:17:32 INFO scheduler.DAGScheduler: running: Set()
    14/03/10 22:17:32 INFO scheduler.DAGScheduler: waiting: Set(Stage 0)
    14/03/10 22:17:32 INFO scheduler.DAGScheduler: failed: Set()
    14/03/10 22:17:32 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 1.0 from pool
    14/03/10 22:17:32 INFO scheduler.DAGScheduler: Missing parents for Stage 0: List()
    14/03/10 22:17:32 INFO scheduler.DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[6] at reduceByKey at IPAddressStats.java:70), which is now runnable
    14/03/10 22:17:32 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[6] at reduceByKey at IPAddressStats.java:70)
    14/03/10 22:17:32 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
    14/03/10 22:17:32 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 1 on executor 0: s1 (PROCESS_LOCAL)
    14/03/10 22:17:32 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 2255 bytes in 1 ms
    14/03/10 22:17:32 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@s1:33534
    14/03/10 22:17:32 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 120 bytes
    14/03/10 22:17:32 INFO scheduler.TaskSetManager: Finished TID 1 in 282 ms on s1 (progress: 0/1)
    14/03/10 22:17:32 INFO scheduler.DAGScheduler: Completed ResultTask(0, 0)
    14/03/10 22:17:32 INFO scheduler.DAGScheduler: Stage 0 (collect at IPAddressStats.java:77) finished in 0.314 s
    14/03/10 22:17:32 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 0.0 from pool
    14/03/10 22:17:32 INFO spark.SparkContext: Job finished: collect at IPAddressStats.java:77, took 4.870958309 s
    14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 58.246.49.218	312
    14/03/10 22:17:32 INFO job.IPAddressStats: [KR] 1.234.83.77	300
    14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 120.43.11.16	212
    14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 110.85.72.254	207
    14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 27.150.229.134	185
    14/03/10 22:17:32 INFO job.IPAddressStats: [HK] 180.178.52.181	181
    14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 120.37.210.212	180
    14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 222.77.226.83	176
    14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 120.43.9.19	165
    14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 120.43.11.205	169
    ...
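The program prints one line per IP address, while the original goal was the distribution of visitors by country. One possible follow-up, sketched here in plain Java and not part of the application above, is to sum the per-IP counts by country code. The class name CountryRollupSketch, the method totalsByCountry, and the regular expression are assumptions made for this illustration; the expression only relies on the "[country code] IP address count" format that writeTo() logs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing step: total request counts per country code.
public class CountryRollupSketch {

    // Matches a trailing "[CC] <ip> <count>" and ignores any log prefix before it.
    private static final Pattern RESULT =
            Pattern.compile("\\[([^\\]]+)\\]\\s+(\\S+)\\s+(\\d+)\\s*$");

    // Sums the counts of all matching lines per country code; other lines are skipped.
    public static Map<String, Integer> totalsByCountry(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            Matcher m = RESULT.matcher(line);
            if (m.find()) {
                totals.merge(m.group(1), Integer.parseInt(m.group(3)), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // The first four result lines from the run log above.
        List<String> lines = Arrays.asList(
                "14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 58.246.49.218\t312",
                "14/03/10 22:17:32 INFO job.IPAddressStats: [KR] 1.234.83.77\t300",
                "14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 120.43.11.16\t212",
                "14/03/10 22:17:32 INFO job.IPAddressStats: [CN] 110.85.72.254\t207");
        System.out.println(totalsByCountry(lines)); // prints {CN=731, KR=300}
    }
}
```

For a large result set the same roll-up could instead be done inside the job, before collect(), but that would require making the GeoIP lookup available on the workers.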
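We can also check the status of the running application through the web console: port 8080 on the Master node (for example, http://m1:8080/) shows the status of the applications on the cluster.
One more note: if you develop Spark applications in Java with Eclipse on Unix, you can also connect to the Spark cluster directly from Eclipse and submit the application to the cluster for processing.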