Hadoop Counter


Hadoop Counter is a very practical built-in feature of Hadoop: it tallies a global count of some quantity. For example, if a MapReduce job sends messages through Kafka, Hadoop Counters can track how many messages were sent successfully, how many failed, and how many were sent in total. In fact, every MapReduce job prints its Counters when it finishes:

16/06/05 00:25:19 INFO mapreduce.Job: Counters: 50
File System Counters
	FILE: Number of bytes read=42
	FILE: Number of bytes written=185609
	FILE: Number of read operations=0
	FILE: Number of large read operations=0
	FILE: Number of write operations=0
	HDFS: Number of bytes read=139
	HDFS: Number of bytes written=8
	HDFS: Number of read operations=6
	HDFS: Number of large read operations=0
	HDFS: Number of write operations=2
Job Counters 
	Launched map tasks=1
	Launched reduce tasks=1
	Data-local map tasks=1
	Total time spent by all maps in occupied slots (ms)=3585
	Total time spent by all reduces in occupied slots (ms)=3174
	Total time spent by all map tasks (ms)=3585
	Total time spent by all reduce tasks (ms)=3174
	Total vcore-seconds taken by all map tasks=3585
	Total vcore-seconds taken by all reduce tasks=3174
	Total megabyte-seconds taken by all map tasks=3671040
	Total megabyte-seconds taken by all reduce tasks=3250176
Map-Reduce Framework
	Map input records=3
	Map output records=3
	Map output bytes=30
	Map output materialized bytes=42
	Input split bytes=102
	Combine input records=0
	Combine output records=0
	Reduce input groups=1
	Reduce shuffle bytes=42
	Reduce input records=3
	Reduce output records=1
	Spilled Records=6
	Shuffled Maps =1
	Failed Shuffles=0
	Merged Map outputs=1
	GC time elapsed (ms)=190
	CPU time spent (ms)=1310
	Physical memory (bytes) snapshot=218619904
	Virtual memory (bytes) snapshot=725868544
	Total committed heap usage (bytes)=136908800
Hello_Counter
	Number=3
Shuffle Errors
	BAD_ID=0
	CONNECTION=0
	IO_ERROR=0
	WRONG_LENGTH=0
	WRONG_MAP=0
	WRONG_REDUCE=0
File Input Format Counters 
	Bytes Read=37
File Output Format Counters 
	Bytes Written=8


Each Counter in turn belongs to a Counter Group. In the listing above:

  1. 16/06/05 00:25:19 INFO mapreduce.Job: Counters: 50 means the job reported 50 Counters in total.
  2. File System Counters, Job Counters, Map-Reduce Framework, and so on are the Counter Groups.
  3. The entries under each group, such as FILE: Number of bytes read=42 and Launched map tasks=1, are the Counters belonging to that Group.
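
Besides the (group, name) string pair used in the example below, a Counter can also be declared as a Java enum: the enum's class name is reported as the Counter Group and each constant becomes a Counter inside it. A minimal sketch of my own; the HelloStats enum and this mapper are hypothetical and not part of the original example:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EnumCounterMapper extends Mapper<Object, Text, Text, IntWritable> {

    // The enum's class name shows up as the Counter Group;
    // each constant is one Counter inside that group.
    public enum HelloStats { HELLO_LINES, OTHER_LINES }

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Enum overload of getCounter(); no group/name strings needed.
        if (value.toString().contains("hello")) {
            context.getCounter(HelloStats.HELLO_LINES).increment(1);
        } else {
            context.getCounter(HelloStats.OTHER_LINES).increment(1);
        }
        context.write(value, new IntWritable(1));
    }
}

The driver reads it back through the same enum: job.getCounters().findCounter(HelloStats.HELLO_LINES).getValue().
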
The following uses a simple WordCount job as an example and counts the occurrences of "hello":
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Counter Group and Counter name; both appear verbatim in the job output.
        public static String COUNT = "Hello_Counter";
        public static String NUMB = "Number";

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] line = value.toString().trim().split(" ");
            for (String itr : line) {
                word.set(itr);
                if (itr.trim().equals("hello")) {
                    // Increment the custom counter once for every "hello" seen.
                    context.getCounter(COUNT, NUMB).increment(1);
                    context.write(word, one);
                }
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        job.waitForCompletion(true);

        // Read the custom counter back after the job has finished; either form works:
        // Counter count = job.getCounters().findCounter(TokenizerMapper.COUNT, TokenizerMapper.NUMB);
        Counter count = job.getCounters().getGroup(TokenizerMapper.COUNT).findCounter(TokenizerMapper.NUMB);
        System.out.println("==============" + count.getValue());
    }
}
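
Returning to the Kafka scenario from the introduction, counters slot into a mapper that sends records to Kafka in the same way. A minimal sketch of my own, not from the original post; the topic name, broker address, and counter names are all assumptions:

import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSendMapper extends Mapper<Object, Text, NullWritable, NullWritable> {

    public static final String GROUP = "Kafka_Counter"; // hypothetical group name

    private KafkaProducer<String, String> producer;

    @Override
    protected void setup(Context context) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<String, String>(props);
    }

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        context.getCounter(GROUP, "Total").increment(1);
        try {
            // Synchronous send, so success or failure is known inside map().
            producer.send(new ProducerRecord<String, String>("my-topic", value.toString())).get();
            context.getCounter(GROUP, "Sent").increment(1);
        } catch (Exception e) {
            context.getCounter(GROUP, "Failed").increment(1);
        }
    }

    @Override
    protected void cleanup(Context context) {
        producer.close();
    }
}

After waitForCompletion, job.getCounters().findCounter(KafkaSendMapper.GROUP, "Sent") retrieves the success count just as in the WordCount driver above.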

For more of the Counter API, refer to the Hadoop Javadoc.
Output:
16/06/05 00:24:55 INFO client.RMProxy: Connecting to ResourceManager at slave-1/192.168.253.11:8032
16/06/05 00:24:56 INFO input.FileInputFormat: Total input paths to process : 1
16/06/05 00:24:56 INFO mapreduce.JobSubmitter: number of splits:1
16/06/05 00:24:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1465111446224_0001
16/06/05 00:24:57 INFO impl.YarnClientImpl: Submitted application application_1465111446224_0001
16/06/05 00:24:58 INFO mapreduce.Job: The url to track the job: http://slave-1:8088/proxy/application_1465111446224_0001/
16/06/05 00:24:58 INFO mapreduce.Job: Running job: job_1465111446224_0001
16/06/05 00:25:08 INFO mapreduce.Job: Job job_1465111446224_0001 running in uber mode : false
16/06/05 00:25:08 INFO mapreduce.Job:  map 0% reduce 0%
16/06/05 00:25:14 INFO mapreduce.Job:  map 100% reduce 0%
16/06/05 00:25:19 INFO mapreduce.Job:  map 100% reduce 100%
16/06/05 00:25:19 INFO mapreduce.Job: Job job_1465111446224_0001 completed successfully
16/06/05 00:25:19 INFO mapreduce.Job: Counters: 50
	(... counter listing identical to the one at the top of this post, including Hello_Counter: Number=3 ...)
==============3

As shown in the output, the Counter Group is Hello_Counter and its Counter is Number=3.
Note: some Hadoop versions do not list custom Counters alongside File System Counters, Job Counters, and the other built-in groups in this summary, so it is best to print them out yourself at the end of the job anyway.
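
In that spirit, here is a minimal sketch of my own (not from the original post) that walks every group and counter after waitForCompletion and prints them, since Counters is Iterable<CounterGroup> and each CounterGroup is Iterable<Counter>:

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class CounterDump {

    // Print every group and every counter of a finished job,
    // so custom counters show up even if the summary omits them.
    public static void dump(Job job) throws Exception {
        Counters counters = job.getCounters();
        for (CounterGroup group : counters) {
            System.out.println(group.getDisplayName());
            for (Counter counter : group) {
                System.out.println("\t" + counter.getDisplayName()
                        + "=" + counter.getValue());
            }
        }
    }
}

Calling CounterDump.dump(job) at the end of main in the WordCount example above would reproduce a listing in the same shape as the job's own counter summary.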





