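Advanced MapReduce Programming

MapReduce counters and min/max values:

Counters

While a dataset is being processed by MapReduce, users often want to know how the job is running. Hadoop's built-in counters collect the main statistics of a job, which helps users understand how the program behaves and diagnose failures. The console output of a sample job, ending with its counter summary, looks like this: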

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/12/28 10:37:46 INFO client.RMProxy: Connecting to ResourceManager at datanode3/192.168.1.103:8032
18/12/28 10:37:48 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/12/28 10:37:50 INFO input.FileInputFormat: Total input paths to process : 2
18/12/28 10:37:50 INFO mapreduce.JobSubmitter: number of splits:2
18/12/28 10:37:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1545964109134_0001
18/12/28 10:37:53 INFO impl.YarnClientImpl: Submitted application application_1545964109134_0001
18/12/28 10:37:54 INFO mapreduce.Job: The url to track the job: http://datanode3:8088/proxy/application_1545964109134_0001/
18/12/28 10:37:54 INFO mapreduce.Job: Running job: job_1545964109134_0001
18/12/28 10:38:50 INFO mapreduce.Job: Job job_1545964109134_0001 running in uber mode : false
18/12/28 10:38:50 INFO mapreduce.Job:  map 0% reduce 0%
18/12/28 10:39:28 INFO mapreduce.Job:  map 100% reduce 0%
18/12/28 10:39:48 INFO mapreduce.Job:  map 100% reduce 100%
18/12/28 10:39:50 INFO mapreduce.Job: Job job_1545964109134_0001 completed successfully
18/12/28 10:39:51 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=78
                FILE: Number of bytes written=353015
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=258
                HDFS: Number of bytes written=31
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=67297
                Total time spent by all reduces in occupied slots (ms)=16699
                Total time spent by all map tasks (ms)=67297
                Total time spent by all reduce tasks (ms)=16699
                Total vcore-milliseconds taken by all map tasks=67297
                Total vcore-milliseconds taken by all reduce tasks=16699
                Total megabyte-milliseconds taken by all map tasks=68912128
                Total megabyte-milliseconds taken by all reduce tasks=17099776
        Map-Reduce Framework
                Map input records=8
                Map output records=8
                Map output bytes=78
                Map output materialized bytes=84
                Input split bytes=212
                Combine input records=8
                Combine output records=6
                Reduce input groups=4
                Reduce shuffle bytes=84
                Reduce input records=6
                Reduce output records=4
                Spilled Records=12
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=3303
                CPU time spent (ms)=8060
                Physical memory (bytes) snapshot=470183936
                Virtual memory (bytes) snapshot=6182424576
                Total committed heap usage (bytes)=261361664
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=46
        File Output Format Counters
                Bytes Written=31

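These lines are counts recorded while the program ran. For example, Map input records=8 means that the map tasks read 8 input records. The output also shows that the built-in counters are divided into several groups: for most counters, Hadoop keeps one group per major component.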
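To make the group structure concrete, the following sketch reads a counter summary like the one above into a nested map of group name to counter name to value. It is plain Java and does not use any Hadoop API; the class name CounterDumpParser and its parse method are hypothetical, and the sketch assumes that only the indented lines after "Counters: 49" are passed in, where a line without '=' starts a new group.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: parses the printed counter summary.
class CounterDumpParser {

    // Returns group name -> (counter name -> value), preserving the printed order.
    static Map<String, Map<String, Long>> parse(String dump) {
        Map<String, Map<String, Long>> groups = new LinkedHashMap<>();
        Map<String, Long> current = null;
        for (String raw : dump.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            // Use the last '=' so names such as "Shuffled Maps " are handled.
            int eq = line.lastIndexOf('=');
            if (eq < 0) {
                // Assumption: a line without '=' is a group header.
                current = new LinkedHashMap<>();
                groups.put(line, current);
            } else if (current != null) {
                String name = line.substring(0, eq).trim();
                long value = Long.parseLong(line.substring(eq + 1).trim());
                current.put(name, value);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        String dump = String.join("\n",
                "        Map-Reduce Framework",
                "                Map input records=8",
                "                Combine input records=8",
                "                Combine output records=6",
                "                Shuffled Maps =2",
                "        File Input Format Counters",
                "                Bytes Read=46");
        Map<String, Map<String, Long>> groups = parse(dump);
        // prints [Map-Reduce Framework, File Input Format Counters]
        System.out.println(groups.keySet());
        // prints 8
        System.out.println(groups.get("Map-Reduce Framework").get("Map input records"));
    }
}
```

Inside a real driver the same values are available programmatically from the job's counters, so a parser like this is only useful when nothing but the log text is at hand.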

Counter groups

Group	Name/Enum class
MapReduce task counters (Map-Reduce Framework)	org.apache.hadoop.mapreduce.TaskCounter
Filesystem counters (File System Counters)	org.apache.hadoop.mapreduce.FileSystemCounter
Input file task counters (File Input Format Counters)	org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
Output file counters (File Output Format Counters)	org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
Job counters (Job Counters)	org.apache.hadoop.mapreduce.JobCounter

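Most Hadoop components have corresponding counters that can be tracked to help resolve problems that occur during a run. From the application's point of view, these counters fall into task counters and job counters: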

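Task counters

Built-in MapReduce task counters

Counter	Description
Map input records (MAP_INPUT_RECORDS)	The number of input records consumed by all the maps in the job. Incremented every time a RecordReader reads a record and passes it to the map's map() method
Split raw bytes (SPLIT_RAW_BYTES)	The number of bytes of input split objects read by maps. These objects describe the split metadata (the offset and length within a file) rather than the split data itself, so the total size is small
Map output records (MAP_OUTPUT_RECORDS)	The number of map output records produced by all the maps in the job. Incremented every time the collect() method is called on a map's OutputCollector
Map output bytes (MAP_OUTPUT_BYTES)	The number of bytes of uncompressed output produced by all the maps in the job. Incremented every time the collect() method is called on a map's OutputCollector
Map output materialized bytes (MAP_OUTPUT_MATERIALIZED_BYTES)	The number of bytes of map output actually written to disk. If map output compression is enabled, this is reflected in the counter value
Combine input records (COMBINE_INPUT_RECORDS)	The number of input records consumed by all the combiners (if any) in the job. Incremented every time a value is read from the combiner's iterator. Note: this counter is the number of values consumed by the combiner, not the number of distinct key groups (which would not be a meaningful metric, because a combiner is not guaranteed to see exactly one group per key)
Combine output records (COMBINE_OUTPUT_RECORDS)	The number of output records produced by all the combiners (if any) in the job. Incremented every time the collect() method is called on a combiner's OutputCollector
Reduce input groups (REDUCE_INPUT_GROUPS)	The number of distinct key groups consumed by all the reducers in the job. Incremented every time the reducer's reduce() method is called
Reduce input records (REDUCE_INPUT_RECORDS)	The number of input records consumed by all the reducers in the job. Incremented every time a value is read from the reducer's iterator. Once the reducers have consumed all of their input, this value equals the "map output records" counter, provided that no combiner has reduced the number of records in between
Reduce output records (REDUCE_OUTPUT_RECORDS)	The number of reduce output records produced by all the reducers in the job. Incremented every time the collect() method is called on a reducer's OutputCollector
Reduce shuffle bytes (REDUCE_SHUFFLE_BYTES)	The number of bytes of map output copied by the shuffle to reducers
Spilled records (SPILLED_RECORDS)	The number of records spilled to disk by all map and reduce tasks in the job
CPU milliseconds (CPU_MILLISECONDS)	The cumulative CPU time of a task in milliseconds, as reported by /proc/cpuinfo
Physical memory bytes (PHYSICAL_MEMORY_BYTES)	The physical memory used by a task in bytes, as reported by /proc/meminfo
Virtual memory bytes (VIRTUAL_MEMORY_BYTES)	The virtual memory used by a task in bytes, as reported by /proc/meminfo
Committed heap bytes (COMMITTED_HEAP_BYTES)	The total amount of memory available in the JVM in bytes, as reported by Runtime.getRuntime().totalMemory()
GC time milliseconds (GC_TIME_MILLIS)	The time spent in garbage collection during task execution in milliseconds, as reported by GarbageCollectorMXBean.getCollectionTime()
Shuffled maps (SHUFFLED_MAPS)	The number of map output files transferred to reducers by the shuffle
Failed shuffles (FAILED_SHUFFLE)	The number of map output copy failures during the shuffle
Merged map outputs (MERGED_MAP_OUTPUTS)	The number of map output files merged on the reduce side during the shuffle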

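Built-in filesystem task counters

Counter	Description
Filesystem bytes read (BYTES_READ)	The number of bytes read from each filesystem by map and reduce tasks. There is one counter per filesystem, which may be local, HDFS, S3, and so on
Filesystem bytes written (BYTES_WRITTEN)	The number of bytes written to each filesystem by map and reduce tasks
Filesystem read operations (READ_OPS)	The number of read operations (for example open and file status) performed on each filesystem by map and reduce tasks
Filesystem large read operations (LARGE_READ_OPS)	The number of large read operations (for example listing a large directory) performed on each filesystem by map and reduce tasks
Filesystem write operations (WRITE_OPS)	The number of write operations (for example create and append) performed on each filesystem by map and reduce tasks

Built-in input file task counters

Counter	Description
Bytes read (BYTES_READ)	The number of bytes read by map tasks through FileInputFormat

Built-in output file task counters

Counter	Description
Bytes written (BYTES_WRITTEN)	The number of bytes written through FileOutputFormat by map tasks (for map-only jobs) or reduce tasks

Job counters

Built-in job counters

Counter	Description
Launched map tasks (TOTAL_LAUNCHED_MAPS)	The number of map tasks launched, including tasks started speculatively
Launched reduce tasks (TOTAL_LAUNCHED_REDUCES)	The number of reduce tasks launched, including tasks started speculatively
Launched uber tasks (TOTAL_LAUNCHED_UBERTASKS)	The number of uber tasks launched
Maps in uber tasks (NUM_UBER_SUBMAPS)	The number of maps in uber tasks
Reduces in uber tasks (NUM_UBER_SUBREDUCES)	The number of reduces in uber tasks
Failed map tasks (NUM_FAILED_MAPS)	The number of map tasks that failed
Failed reduce tasks (NUM_FAILED_REDUCES)	The number of reduce tasks that failed
Failed uber tasks (NUM_FAILED_UBERTASKS)	The number of uber tasks that failed
Killed map tasks (NUM_KILLED_MAPS)	The number of map tasks that were killed
Killed reduce tasks (NUM_KILLED_REDUCES)	The number of reduce tasks that were killed
Data-local map tasks (DATA_LOCAL_MAPS)	The number of map tasks that ran on the same node as their input data
Rack-local map tasks (RACK_LOCAL_MAPS)	The number of map tasks that ran on a node in the same rack as their input data, but not on the same node
Other local map tasks (OTHER_LOCAL_MAPS)	The number of map tasks that ran on a node in a different rack from their input data. Because inter-rack bandwidth is comparatively scarce, Hadoop tries to place map tasks close to their input data, so this count is usually low
Total time in map tasks (MILLIS_MAPS)	The total time taken running map tasks, in milliseconds, including tasks started speculatively. See also the related counters that measure core and memory usage (VCORES_MILLIS_MAPS and MB_MILLIS_MAPS)
Total time in reduce tasks (MILLIS_REDUCES)	The total time taken running reduce tasks, in milliseconds, including tasks started speculatively. See also the related counters that measure core and memory usage (VCORES_MILLIS_REDUCES and MB_MILLIS_REDUCES)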

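Custom counters

Hadoop's built-in counters are fairly comprehensive and make it convenient to monitor a running job. For requirements that are specific to a business case (for example, counting how often a certain condition occurs during processing), MapReduce also lets users define their own counters.

Procedure

Define a Java enumeration (enum) that represents the counter group: the name of the enum type is the group name, and each enum constant is a counter name.
Call getCounter on the Context instance and then increment(long incr) on the returned counter to add to the count.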
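The two steps can be sketched without a cluster. The class below is a plain-Java stand-in, not the Hadoop API: EnumCounterSketch and its increment and value methods are hypothetical names, and the enum is nested here only so the example fits in one file. It mimics the convention described above, where the enum type supplies the group name and each constant supplies a counter name.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for counter bookkeeping; all names here are hypothetical.
class EnumCounterSketch {

    // Step 1: the enum type is the group, its constants are the counters.
    enum ReportTest { ErroWord, GoodWord, ReduceReport }

    // group name -> (counter name -> value)
    private final Map<String, Map<String, Long>> groups = new TreeMap<>();

    // Step 2: mirrors the idea of context.getCounter(key).increment(amount).
    void increment(Enum<?> key, long amount) {
        String group = key.getDeclaringClass().getName();
        groups.computeIfAbsent(group, g -> new TreeMap<>())
              .merge(key.name(), amount, Long::sum);
    }

    // Returns the current value, or 0 if the counter was never incremented.
    long value(Enum<?> key) {
        return groups
                .getOrDefault(key.getDeclaringClass().getName(), Collections.emptyMap())
                .getOrDefault(key.name(), 0L);
    }

    public static void main(String[] args) {
        EnumCounterSketch counters = new EnumCounterSketch();
        int emitted = 0;
        for (String word : "GoodWord hello ErroWord GoodWord world".split(" ")) {
            if (word.equals("GoodWord")) {
                counters.increment(ReportTest.GoodWord, 1);
            } else if (word.equals("ErroWord")) {
                counters.increment(ReportTest.ErroWord, 1);
            } else {
                emitted++; // an ordinary word would be written to the output
            }
        }
        System.out.println("GoodWord=" + counters.value(ReportTest.GoodWord)); // GoodWord=2
        System.out.println("ErroWord=" + counters.value(ReportTest.ErroWord)); // ErroWord=1
        System.out.println("emitted=" + emitted);                              // emitted=2
    }
}
```

In a real job the framework aggregates these values across all tasks, which this single-process sketch does not attempt to model.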

Example

ReportTest

public enum ReportTest {                // the enum type name is the counter group
    ErroWord, GoodWord, ReduceReport    // the counters to record
}

Mapper class

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class TxtMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Split each input line into words on single spaces
        String[] words = value.toString().split(" ");

        for (String word : words) {
            if (word.equals("GoodWord")) {
                // Count the occurrence instead of emitting the word
                context.setStatus("GoodWord is coming");
                context.getCounter(ReportTest.GoodWord).increment(1);
            } else if (word.equals("ErroWord")) {
                context.setStatus("BadWord is coming!");
                context.getCounter(ReportTest.ErroWord).increment(1);
            } else {
                // Every other word is emitted as (word, 1)
                context.write(new Text(word), new IntWritable(1));
            }

        }
    }
}

Reducer class

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.Iterator;

public class TxtReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;

        // Sum all counts for this key
        Iterator<IntWritable> it = values.iterator();
        while (it.hasNext()) {
            IntWritable value = it.next();
            sum += value.get();
        }
        // Count how many times the key "hello" reaches a reducer
        if (key.toString().equals("hello")) {
            context.setStatus("BadKey is coming!");
            context.getCounter(ReportTest.ReduceReport).increment(1);
        }
        context.write(key, new IntWritable(sum));
    }
}

ToolRunnerJS

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToolRunnerJS extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic Hadoop options and passes the Configuration to run()
        System.exit(ToolRunner.run(new Configuration(), new ToolRunnerJS(), args));
    }

    @Override
    public int run(String[] args) throws Exception {
        // Configuration describes the MapReduce job to the Hadoop framework
        Configuration conf = getConf();

        Job job = Job.getInstance(conf);
        job.setJarByClass(ToolRunnerJS.class);
        job.setJobName("jishu");               // set the job name

        job.setOutputKeyClass(Text.class); // key type of the job output
        job.setOutputValueClass(IntWritable.class); // value type of the job output

        job.setMapperClass(TxtMapper.class);
        job.setReducerClass(TxtReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path("/input/counter/*")); // input path of the job
        FileOutputFormat.setOutputPath(job, new Path("/output/counter_result")); // output path of the job

        boolean success = job.waitForCompletion(true);
        Counters counters = job.getCounters();
        System.out.println("Counter getGroupNames:" + counters.getGroupNames());
        // Print the final value of each custom counter
        for (ReportTest counter : ReportTest.values()) {
            System.out.println(counter + "=" + counters.findCounter(counter).getValue());
        }
        return success ? 0 : 1;
    }
}

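Viewing the counters

The counters can also be viewed in the web UI, but the JobHistory Server has to be configured first (in mapred-site.xml):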

<property>
    <name>mapreduce.jobhistory.address</name>
    <value>datanode1:10020</value>
    <description>MapReduce JobHistory Server IPC host:port</description>
</property>

<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>datanode1:19888</value>
    <description>MapReduce JobHistory Server Web UI host:port</description>
</property>

Start the service

mr-jobhistory-daemon.sh start historyserver

View the counters in the web UI

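Min/max values

The maximum, minimum, mean, standard deviation, mode, and median are classic numerical statistics and are among the most commonly computed summary fields. Asking for the 10 largest or the 10 smallest values leads to the Top N/Bottom N problem.

Single extreme value

Computing these statistics in MapReduce is slightly more involved than in a traditional single-process program, because one large job is split into many Mapper tasks whose results are merged by the Reducer at the end. The MapReduce framework partitions, groups, and sorts by key, so once a suitable key is chosen the whole problem becomes simple. Using a Combiner reduces the number of key-value pairs shuffled to the Reduce side, which lowers network and I/O load.

Finding the maximum and minimum

Data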

2017-10 300
2017-10 100
2017-10 200
2017-11 320
2017-11 200
2017-11 280
2017-12 290
2017-12 270
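
As a reference for what the job should produce, the per-month minimum and maximum of this sample can be computed directly in plain Java. This is only a single-process check, not the MapReduce implementation; the class name MinMaxExpected and the method minMaxByKey are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process check of the expected min/max per month.
class MinMaxExpected {

    // Returns key -> {min, max} for lines of the form "yyyy-MM value".
    static Map<String, int[]> minMaxByKey(String[] lines) {
        Map<String, int[]> result = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            int v = Integer.parseInt(parts[1]);
            // Fold the new value into the running {min, max} pair for this key.
            result.merge(parts[0], new int[] {v, v},
                    (a, b) -> new int[] {Math.min(a[0], b[0]), Math.max(a[1], b[1])});
        }
        return result;
    }

    public static void main(String[] args) {
        String[] data = {
            "2017-10 300", "2017-10 100", "2017-10 200",
            "2017-11 320", "2017-11 200", "2017-11 280",
            "2017-12 290", "2017-12 270"
        };
        // prints (tab-separated):
        // 2017-10  100  300
        // 2017-11  200  320
        // 2017-12  270  290
        for (Map.Entry<String, int[]> e : minMaxByKey(data).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue()[0] + "\t" + e.getValue()[1]);
        }
    }
}
```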

MinMaxWritable

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class MinMaxWritable implements Writable {
    private int min; // the minimum value
    private int max; // the maximum value

    public int getMin() {
        return min;
    }

    public void setMin(int min) {
        this.min = min;
    }

    public int getMax() {
        return max;
    }

    public void setMax(int max) {
        this.max = max;
    }

    @Override
    public String toString() {
        return min + "\t" + max;
    }

    // Serialize the fields; the order must match readFields()
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(min);
        out.writeInt(max);
    }

    // Deserialize the fields in the same order as write()
    @Override
    public void readFields(DataInput in) throws IOException {
        min = in.readInt();
        max = in.readInt();
    }
}
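
A custom Writable only survives the shuffle intact if write() and readFields() handle the fields in the same order; otherwise min and max are silently swapped after deserialization. The sketch below checks this with a round trip through java.io streams. It is plain Java with no Hadoop dependency: MinMaxRoundTrip and its nested Pair class are hypothetical stand-ins that follow the same write/readFields contract.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in that follows the Writable write/readFields contract.
class MinMaxRoundTrip {

    static final class Pair {
        int min;
        int max;

        void write(DataOutput out) throws IOException {
            out.writeInt(min); // first field written
            out.writeInt(max); // second field written
        }

        void readFields(DataInput in) throws IOException {
            min = in.readInt(); // first field read: must be the first one written
            max = in.readInt();
        }
    }

    // Serializes a pair to bytes and reads it back into a fresh object.
    static Pair roundTrip(int min, int max) {
        try {
            Pair original = new Pair();
            original.min = min;
            original.max = max;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));

            Pair copy = new Pair();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Pair copy = roundTrip(100, 300);
        System.out.println(copy.min + "\t" + copy.max); // prints 100, a tab, then 300
    }
}
```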

MinMaxMapper

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MinMaxMapper extends Mapper<Object, Text, Text, MinMaxWritable> {
    private MinMaxWritable outTuple = new MinMaxWritable();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {

        String[] words = value.toString().trim().split("\\s+");
        if (words.length < 2) {
            return;  // skip blank or malformed lines
        }
        String date = words[0]; // the date field (for example 2017-10) is the output key
        int amount = Integer.parseInt(words[1]);
        // A single record is both the minimum and the maximum of itself
        outTuple.setMin(amount);
        outTuple.setMax(amount);
        context.write(new Text(date), outTuple);  // emit (date, min/max tuple)
    }
}

MinMaxReducer

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MinMaxReducer extends Reducer<Text, MinMaxWritable, Text, MinMaxWritable> {
    private MinMaxWritable result = new MinMaxWritable();

    @Override
    protected void reduce(Text key, Iterable<MinMaxWritable> values, Context context) throws IOException, InterruptedException {
        // Start from the extremes so that zero and negative inputs are handled correctly
        result.setMin(Integer.MAX_VALUE);
        result.setMax(Integer.MIN_VALUE);
        // Iterate over all values that share this key
        for (MinMaxWritable value : values) {
            // Keep the smallest minimum seen so far
            if (value.getMin() < result.getMin()) {
                result.setMin(value.getMin());
            }
            // Keep the largest maximum seen so far
            if (value.getMax() > result.getMax()) {
                result.setMax(value.getMax());
            }
        }
        context.write(key, result);
    }
}

MinMaxJob

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MinMaxJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Strip the generic Hadoop options; what remains are the input and output paths
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: MinMaxJob <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf);
        job.setJarByClass(MinMaxJob.class);
        job.setMapperClass(MinMaxMapper.class);
        // Enable the Combiner to reduce the amount of data sent over the network
        job.setCombinerClass(MinMaxReducer.class);
        job.setReducerClass(MinMaxReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(MinMaxWritable.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

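How the computation works

In a MapReduce computation there are usually far more Mapper tasks than Reducer tasks, so letting a small number of Reducers process all of the map output is unwise. Introducing a Combiner in the shuffle phase, with the Reducer class reused as its implementation, greatly reduces the amount of data that reaches the Reducer side and makes the whole computation more efficient. The Reducer can be reused this way because taking a minimum or maximum gives the same result whether it is applied once to all values or first to partial groups and then to the partial results.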
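
The effect can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not Hadoop code: the class CombinerEffectSketch, its Rec record type, and the way the eight sample lines are divided into two input splits are all hypothetical. The same reduce function plays both the combiner and the reducer role, and the sketch compares how many records reach the reducer in each case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of a min/max job with and without a combiner.
class CombinerEffectSketch {

    static final class Rec {
        final String key;
        final int min;
        final int max;

        Rec(String key, int min, int max) {
            this.key = key;
            this.min = min;
            this.max = max;
        }

        @Override
        public String toString() {
            return key + "\t" + min + "\t" + max;
        }
    }

    // Map step: each line "key value" becomes (key, min=value, max=value).
    static List<Rec> map(String[] lines) {
        List<Rec> out = new ArrayList<>();
        for (String line : lines) {
            String[] p = line.trim().split("\\s+");
            int v = Integer.parseInt(p[1]);
            out.add(new Rec(p[0], v, v));
        }
        return out;
    }

    // Reduce step, also used as the combiner: one folded record per key.
    static List<Rec> reduce(List<Rec> in) {
        Map<String, Rec> byKey = new TreeMap<>();
        for (Rec r : in) {
            byKey.merge(r.key, r, (a, b) ->
                    new Rec(a.key, Math.min(a.min, b.min), Math.max(a.max, b.max)));
        }
        return new ArrayList<>(byKey.values());
    }

    // Returns the records that would be sent to the reducer.
    static List<Rec> shuffle(String[][] splits, boolean useCombiner) {
        List<Rec> shuffled = new ArrayList<>();
        for (String[] split : splits) {
            List<Rec> mapOutput = map(split);
            // With a combiner, each map task folds its own output first.
            shuffled.addAll(useCombiner ? reduce(mapOutput) : mapOutput);
        }
        return shuffled;
    }

    public static void main(String[] args) {
        // Assumption: the sample data happens to be divided into these two splits.
        String[][] splits = {
            {"2017-10 300", "2017-10 100", "2017-10 200", "2017-11 320"},
            {"2017-11 200", "2017-11 280", "2017-12 290", "2017-12 270"}
        };
        List<Rec> plain = shuffle(splits, false);
        List<Rec> combined = shuffle(splits, true);
        System.out.println("without combiner: " + plain.size() + " records"); // 8 records
        System.out.println("with combiner: " + combined.size() + " records"); // 4 records
        // Both paths yield the same final result.
        System.out.println(reduce(plain).toString().equals(reduce(combined).toString())); // true
    }
}
```

The saving grows with the number of records per key in each split; when every key occurs only once per split, the combiner cannot reduce anything.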
