经过上篇文章的源码分析，我们已经大致知道了客户端提交任务都做了什么，
本篇是来继续分析，Map Task，看完Map Task的核心源码，我相信会对以后的学习产生很大的帮助

我这里工具是IDEA，Hadoop是2.7.2,
首先，先来分析简单的Map的输入Input过程,后面再来分析OutPut过程

Map Input核心源码分析

前面可以知道，在集群环境运行，我们的客户端会联系ResourceManager,ResourceManager会帮我们创建一个ApplicationMaster，我们的ApplicationMaster会向ResourceManager申请Container,Container和RM通信，可能会生成多个Container,Container是一个JVM进程，在里面，会反射生成MapTask,和ReduceTask这样的类变成对象，接下来我们分析map task是怎么工作的，一些资源调度的这里不关注

位置在这里 org/apache/hadoop/mapreduce/Mapper.java
在这里插入图片描述

然后，我们往下看看，会看到一个run方法

	//run方法，我们的Mapper方法运行的话，一定会调用这个run方法，这里就是入口
public void run(Context context) throws IOException, InterruptedException {
    //运行前把我们的Context上下文放到里面，Context会包含我们的输入和输出
	setup(context);
    try {
    //循环
      while (context.nextKeyValue()) {
      //我们之间进去map方法
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }

进去我们的Map Task 看见了一个run方法，这里的run方法实际是真正的执行者，是被上面调用的


public void run(final JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, ClassNotFoundException, InterruptedException {
    this.umbilical = umbilical;

    if (isMapTask()) {
      // If there are no reducers then there won't be any sort. Hence the map 
      // phase will govern the entire attempt's progress.
      
      //	从配置信息中取，如果我们的reduce数量设置的是0的话
      if (conf.getNumReduceTasks() == 0) {
      // 只有map处理
        mapPhase = getProgress().addPhase("map", 1.0f);
      } else {
        // If there are reducers then the entire attempt's progress will be 
        // split between the map phase (67%) and the sort phase (33%).
        
        // 如果有1个reduce或者多个reduce
        //map处理
        mapPhase = getProgress().addPhase("map", 0.667f);
        //加入了sort排序
        sortPhase  = getProgress().addPhase("sort", 0.333f);
      }
    }
    TaskReporter reporter = startReporter(umbilical);
 	// 下面一些调度的事情先不看
    boolean useNewApi = job.getUseNewMapper();
    initialize(job, getJobID(), reporter, useNewApi);

    // check if it is a cleanupJobTask
    if (jobCleanup) {
      runJobCleanupTask(umbilical, reporter);
      return;
    }
    if (jobSetup) {
      runJobSetupTask(umbilical, reporter);
      return;
    }
    if (taskCleanup) {
      runTaskCleanupTask(umbilical, reporter);
      return;
    }
	//是否使用新的API
    if (useNewApi) {
     //我们的hadoop版本是2.x，所以肯定是走这里
      runNewMapper(job, splitMetaInfo, umbilical, reporter);
    } else {
      runOldMapper(job, splitMetaInfo, umbilical, reporter);
    }
    done(umbilical, reporter);
  }

我们现在到runNewMapper方法里面，看看做了什么


 private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewMapper(final JobConf job,
                    final TaskSplitIndex splitIndex,
                    final TaskUmbilicalProtocol umbilical,
                    TaskReporter reporter
                    ) throws IOException, ClassNotFoundException,
                             InterruptedException {
    // make a task context so we can get the classes
    // 我们先不看这些东西，找到try Catch
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, 
                                                                  getTaskID(),
                                                                  reporter);
    // make a mapper
    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
      (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
        ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
    // make the input format
    org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
    // rebuild the input split
    org.apache.hadoop.mapreduce.InputSplit split = null;
    split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
        splitIndex.getStartOffset());
    LOG.info("Processing split: " + split);

    org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
      new NewTrackingRecordReader<INKEY,INVALUE>
        (split, inputFormat, reporter, taskContext);
    
    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    org.apache.hadoop.mapreduce.RecordWriter output = null;
    
    // get an output object
    if (job.getNumReduceTasks() == 0) {
      output = 
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }

    org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE> 
    mapContext = 
      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(), 
          input, output, 
          committer, 
          reporter, split);

    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context 
        mapperContext = 
          new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
              mapContext);

    try {
     //在这里，输入初始化 
      input.initialize(split, mapperContext);
      // 输入初始化完了，调用我们mapper的run方法，
      // 我们点进去这个run，会发现，这个run方法就是前面我们第一步的mapper的run方法
      // map方法开始处理
      mapper.run(mapperContext);
      // map处理完了，map阶段结束
      mapPhase.complete();
      setPhase(TaskStatus.Phase.SORT);
      statusUpdate(umbilical);
      //输入关闭
      input.close();
      input = null;
      //输出关闭，close方法里面其实给我们做了刷新
      output.close(mapperContext);
      output = null;
      		//到这里我们的map就做完了，宏观来讲
      		// 1. 准备输入
      		// 2. 调用map方法
      		// 3 .输入
    } finally {
      closeQuietly(input);
      closeQuietly(output, mapperContext);
    }
  }

现在我们往上看一下具体的细节

private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewMapper(final JobConf job,
                    final TaskSplitIndex splitIndex,
                    final TaskUmbilicalProtocol umbilical,
                    TaskReporter reporter
                    ) throws IOException, ClassNotFoundException,
                             InterruptedException {
    // make a task context so we can get the classes
    // 	见名知意，任务的上下文，里面最主要的是包含了我们这个job
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, 
                                                                  getTaskID(),
                                                                  reporter);
    // make a mapper
    // 创建mapper，通过反射的方式创建， taskContext里面包含我们的job，
    // 我们的job包含我们的config配置， 现在我们是分布式，map肯定在客户端外的一台节点
    // 客户端把配置信息，切片信息，jar包等提交到hdfs，我们这个map task运行的时候，会把我们的
    // 配置信息，切片信息，jar包等下载到本地，这个时候，这个job的配置信息和我们客户端书写的配置
    // 信息是一样的，所以，我们客户端的job和异地的其他机器当中那些所有mao和reduce中的job是一个
    // 因为他们会来自同一个配置文件
    // 这里隐形的是取我们配置信息的mapperclass，我们进去看一下，如果用户配置了，就取用户的。
    // 我们mr开发，肯定会重写mapper，
    // 所以这里的意思就是根据我们用户在配置文件中指定的mapper类，反射出具体的mapper类 

    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
      (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
        ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
    // make the input format
	// 使用输入格式化，，客户端也用到了输入格式化，做了计算切片清单，
	// 我们看看输入格式化类在map做了哪些事情 
	// 通过反射获取输入格式化类，getInputFormatClass方法，里面是，如果配置了就取用户的
	// 如果用户没有配置，就返回默认的TextInputFormat
    org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
    // rebuild the input split
    // 把切片清单重构成一个对象，因为在客户端把我们的切片信息序列化了，所以在map环节，要反序列化
    // 然后从里面找到自己的切片，把它变成一个对象，这个对象引用的是自己的切片对象
    org.apache.hadoop.mapreduce.InputSplit split = null;
    split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
        splitIndex.getStartOffset());
    LOG.info("Processing split: " + split);

    org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
    //创建一个记录读取器 ，切片和输入格式化传在里面，我们进到构造方法里面看看,在下面 
      new NewTrackingRecordReader<INKEY,INVALUE>
        (split, inputFormat, reporter, taskContext);
    
    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    org.apache.hadoop.mapreduce.RecordWriter output = null;
    
    // get an output object
    if (job.getNumReduceTasks() == 0) {
      output = 
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }

    org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE> 
    mapContext = 
      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(), 
          input, output, 
          committer, 
          reporter, split);

    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context 
        mapperContext = 
          new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
              mapContext);


			......
	}
}

这里是上面要看的NewTrackingRecordReader构造方法


NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
        org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
        TaskReporter reporter,
        org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
        throws InterruptedException, IOException {
      this.reporter = reporter;
      this.inputRecordCounter = reporter
          .getCounter(TaskCounter.MAP_INPUT_RECORDS);
      this.fileInputByteCounter = reporter
          .getCounter(FileInputFormatCounter.BYTES_READ);

      List <Statistics> matchedStats = null;
      if (split instanceof org.apache.hadoop.mapreduce.lib.input.FileSplit) {
        matchedStats = getFsStatistics(((org.apache.hadoop.mapreduce.lib.input.FileSplit) split)
            .getPath(), taskContext.getConfiguration());
      }
      fsStats = matchedStats;

      long bytesInPrev = getInputBytes(fsStats);
      //自己的记录读取器，等于我们传进去的输入格式化给我们创建的记录读取器，
      // 在map阶段，输入格式化给我们输入的对象创建了一个记录读取器，我们进去看一下这个方法
      this.real = inputFormat.createRecordReader(split, taskContext);
      long bytesInCurr = getInputBytes(fsStats);
      fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
    }

我们看一看输入格式化类是怎么给我们创建记录读取器的，因为我们的输入格式化类是上面反射生成的默认TextInputFormat，所以我们就进到这个方法的实现里面，这一步只是构造

 public RecordReader<LongWritable, Text> 
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
             //从上下文中获取配置
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
      //返回了一个行为单位的记录读取器，现在我们知道了，在map阶段，默认输入格式化类给我们new了一个
      // 行记录读取器，我们现在看一下是怎么读hdfs的数据
    return new LineRecordReader(recordDelimi，terBytes);
  }

现在我们返回上一步

NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
        org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
        TaskReporter reporter,
        org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
        throws InterruptedException, IOException {
      this.reporter = reporter;
      this.inputRecordCounter = reporter
          .getCounter(TaskCounter.MAP_INPUT_RECORDS);
      this.fileInputByteCounter = reporter
          .getCounter(FileInputFormatCounter.BYTES_READ);

      List <Statistics> matchedStats = null;
      if (split instanceof org.apache.hadoop.mapreduce.lib.input.FileSplit) {
        matchedStats = getFsStatistics(((org.apache.hadoop.mapreduce.lib.input.FileSplit) split)
            .getPath(), taskContext.getConfiguration());
      }
      fsStats = matchedStats;

      long bytesInPrev = getInputBytes(fsStats);
      //自己的记录读取器，等于我们传进去的输入格式化给我们创建的记录读取器，
      // 在map阶段，输入格式化给我们输入的对象创建了一个记录读取器
     //	我们构造出来的LineRecordReader被这个对象的real引用了	那么肯定会用到这个LineRecordReader对象
     // 我们查看一下方法

      this.real = inputFormat.createRecordReader(split, taskContext);
      long bytesInCurr = getInputBytes(fsStats);
      fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
    }

我们发现有几个认识的，nextKeyValue，.getCurrentKey，getCurrentValue，这些在我们一开始的mapper的run方法调用过这几个方法，可以去开头查看一下
在这里插入图片描述
我们先进到nextKeyValue看一下

  public boolean nextKeyValue() throws IOException, InterruptedException {
      long bytesInPrev = getInputBytes(fsStats);
      // 这里用到了我们的real，就是LineRecordReader
      // 然后调用的就是LineRecordReader的nextKeyValue来干活的
      boolean result = real.nextKeyValue();
      long bytesInCurr = getInputBytes(fsStats);
      if (result) {
        inputRecordCounter.increment(1);
      }
      fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
      reporter.setProgress(getProgress());
      return result;
    }

我们往前退，因为这里只是一个构造的环节

private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewMapper(final JobConf job,
                    final TaskSplitIndex splitIndex,
                    final TaskUmbilicalProtocol umbilical,
                    TaskReporter reporter
                    ) throws IOException, ClassNotFoundException,
                             InterruptedException {
    // make a task context so we can get the classes
    // 	见名知意，任务的上下文，里面最主要的是包含了我们这个job
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, 
                                                                  getTaskID(),
                                                                  reporter);
    // make a mapper
    // 创建mapper，通过反射的方式创建， taskContext里面包含我们的job，
    // 我们的job包含我们的config配置， 现在我们是分布式，map肯定在客户端外的一台节点
    // 客户端把配置信息，切片信息，jar包等提交到hdfs，我们这个map task运行的时候，会把我们的
    // 配置信息，切片信息，jar包等下载到本地，这个时候，这个job的配置信息和我们客户端书写的配置
    // 信息是一样的，所以，我们客户端的job和异地的其他机器当中那些所有mao和reduce中的job是一个
    // 因为他们会来自同一个配置文件
    // 这里隐形的是取我们配置信息的mapperclass，我们进去看一下，如果用户配置了，就取用户的。
    // 我们mr开发，肯定会重写mapper，
    // 所以这里的意思就是根据我们用户在配置文件中指定的mapper类，反射出具体的mapper类 

    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
      (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
        ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
    // make the input format
	// 使用输入格式化，，客户端也用到了输入格式化，做了计算切片清单，
	// 我们看看输入格式化类在map做了哪些事情 
	// 通过反射获取输入格式化类，getInputFormatClass方法，里面是，如果配置了就取用户的
	// 如果用户没有配置，就返回默认的TextInputFormat
    org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
    // rebuild the input split
    // 把切片清单重构成一个对象，因为在客户端把我们的切片信息序列化了，所以在map环节，要反序列化
    // 然后从里面找到自己的切片，把它变成一个对象，这个对象引用的是自己的切片对象
    org.apache.hadoop.mapreduce.InputSplit split = null;
    split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
        splitIndex.getStartOffset());
    LOG.info("Processing split: " + split);

    org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
    //创建一个记录读取器 ，切片和输入格式化传在里面，刚才看的是这个类的构造方法，现在还没有运行
    // 我们往下看
      new NewTrackingRecordReader<INKEY,INVALUE>
        (split, inputFormat, reporter, taskContext);
    
    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    // 这里是输出的，现在先不看，，等输入分析完后再来分析输出
    org.apache.hadoop.mapreduce.RecordWriter output = null;
    
    // get an output object
    if (job.getNumReduceTasks() == 0) {
      output = 
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }

    org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE> 
    mapContext = 
    // 创建map任务自己的上下文，把job，任务的id，输入，输出，切片传进去了，我们进去看一看构造
      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(), 
          input, output, 
          committer, 
          reporter, split);

    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context 
        mapperContext = 
          new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
              mapContext);


			......
	}
}

 public MapContextImpl(Configuration conf, TaskAttemptID taskid,
 						//我们的input到这里就变成了reader，赋给了我们Contex上下文的reader了
                        RecordReader<KEYIN,VALUEIN> reader,
                        RecordWriter<KEYOUT,VALUEOUT> writer,
                        OutputCommitter committer,
                        StatusReporter reporter,
                        InputSplit split) {
    super(conf, taskid, writer, committer, reporter);
    // 这个reader也包装了一些方法，在下面
    this.reader = reader;
    this.split = split;
  }

在这里插入图片描述
我们进去nextKeyValue看一下

@Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
  	//里面就是调用自己的，我们的这个reader是哪个，画张图，很简单
    return reader.nextKeyValue();
  }

在这里插入图片描述
这些我们懂了之后，退回上一步，我们来分析这里


private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewMapper(final JobConf job,
                    final TaskSplitIndex splitIndex,
                    final TaskUmbilicalProtocol umbilical,
                    TaskReporter reporter
                    ) throws IOException, ClassNotFoundException,
                             InterruptedException {

 ...... 

					try {
	//这里调用的是NewTrackingRecordReader的initialize方法，
	//NewTrackingRecordReader调用的是LineRecordReader的initialize方法，
	//这里就不看NewTrackingRecordReader，我们直接来看LineRecordReader的initialize方法
						//对输入格式化做初始化
						//map输入的环节，最重要的就是输入初始化的环节，
			//因为这个解释了分布式运行的时候，每一个并行的，分开单独运行的程序它们是怎么和数据交互的
					      input.initialize(split, mapperContext);
					      mapper.run(mapperContext);
					      mapPhase.complete();
					      setPhase(TaskStatus.Phase.SORT);
					      statusUpdate(umbilical);
					      input.close();
					      input = null;
					      output.close(mapperContext);
					      output = null;
					    } finally {
					      closeQuietly(input);
					      closeQuietly(output, mapperContext);
					    }
					  }
		
		}

接下来我们进入LineRecordReader的initialize方法，最终干活的也就是LineRecordReader这个类


public void initialize(InputSplit genericSplit,
                         TaskAttemptContext context) throws IOException {
      //切片
    FileSplit split = (FileSplit) genericSplit;
    //获取配置
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
    // 从切片当中得到起始偏移量
    start = split.getStart();
    // 这个切片的结束偏移量  （起始偏移量加上长度）
    end = start + split.getLength();
    //从切片中得到了路径，也就是哪个文件
    final Path file = split.getPath();

    // open the file and seek to the start of the split
    
    // 通过job得到分布式文件对象
    final FileSystem fs = file.getFileSystem(job);
    //拿到分布式文件对象后打开这个文件，，，这个是hdfs的api，打开了一个输入流FSDataInputStream
    fileIn = fs.open(file);
    
    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    if (null!=codec) {
      isCompressedInput = true;	
      decompressor = CodecPool.getDecompressor(codec);
      if (codec instanceof SplittableCompressionCodec) {
        final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec)codec).createInputStream(
            fileIn, decompressor, start, end,
            SplittableCompressionCodec.READ_MODE.BYBLOCK);
        in = new CompressedSplitLineReader(cIn, job,
            this.recordDelimiterBytes);
        start = cIn.getAdjustedStart();
        end = cIn.getAdjustedEnd();
        filePosition = cIn;
      } else {
        in = new SplitLineReader(codec.createInputStream(fileIn,
            decompressor), job, this.recordDelimiterBytes);
        filePosition = fileIn;
      }
    } else {
      // 如果读取的不是第一个切片，就seek到自己切片的起始偏移量，这样每个切片读就不会读别的数据
      fileIn.seek(start);
      //把fileIn包装到in，这个in就是真正可以读取文件的流
      in = new UncompressedSplitLineReader(
          fileIn, job, this.recordDelimiterBytes, split.getLength());
      filePosition = fileIn;
    }
    // If this is not the first split, we always throw away first record
    // because we always (except the last split) read one extra line in
    // next() method.
    
    // 只有除了第一个切片，后续所有切片对应的map，他们都能执行到if里
    if (start != 0) {
    		// //readLine读一行，有很多重载方式，这里的是读一行
    	 //  返回读到的字节然后赋值到承载的对象里，这个对象就能用了，这个承载的对象 
      //在java中是一个匿名对象，MapContext无法获得，这一行数据就作废了，
	//这里又+=,意思是起始的偏移量往后一行
	
  //从第二个切片开始，他们都不处理第一行，也可以说从第一个切片到倒数最后一个切片，他们都会多处理一行
     // 偏移量被修改为第二行的偏移量
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
    //赋给了一个变量，pos
    this.pos = start;
  }

整个看这个方法是输入的初始化环节，初始化的时候会将这个切片的位置调整到第二行，因为hdfs的时候，一个文件上传被切块儿，它是绝对按字节数切的，这个时候不会考虑编码的问题，单词等等的问题。。。所以框架为了不破坏数据的完整性，每一个切片工作之前在初始化的环节，将切片这个位置的偏移量下移一行，这样第二个切片就不会处理第一行，这个时候这个切片就有一行没有被处理，这一行应该由上一个切片来处理，所以一个计算程序，最终是计算自己这个块的所有数据和下一个块起始的那一行数据。一个map程序，计算自己的块，肯定能达到本地化，但是下一行数据，就未必能本地化，因为如果自己的节点刚好存有那个块的副本就可以本地化，如果没有存，那么就只能通过网络拉取了，不过map是并行的，每个都只是拉取小量的数据，也不大，可以接受。

到这里，我们的初始化也就完成了，总结一下，在MapTast启动的时候，它的输入初始化操作，准备了一个对hdfs这个切片代表这个文件的流，也就是开启一个hdfs输入流，然后将偏移量seek到自己切片的位置，从第二个切片对应的map开始，它们都各自把自己的第一行读一下，算出字节数，加到自己切片上，把自己切片的偏移量向下移动了一行。因为hdfs在切割块的时候，有可能切割在上下两个块的尾和头，那么这个时候要将下一个块的头还给上一个块处理。这些操作就是在规避数据不完整的问题

当初始化完成之后，我们往前退

private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewMapper(final JobConf job,
                    final TaskSplitIndex splitIndex,
                    final TaskUmbilicalProtocol umbilical,
                    TaskReporter reporter
                    ) throws IOException, ClassNotFoundException,
                             InterruptedException {

 ...... 

					try {
	//这里调用的是NewTrackingRecordReader的initialize方法，
	//NewTrackingRecordReader调用的是LineRecordReader的initialize方法，
	//这里就不看NewTrackingRecordReader，我们直接来看LineRecordReader的initialize方法
						//对输入格式化做初始化
						//map输入的环节，最重要的就是输入初始化的环节，
			//因为这个解释了分布式运行的时候，每一个并行的，分开单独运行的程序它们是怎么和数据交互的
					      input.initialize(split, mapperContext);
	 	//调用run方法，这个方法前面有，这里直接看最终工作的LineRecordReader的nextKeyValue方法
					      mapper.run(mapperContext);
					      mapPhase.complete();
					      setPhase(TaskStatus.Phase.SORT);
					      statusUpdate(umbilical);
					      input.close();
					      input = null;
					      output.close(mapperContext);
					      output = null;
					    } finally {
					      closeQuietly(input);
					      closeQuietly(output, mapperContext);
					    }
					  }
		
		}

我们进入LineRecordReader的nextKeyValue方法


public boolean nextKeyValue() throws IOException {
    if (key == null) {
    	//如果key为空，则定义LongWritable类型
      key = new LongWritable();
    }
    //赋值pos，pos就是我们上面算的偏移量
    // key里面放的就是行的偏移量
    key.set(pos);
    if (value == null) {
    //value默认是我们的Text类型
      value = new Text();
    }
    int newSize = 0;
    // We always read one extra line, which lies outside the upper
    // split limit i.e. (end - 1)
    while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
      if (pos == 0) {
        newSize = skipUtfByteOrderMark();
      } else {
      //初始化完了之后在读，读的就是原有切片的第二行，把第二行赋给value
        newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
        // 更新pos，下次pos就是第三行
        pos += newSize;
      }

      if ((newSize == 0) || (newSize < maxLineLength)) {
        break;
      }

      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " + 
               (pos - newSize));
    }
    if (newSize == 0) {
      key = null;
      value = null;
      return false;
    } else {
      return true;
    }
  }

map运行调用nextKeyValue，最核心的就是为key和value赋值，接下来map的run方法还会调用两个方法，getCurrentKey，和getCurrentValue，这两个就是把nextKeyValue赋的值给获取了一下，这样map就能直接计算了。这个就是输入环节，一行一行输入就行了，，，，到这里，我们的map输入环节就以及分析完成！！


	//run方法，我们的Mapper方法运行的话，一定会调用这个run方法，这里就是入口
public void run(Context context) throws IOException, InterruptedException {
    //运行前把我们的Context上下文放到里面，Context会包含我们的输入和输出
	setup(context);
    try {
    //循环
      while (context.nextKeyValue()) {
      //我们之间进去map方法
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }

public LongWritable getCurrentKey() {
    return key;
  }

   public Text getCurrentValue() {
    return value;
  }

慢慢变成大佬

发布了3 篇原创文章 · 获赞 4 · 访问量 1778

私信关注

MapReduce Map Task Input源码分析（小白只要认真看，绝对都能绝对可以自己独立走一遍源码，并对MapReduce Map的Input有底层实现的了解）

Map Input核心源码分析

猜你喜欢

MapReduce Map Task Input源码分析 （小白只要认真看，绝对都能绝对可以自己独立走一遍源码，并对MapReduce Map的Input有底层实现的了解）

Map Input核心源码分析

猜你喜欢

MapReduce Map Task Input源码分析（小白只要认真看，绝对都能绝对可以自己独立走一遍源码，并对MapReduce Map的Input有底层实现的了解）