InputFormat in MapReduce

1. Source Code

package org.apache.hadoop.mapreduce;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;


InputFormat describes the input-specification for a Map-Reduce job. 

The Map-Reduce framework relies on the InputFormat of the job to:
1.Validate the input-specification of the job. 
2.Split-up the input file(s) into logical InputSplits, each of which is then assigned to an 
individual  Mapper.
3.Provide the RecordReader implementation to be used to glean input records from the logical 
InputSplit for processing by the Mapper.

The default behavior of file-based InputFormats, typically sub-classes of FileInputFormat, 
is to split the input into logical InputSplits based on the total size, in bytes, of the input 
files. However, the FileSystem blocksize of the input files is treated as an upper bound for 
input splits. A lower bound on the split size can be set via 
mapreduce.input.fileinputformat.split.minsize.

Clearly, logical splits based on input size are insufficient for many 
applications since record boundaries are to be respected. In such cases, the
application also has to implement a RecordReader, which is responsible for
respecting record boundaries and presenting a record-oriented
view of the logical InputSplit to the individual task.

@see InputSplit
@see RecordReader
@see FileInputFormat

@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class InputFormat<K, V> {
...
}
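
To illustrate the split-size bounds mentioned in the Javadoc, here is a small sketch, assuming a driver that already holds a Job instance named job (the byte values are arbitrary examples, not recommendations):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Lower bound on split size: no logical split smaller than 1 MB.
// Equivalent to setting mapreduce.input.fileinputformat.split.minsize.
FileInputFormat.setMinInputSplitSize(job, 1L * 1024 * 1024);

// Upper bound on split size: no logical split larger than 64 MB.
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

// The same lower bound, set directly via the configuration key quoted above.
job.getConfiguration().setLong(
    "mapreduce.input.fileinputformat.split.minsize", 1L * 1024 * 1024);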

2. Overview

InputFormat defines the input specification for the map tasks of a MapReduce job.

A job running on the Map-Reduce framework relies on its InputFormat for the following:

  • 1. Validate the input specification of the job.
  • 2. Split the (physical) input files into logical splits, each of which is then assigned to an individual Mapper.
  • 3. Provide the RecordReader implementation used to extract input records from a logical split for processing by the Mapper.

InputFormat has two abstract methods, getSplits and createRecordReader; a minimal concrete implementation of both is sketched after the two items below.

  • getSplits
Logically split the set of input files for the job.

Each InputSplit is then assigned to an individual  Mapper for processing
# Each InputSplit is assigned to its own Mapper for processing. [How exactly is the assignment done?]

Note: The split is a logical split of the inputs and the input files are not physically
split into chunks.
For example, a split could be a <input-file-path, start, offset> tuple. The InputFormat also
creates the RecordReader to read the InputSplit.
# [The RecordReader mentioned here is created by the second abstract method, described below.]

@param context job configuration.
@return an array of InputSplits for the job.

  public abstract 
    List<InputSplit> getSplits(JobContext context
                               ) throws IOException, InterruptedException;
  • createRecordReader
Create a record reader for a given split. The framework will call 
RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.

@param split the split to be read
@param context the information about the task
@return a new record reader
@throws IOException
@throws InterruptedException
  public abstract 
    RecordReader<K,V> createRecordReader(InputSplit split,
                                         TaskAttemptContext context
                                        ) throws IOException, 
                                                 InterruptedException;

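To see both methods in one concrete class, here is a minimal, hypothetical file-based implementation (the class name NonSplittableTextInputFormat is invented for this example). It leans on FileInputFormat for getSplits and only supplies the split policy and the RecordReader:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical example: a text format whose files are never split,
// so each Mapper always processes a whole file.
public class NonSplittableTextInputFormat
    extends FileInputFormat<LongWritable, Text> {

  // getSplits() is inherited from FileInputFormat. Because isSplitable()
  // returns false, it emits exactly one InputSplit per input file — a
  // FileSplit, i.e. the <input-file-path, start, offset>-style tuple
  // mentioned in the Javadoc.
  @Override
  protected boolean isSplitable(JobContext context, Path filename) {
    return false;
  }

  // Provide the RecordReader that turns the byte-oriented split into
  // line-oriented records for the Mapper.
  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    // LineRecordReader ships with Hadoop; the framework will call
    // initialize(split, context) on it before the first record is read.
    return new LineRecordReader();
  }
}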

3. Summary

  • 01. job.setInputFormatClass(xxxx.class); the class here must be a subclass of InputFormat (see the driver fragment after this list).
  • 02. All input format classes inherit from InputFormat, which is an abstract class.
  • 03. Different InputFormats implement their own file-reading and splitting strategies; each input split serves as the data source for a single map task.
  • 04. The input to the Mappers is a sequence of input splits, called InputSplits.
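
A minimal driver fragment for point 01 (TextInputFormat is Hadoop's default line-oriented format; the job name and input path are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

Job job = Job.getInstance(new Configuration(), "inputformat-demo");

// Must name a concrete subclass of InputFormat.
job.setInputFormatClass(TextInputFormat.class);

// FileInputFormat subclasses read their input paths from the job config.
FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));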

Reposted from blog.csdn.net/liu16659/article/details/80789499