InputFormat in MapReduce
1. Source Code
package org.apache.hadoop.mapreduce;
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
InputFormat describes the input-specification for a Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the job to:
1.Validate the input-specification of the job.
2.Split-up the input file(s) into logical InputSplits, each of which is then assigned to an
individual Mapper.
3.Provide the RecordReader implementation to be used to glean input records from the logical
InputSplit for processing by the Mapper.
The default behavior of file-based InputFormats, typically sub-classes of FileInputFormat,
is to split the input into logical InputSplits based on the total size, in bytes, of the input
files. However, the FileSystem blocksize of the input files is treated as an upper bound for
input splits. A lower bound on the split size can be set via
mapreduce.input.fileinputformat.split.minsize
Clearly, logical splits based on input size are insufficient for many
applications since record boundaries are to be respected. In such cases, the
application has to also implement a RecordReader on whom lies the
responsibility to respect record-boundaries and present a record-oriented
view of the logical InputSplit to the individual task.
@see InputSplit
@see RecordReader
@see FileInputFormat
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class InputFormat<K, V> {
...
}
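The javadoc above says the FileSystem block size is an upper bound on split size and mapreduce.input.fileinputformat.split.minsize a lower bound. In FileInputFormat these bounds combine into a single clamp, which the sketch below reproduces as a standalone class (the class and variable names here are ours, written for illustration; only the formula mirrors FileInputFormat.computeSplitSize):

```java
// Standalone sketch of how FileInputFormat derives the actual split size:
// the block size, clamped between the configured minimum
// (mapreduce.input.fileinputformat.split.minsize) and maximum.
public class SplitSizeDemo {
    // Mirrors the formula in FileInputFormat.computeSplitSize(blockSize, minSize, maxSize).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // a 128 MB HDFS block
        // Default bounds (minSize = 1, maxSize = Long.MAX_VALUE): split size = block size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
        // Raising minSize above the block size forces splits larger than a block.
        System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE));
    }
}
```

With default bounds a split simply matches the HDFS block, which is why map tasks usually align with blocks.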
2. Overview
InputFormat describes the input specification for map tasks in the MapReduce framework.
A job running on the MapReduce framework relies on its InputFormat for the following:
- 1. Validate the input specification of the job.
- 2. Split the (physical) input files into logical InputSplits; each split is then dispatched to its own Mapper.
- 3. Provide the RecordReader implementation that extracts input records from a logical split for the Mapper to process.
InputFormat declares two abstract methods: getSplits and createRecordReader.
- getSplits
Logically split the set of input files for the job.
Each InputSplit is then assigned to an individual Mapper for processing.
# Each input split is handed to its own Mapper for processing. [How is it assigned?]
Note: The split is a logical split of the inputs and the input files are not physically
split into chunks.
For e.g. a split could be <input-file-path, start, offset> tuple. The InputFormat also
creates the RecordReader to read the InputSplit.
# Note: this is a logical split; the file is never physically cut into chunks. A split
such as the <input-file-path, start, offset> tuple merely records a byte range. The
RecordReader that reads the InputSplit is the second abstract method, introduced below.
@param context job configuration.
@return an array of InputSplits for the job.
public abstract
List<InputSplit> getSplits(JobContext context
) throws IOException, InterruptedException;
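The note above — that a split is only a logical byte range — can be made concrete. Below is a hypothetical stand-in for Hadoop's FileSplit (all names ours; the real FileInputFormat also applies a slop factor when deciding whether to emit a short final split, which is omitted here):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-in showing that a "split" is just a
// (path, start, length) description of a byte range, not file data.
public class LogicalSplitDemo {
    record Split(String path, long start, long length) {}

    // Carve a file of the given length into logical splits of splitSize bytes;
    // the last split keeps the remainder. No file data is touched or copied.
    static List<Split> getSplits(String path, long fileLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Split(path, start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 300-byte "file" with 128-byte splits -> ranges [0,128), [128,256), [256,300).
        for (Split s : getSplits("/data/input.txt", 300, 128)) {
            System.out.println(s);
        }
    }
}
```

Each tuple printed here is exactly what a Mapper is later handed: a pointer into the file, not a copy of its bytes.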
- createRecordReader
Create a record reader for a given split. The framework will call
RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.
# A record reader is created for each (logical) split. The MapReduce framework calls
RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is consumed.
@param split: the split to be read
@param context: the information about the task
@return: a new record reader
@throws IOException
@throws InterruptedException
public abstract
RecordReader<K,V> createRecordReader(InputSplit split,
TaskAttemptContext context
) throws IOException,
InterruptedException;
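A RecordReader's responsibility — presenting a record-oriented view of a byte range — can be sketched without Hadoop. The class below is a simplified, self-contained analogue of a line reader, not Hadoop's LineRecordReader; in particular, the real reader also handles records that straddle split boundaries, which is omitted here:

```java
// Simplified, self-contained analogue of a line-oriented RecordReader:
// it walks a byte range of the input and yields one record (line) per call,
// mimicking the nextKeyValue()/getCurrentValue() protocol. Hadoop's real
// LineRecordReader additionally handles lines crossing split boundaries.
public class MiniRecordReader {
    private final byte[] data;
    private final long end;
    private long pos;
    private String currentValue;

    MiniRecordReader(byte[] data, long start, long length) {
        this.data = data;
        this.pos = start;
        this.end = start + length;
    }

    // Advance to the next record; returns false when the split is exhausted.
    boolean nextKeyValue() {
        if (pos >= end) return false;
        int lineEnd = (int) pos;
        while (lineEnd < end && data[lineEnd] != '\n') lineEnd++;
        currentValue = new String(data, (int) pos, lineEnd - (int) pos);
        pos = lineEnd + 1; // skip past the newline
        return true;
    }

    String getCurrentValue() { return currentValue; }

    public static void main(String[] args) {
        byte[] input = "alpha\nbeta\ngamma\n".getBytes();
        MiniRecordReader reader = new MiniRecordReader(input, 0, input.length);
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentValue());
        }
    }
}
```

This is the "record-oriented view" the javadoc promises: the Mapper never sees raw bytes or the split arithmetic, only whole records.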
3. Summary
- 01. In job.setInputFormatClass(xxxx.class), xxxx must be a class that extends InputFormat.
- 02. All input format classes extend InputFormat, which is an abstract class.
- 03. Each InputFormat implements its own file-reading and splitting strategy; every input split becomes the data source of a single map task.
- 04. The input to each Mapper is a single input split, called an InputSplit.
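The contract summarized above can be sketched end to end: the framework asks the InputFormat for splits, then opens a reader per split and feeds each record to a map task. Below is a toy, Hadoop-free model of that flow — every name is ours, and the reader is collapsed to a list of records for brevity:

```java
import java.util.List;

// Toy, Hadoop-free model of the InputFormat contract: the "framework"
// asks the format for splits, then drives one reader per split, handing
// each record to a mapper. All names here are illustrative.
public class MiniFramework {
    record Split(int start, int length) {}

    interface Format {
        List<Split> getSplits(String input);
        // Returns the records of one split, standing in for a RecordReader.
        List<String> createRecordReader(Split split, String input);
    }

    // A word-per-record format: one split per half of the token list.
    static class WordFormat implements Format {
        public List<Split> getSplits(String input) {
            int n = input.split(" ").length;
            return List.of(new Split(0, n / 2), new Split(n / 2, n - n / 2));
        }
        public List<String> createRecordReader(Split s, String input) {
            String[] words = input.split(" ");
            return List.of(words).subList(s.start(), s.start() + s.length());
        }
    }

    // Run the "map" function (here: uppercase) over every record of every split.
    static List<String> run(Format fmt, String input) {
        List<String> out = new java.util.ArrayList<>();
        for (Split s : fmt.getSplits(input)) {
            for (String record : fmt.createRecordReader(s, input)) {
                out.add(record.toUpperCase());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(new WordFormat(), "hadoop map reduce input"));
    }
}
```

Swapping in a different Format changes how input is split and read without touching the driver loop — which is exactly why MapReduce jobs only need job.setInputFormatClass to change input handling.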