MapReduce parallelism mechanism

MapTask parallelism refers to the number of MapTask instances that process data concurrently during the map phase. This degree of parallelism directly affects the processing speed of the whole job. So, is more MapTask parallelism always better? And how is the degree of parallelism determined?

The map-phase parallelism of a MapReduce job is determined by the client when it submits the job: before submission, the client divides the input data into logical slices (splits). The resulting split plan is written to a file (job.split), and each logical split is eventually handled by one MapTask.

The logical slicing is done by the getSplits() method of the FileInputFormat implementation class.

FileInputFormat slicing mechanism

The default slicing mechanism of FileInputFormat:

A. Files are sliced simply according to the length of their content.

B. The slice size defaults to the block size.

C. Slicing does not treat the data set as a whole; each file is sliced separately.

For example there are two data files to be processed:

file1.txt 320M

file2.txt 10M  

After FileInputFormat has sliced them, the resulting split information is as follows:

file1.txt.split1—0M~128M

file1.txt.split2—128M~256M

file1.txt.split3—256M~320M

file2.txt.split1—0M~10M
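The per-file slicing above can be reproduced with a short, self-contained sketch. This is a simplified illustration, not the actual Hadoop code: the class name PerFileSplitDemo and the method planSplits are hypothetical, the split size is assumed to be 128M, and the special handling of a small trailing remainder in FileInputFormat is ignored here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of per-file slicing; hypothetical names, not the Hadoop implementation.
class PerFileSplitDemo {
    static final long MB = 1024L * 1024L;

    // Cuts a single file into consecutive ranges of at most splitSize bytes.
    // Each file is planned on its own, so ranges never span two files.
    static List<String> planSplits(String fileName, long fileLength, long splitSize) {
        List<String> splits = new ArrayList<>();
        long offset = 0;
        int index = 1;
        while (offset < fileLength) {
            long end = Math.min(offset + splitSize, fileLength);
            splits.add(fileName + ".split" + index + ": " + (offset / MB) + "M~" + (end / MB) + "M");
            offset = end;
            index++;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Insertion-ordered map so the output follows the order of the example.
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("file1.txt", 320 * MB);
        files.put("file2.txt", 10 * MB);

        for (Map.Entry<String, Long> file : files.entrySet()) {
            for (String split : planSplits(file.getKey(), file.getValue(), 128 * MB)) {
                System.out.println(split);
            }
        }
    }
}
```

Under these assumptions the sketch yields three ranges for file1.txt and one for file2.txt, so four MapTasks would be started even though file2.txt is far smaller than one block.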

FileInputFormat split size parameters

In FileInputFormat, the split size is calculated as:

Math.max(minSize, Math.min(maxSize, blockSize)); 

The split size is determined mainly by these values:

minSize: default value 1

Configuration parameter: mapreduce.input.fileinputformat.split.minsize

maxSize: default value Long.MAX_VALUE

Configuration parameter: mapreduce.input.fileinputformat.split.maxsize

blockSize: the block size of the file

Therefore, by default, split size = block size, which is 128M in Hadoop 2.x.

maxSize (maximum split size): if this parameter is set smaller than blockSize, the split becomes smaller and equals the configured value.

minSize (minimum split size): if this parameter is set larger than blockSize, the split can become larger than blockSize.
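To make the effect of the two parameters concrete, here is a minimal sketch that evaluates the formula Math.max(minSize, Math.min(maxSize, blockSize)) for three settings. The class name SplitSizeDemo and the chosen values (64M and 256M) are assumptions made for illustration; only the formula itself comes from the text above.

```java
// Evaluates the split size formula for a few parameter combinations.
// SplitSizeDemo is a hypothetical name used only for this illustration.
class SplitSizeDemo {
    static final long MB = 1024L * 1024L;

    // Same expression as quoted in the text.
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB;

        // Defaults: minSize = 1, maxSize = Long.MAX_VALUE, so split size = block size.
        System.out.println(computeSplitSize(1L, Long.MAX_VALUE, blockSize) / MB + "M");

        // maxSize below blockSize: the split shrinks to maxSize.
        System.out.println(computeSplitSize(1L, 64 * MB, blockSize) / MB + "M");

        // minSize above blockSize: the split grows to minSize.
        System.out.println(computeSplitSize(256 * MB, Long.MAX_VALUE, blockSize) / MB + "M");
    }
}
```

With a 128M block, the three calls give 128M, 64M and 256M respectively, which matches the two tuning rules stated above.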

However, no matter how these parameters are tuned, multiple small files cannot be merged into a single split.

There is a detail:

A full-size split is cut off only while bytesRemaining / splitSize > 1.1 holds. Once this condition is no longer satisfied, all the remaining bytes become the final split. This avoids, for example, a 129M file being planned as two splits.
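The 1.1 rule can be sketched as a small loop. This is an illustrative reconstruction based on the condition described above, not a copy of the Hadoop source: the class name SplitSlopDemo and the method splitLengths are hypothetical, and the split size is assumed to be 128M.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "remaining bytes" rule; hypothetical names, assumed 128M split size.
class SplitSlopDemo {
    static final long MB = 1024L * 1024L;
    static final double SPLIT_SLOP = 1.1; // threshold taken from the condition in the text

    // Returns the length of each split planned for a file of the given length.
    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long bytesRemaining = fileLength;

        // Cut full-size splits only while more than 1.1 split sizes remain.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            bytesRemaining -= splitSize;
        }

        // Whatever is left (up to 1.1 split sizes) becomes the final split.
        if (bytesRemaining != 0) {
            lengths.add(bytesRemaining);
        }
        return lengths;
    }

    public static void main(String[] args) {
        // 129M / 128M is about 1.008, not greater than 1.1: one split of 129M.
        System.out.println(splitLengths(129 * MB, 128 * MB).size());

        // 200M / 128M is about 1.56: one 128M split, then a 72M remainder.
        System.out.println(splitLengths(200 * MB, 128 * MB).size());
    }
}
```

The design trade-off is that a final split may be up to 10% larger than the nominal split size, in exchange for not starting an extra MapTask for a tiny tail.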




Origin www.cnblogs.com/TiePiHeTao/p/0eb21b7deab3eaeae9d1df5184b066f2.html