MapReduce secondary sorting

     By default, MapReduce sorts the Map output by Key, but sometimes we need to sort by Key and then, for each Key, sort the Values as well. This is where secondary sorting comes in. Let us first look at what secondary sorting is and how it works.

The principle of secondary sorting
        Secondary sorting can be broken down into the following stages.
The Initial Map Stage
        In the Map stage, the InputFormat defined by job.setInputFormatClass() divides the input data set into splits, and the InputFormat provides a RecordReader implementation. This example uses TextInputFormat, whose RecordReader takes the byte offset of each line as the Key and the text of the line as the Value. That is why the input of the custom Mapper is <LongWritable, Text>. Each <LongWritable, Text> key-value pair is then passed to the map() method of the custom Mapper.
The End of the Map Stage
        At the end of the Map stage, the partitioner set by job.setPartitionerClass() partitions the Mapper's output, and each partition is mapped to one Reducer. Within each partition, the Key comparator class set by job.setSortComparatorClass() is used for sorting. As you can see, this is already a two-step ordering: first by partition, then by Key. If no Key comparator class is set via job.setSortComparatorClass(), the compareTo() method implemented by the Key class is used. We can therefore either rely on the compareTo() method implemented by IntPair or define a dedicated Key comparator class.
Reduce Phase
        In the Reduce phase, after receiving all the map outputs mapped to it, the Reducer again sorts all the data using the Key comparator class set by job.setSortComparatorClass(). It then constructs a Value iterator for each Key. This is where grouping comes in: the grouping comparator class is set with job.setGroupingComparatorClass(). As long as two Keys compare as equal under this comparator, they belong to the same group, their Values are placed in one Value iterator, and the Key associated with that iterator is the first Key of the group. The last step is the reduce() method of the Reducer, whose input is each Key together with its Value iterator. Note that the input and output types must match those declared in the custom Reducer.

Next, a data example gives an intuitive picture of how secondary sorting works.

The content of the input file sort.txt is:

40 20
40 10
40 30
40 5
30 30
30 20
30 10
30 40
50 20
50 50
50 10
50 60
        The content of the output file (sorted in ascending order) is as follows:

30 10
30 20
30 30
30 40
===============================
40 5
40 10
40 20
40 30
===============================
50 10
50 20
50 50
50 60

The specific process of secondary sorting
        In MapReduce, all Keys must be compared and sorted, and this happens in two steps: first by Partitioner, then by value within each partition. In this example we want to sort by the first field, and, when the first fields are equal, by the second field. To achieve this, we build a composite key class IntPair containing both fields: the partitioner distributes records by the first field, and the comparison within each partition then orders records by the second field as well.
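To make this ordering concrete, here is a small stand-alone Java sketch (no Hadoop involved; the class name CompositeSortDemo is made up for illustration) that sorts the sample pairs exactly the way the composite key will: by the first field, with the second field as a tie-breaker.

```java
import java.util.Arrays;
import java.util.Comparator;

// Stand-alone sketch: sort int pairs by the first field, then the second,
// mimicking the ordering the composite IntPair key produces.
public class CompositeSortDemo {
    // Sorts the pairs in place and returns them: first field first,
    // second field as tie-breaker.
    static int[][] sortPairs(int[][] pairs) {
        Arrays.sort(pairs, Comparator
                .comparingInt((int[] p) -> p[0])
                .thenComparingInt(p -> p[1]));
        return pairs;
    }

    public static void main(String[] args) {
        // The sample data from sort.txt above.
        int[][] pairs = {
                {40, 20}, {40, 10}, {40, 30}, {40, 5},
                {30, 30}, {30, 20}, {30, 10}, {30, 40},
                {50, 20}, {50, 50}, {50, 10}, {50, 60}
        };
        for (int[] p : sortPairs(pairs)) {
            System.out.println(p[0] + " " + p[1]);
        }
    }
}
```

Running this prints the twelve pairs in the same order as the expected output above (without the group separators).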

Code Implementation
        Hadoop's example package ships with a MapReduce secondary-sort example. The code below adapts that source code. We implement secondary sorting in the following steps:

        Step 1: Define a custom IntPair class that encapsulates the two fields of each sample record as a single composite Key. It implements the WritableComparable interface and overrides its methods.
/**
* The key class defined by yourself should implement the WritableComparable interface
*/
public class IntPair implements WritableComparable<IntPair>{
	int first;//The first member variable
	int second;//The second member variable
	public void set(int left, int right){
		first = left;
		second = right;
	}
	public int getFirst(){
		return first;
	}
	public int getSecond(){
		return second;
	}
	@Override
	//Deserialize, convert from binary in stream to IntPair
	public void readFields(DataInput in) throws IOException{
		first = in.readInt();
		second = in.readInt();
	}
	@Override
	//Serialize, convert IntPair to binary using streaming
	public void write(DataOutput out) throws IOException{
		out.writeInt(first);
		out.writeInt(second);
	}
	@Override
	//key comparison
	public int compareTo(IntPair o){
		if (first != o.first){
			return first < o.first ? -1 : 1;
		}else if (second != o.second){
			return second < o.second ? -1 : 1;
		}else{
			return 0;
		}
	}
	
	@Override
	public int hashCode(){
		return first * 157 + second;
	}
	@Override
	public boolean equals(Object right){
		if (right == null)
			return false;
		if (this == right)
			return true;
		if (right instanceof IntPair){
			IntPair r = (IntPair) right;
			return r.first == first && r.second == second;
		}else{
			return false;
		}
	}
}


Step 2: Define a custom partition class, FirstPartitioner, which partitions records according to the first field of IntPair.
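The FirstPartitioner code is not listed in this article; a minimal version, modeled on the one in Hadoop's example package, might look like this:

```java
/**
 * Partitions records by the first field of IntPair only, so that all
 * pairs sharing a first value land in the same partition (Reducer).
 * Sketch modeled on the FirstPartitioner in Hadoop's example package.
 */
public static class FirstPartitioner extends Partitioner<IntPair, IntWritable> {
    @Override
    public int getPartition(IntPair key, IntWritable value, int numPartitions) {
        // 127 is just a mixing multiplier taken from the Hadoop example;
        // any scheme that depends only on getFirst() keeps equal first
        // fields together in one partition.
        return Math.abs(key.getFirst() * 127) % numPartitions;
    }
}
```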

Step 3: Optionally define a custom SortComparator to order IntPair by its first and second fields. This example does not use one; sorting relies on the compareTo() method implemented in IntPair instead.
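For completeness, such a Key comparator (the KeyComparator mentioned in the commented-out driver line is not defined in the article; this sketch simply mirrors IntPair.compareTo()) could look like:

```java
/**
 * Optional Key comparator mirroring IntPair.compareTo(): order by the
 * first field, then by the second. Not used in this example; shown only
 * as a sketch of what job.setSortComparatorClass() would receive.
 */
public static class KeyComparator extends WritableComparator {
    protected KeyComparator() {
        super(IntPair.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        IntPair ip1 = (IntPair) w1;
        IntPair ip2 = (IntPair) w2;
        int cmp = Integer.compare(ip1.getFirst(), ip2.getFirst());
        return cmp != 0 ? cmp : Integer.compare(ip1.getSecond(), ip2.getSecond());
    }
}
```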

Step 4: Define a custom GroupingComparator class to group the data within each partition.
/**
* Grouping comparator: extends WritableComparator and compares only the
* first field, so all pairs with the same first value form one group.
*/
public static class GroupingComparator extends WritableComparator{
        protected GroupingComparator(){
            super(IntPair.class, true);
        }
        @Override
        //Compare two WritableComparables.
        public int compare(WritableComparable w1, WritableComparable w2){
            IntPair ip1 = (IntPair) w1;
            IntPair ip2 = (IntPair) w2;
            int l = ip1.getFirst();
            int r = ip2.getFirst();
            return l == r ? 0 : (l < r ? -1 : 1);
        }
}


  Step 5: Write the MapReduce main program to implement secondary sorting.
public class SecondarySort{
    // custom map
    public static class Map extends Mapper<LongWritable, Text, IntPair, IntWritable>{
        private final IntPair intkey = new IntPair();
        private final IntWritable intvalue = new IntWritable();
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            int left = 0;
            int right = 0;
            if (tokenizer.hasMoreTokens()){
                left = Integer.parseInt(tokenizer.nextToken());
                if (tokenizer.hasMoreTokens())
                    right = Integer.parseInt(tokenizer.nextToken());
                intkey.set(left, right);
                intvalue.set(right);
                context.write(intkey, intvalue);
            }
        }
    }
    // custom reduce
    public static class Reduce extends Reducer<IntPair, IntWritable, Text, IntWritable>{
        private final Text left = new Text();      
        public void reduce(IntPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException{
            left.set(Integer.toString(key.getFirst()));
            for (IntWritable val : values){
                context.write(left, val);
            }
        }
    }
    /**
     * @param args
     */
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException{
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "secondarysort");
        job.setJarByClass(SecondarySort.class);
        
        FileInputFormat.setInputPaths(job, new Path(args[0]));  // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path

        job.setMapperClass(Map.class);// Mapper
        job.setReducerClass(Reduce.class);// Reducer
        
        job.setPartitionerClass(FirstPartitioner.class);// Partition function
        //job.setSortComparatorClass(KeyComparator.class); // No custom SortComparator here; sorting relies on IntPair.compareTo()
        job.setGroupingComparatorClass(GroupingComparator.class);// Grouping function


        job.setMapOutputKeyClass(IntPair.class);
        job.setMapOutputValueClass(IntWritable.class);
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
       
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
