MapReduce case study: data sorting

1  Data sorting

1.1  Example description

Sort the numbers in the input files. Each line of an input file contains a single number. Each line of the output must contain two numbers separated by a tab (Hadoop's default key/value separator): the first is the rank of the number in the full sorted data set, and the second is the original number.
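For example (the sample numbers below are made up for illustration), an input file containing

    32
    654
    32
    15

should produce the following output, where each line is the rank, a tab, and the original number:

    1   15
    2   32
    3   32
    4   654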

1.2  Application scenarios

" Data sorting " is the first job to be completed when many practical tasks are performed, such as student grades, data indexing, etc. This example is similar to data deduplication, in which the original data is initially processed to lay a solid foundation for further data operations.

1.3  Design ideas

This example only requires sorting the input data. Readers familiar with the MapReduce process will quickly recall that MapReduce already sorts intermediate data by key during the shuffle phase. Can we rely on this built-in sort instead of implementing our own? The answer is yes.

But before relying on it, we must first understand the default sort rule. Intermediate data is sorted by key: if the key is an IntWritable, which wraps an int, MapReduce sorts the keys numerically; if the key is a Text, which wraps a String, MapReduce sorts the keys lexicographically.
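A quick way to see the difference between the two orderings is to compare the wrappers directly. Below is a minimal standalone sketch (the class name SortOrderDemo is ours; it assumes hadoop-common is on the classpath):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class SortOrderDemo {
        public static void main(String[] args) {
            // IntWritable compares numerically: 9 < 10, so the result is negative
            System.out.println(new IntWritable(9).compareTo(new IntWritable(10)));
            // Text compares lexicographically: "9" > "10" because '9' > '1', so the result is positive
            System.out.println(new Text("9").compareTo(new Text("10")));
        }
    }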

Knowing this detail, we should use IntWritable, the wrapper for int. That is, map converts each line it reads into an IntWritable and emits it as the key (the value is arbitrary; this program emits a constant 1). When reduce receives a <key, value-list> pair, it writes the input key out as the output value, once for every element in the value-list, so duplicate numbers each receive their own rank. The output key (linenum in the code) is a global counter that tracks the rank of the current key. Because every key must pass through this single counter in sorted order, the design assumes the job runs with one reducer, which is the default. Note that no combiner is configured in this program, i.e. the combiner step of the MapReduce process is unused: map and reduce alone complete the task, and the Reduce class here could not double as a combiner anyway, since it changes the key.
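To make the data flow concrete, reuse the sample numbers from section 1.1: the mappers emit (32,1), (654,1), (32,1) and (15,1); the shuffle phase sorts and groups these pairs by key, so the reducer receives <15,[1]>, <32,[1,1]> and <654,[1]> in ascending key order and writes (1,15), (2,32), (3,32) and (4,654), exactly the rank/value pairs the problem statement asks for.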

1.4  Program code

    The program code is as follows:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Sort {

    // map converts each input value into an IntWritable and emits it as the output key
    public static class Map extends
            Mapper<Object, Text, IntWritable, IntWritable> {

        private static IntWritable data = new IntWritable();

        // implement the map function
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            data.set(Integer.parseInt(line.trim()));
            context.write(data, new IntWritable(1));
        }
    }

    // reduce copies the input key to the output value, then writes it once per
    // element in the input value-list, so duplicates each get their own rank;
    // the global linenum holds the rank of the current key
    public static class Reduce extends
            Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

        private static IntWritable linenum = new IntWritable(1);

        // implement the reduce function
        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            for (IntWritable val : values) {
                context.write(linenum, key);
                linenum = new IntWritable(linenum.get() + 1);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // points the job at an old-style JobTracker; adjust or remove for your cluster
        conf.set("mapred.job.tracker", "192.168.1.2:9001");
        // input/output directories are hard-coded; command-line arguments are ignored
        String[] ioArgs = new String[] { "sort_in", "sort_out" };
        String[] otherArgs = new GenericOptionsParser(conf, ioArgs).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: Data Sort <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Data Sort");
        job.setJarByClass(Sort.class);
        // set the Mapper and Reducer classes
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // set the output key/value types
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        // set the input and output directories
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
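To run the example (the jar name Sort.jar is our assumption), package the class and launch it with the hadoop command; since the input and output directories are hard-coded in main, no further arguments are needed:

    hadoop jar Sort.jar Sort

The input files must already exist in the sort_in directory on HDFS, and the sorted result appears under sort_out (typically in a file named part-r-00000).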

