A Hadoop getting-started example

How Hadoop runs a job:

  1. A client asks HDFS to store a file, or submits a computation to MapReduce.
  2. The NameNode manages the block and metadata information for the whole HDFS system, while the DataNodes actually store and serve the data blocks. The client asks the NameNode which DataNodes hold the file it wants to access or process, and then sends its operation requests to those DataNodes.
  3. When a client uploads a new file (for example, a batch of logs), it is split into fixed-size blocks (64 MB by default in Hadoop 1.x, 128 MB in later releases) and each block is replicated across DataNodes for reliability. The block size is a trade-off between cost, throughput, and fault tolerance, and a checksum is stored with each block so HDFS can verify that every replica on disk is still consistent (the sketch after this list shows how to inspect these properties).
  4. Once the data sits on the DataNodes, a MapReduce job can process it. Mapper tasks read input records and emit intermediate key/value pairs; each reducer task then reads the subset of mapper output assigned to it by key and writes its own results separately.
  5. After the mappers produce their intermediate results, those results are sorted and partitioned by key (the shuffle phase) and handed to the reducers, which compute the final output.
  6. The output files are written back to HDFS and replicated like any other HDFS file (three copies by default).
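
To make step 3 concrete, the block size, replication factor, and block locations of a file that is already in HDFS can be inspected through Hadoop's FileSystem API. This is a minimal sketch; the path /input/access.log is only a placeholder for whatever file you uploaded:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/input/access.log");  // placeholder path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size  : " + status.getBlockSize());
        System.out.println("replication : " + status.getReplication());

        // One entry per block, listing the DataNodes that hold a replica of it.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + loc.getOffset()
                    + ", length " + loc.getLength()
                    + ", hosts " + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}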

A simple example

Note that the exact behaviour depends on the MapReduce execution mode: the job can run locally in a single JVM (for testing) or be submitted to a YARN cluster.
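
As an illustration, the execution mode is normally selected through the mapreduce.framework.name property. The fragment below is a sketch (with a placeholder job name) that would run everything in a single local JVM, which is convenient for testing the code in this article before submitting it to a real cluster:

// Inside the driver, before the mapper/reducer classes are configured:
Configuration conf = new Configuration();
// "local" runs all map and reduce tasks in one JVM (useful for testing);
// "yarn" submits the job to the cluster's resource manager.
conf.set("mapreduce.framework.name", "local");
Job job = Job.getInstance(conf, "word count (local test)");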

  1. Preparation

Install and start a Hadoop cluster, then transfer the input data files to HDFS (for example with hdfs dfs -put, or programmatically as sketched below).
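
A minimal sketch of the programmatic route, assuming a local input file named access.log (a placeholder) and /input as the target directory in HDFS; the hdfs dfs -put command achieves the same thing from the shell:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        Path target = new Path("/input");          // placeholder HDFS directory
        fs.mkdirs(target);
        // Copy the local file into HDFS; it is split into blocks and replicated automatically.
        fs.copyFromLocalFile(new Path("access.log"), target);

        fs.close();
    }
}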

  2. Write the Mapper class

The Mapper parses the input and produces key/value pairs. With the default TextInputFormat, map() is called once per line of input; here each line is split on whitespace and a (word, 1) pair is emitted for every word:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // value is one line of input; split it on whitespace and emit (word, 1) for each token.
        String[] words = value.toString().split("\\s+");
        for (String w : words) {
            word.set(w);
            context.write(word, one);
        }
    }
}
  3. Write the Reducer class

The Reducer is the other key part of this example. For each word it receives a Text key and an Iterable<IntWritable> of partial counts, sums them, and collects the totals in a TreeMap sorted in descending order, so that only the most frequent words need to be written out in cleanup():

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Maps total count -> words with that count, ordered from highest count to lowest.
    private TreeMap<Integer, List<String>> sortedMap;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        sortedMap = new TreeMap<Integer, List<String>>(Collections.reverseOrder());
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all partial counts for this word.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        List<String> keys = sortedMap.get(sum);
        if (keys == null) {
            keys = new ArrayList<String>();
        }
        keys.add(key.toString());
        sortedMap.put(sum, keys);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the 10 most frequent words with their counts.
        int counter = 0;
        for (Map.Entry<Integer, List<String>> entry : sortedMap.entrySet()) {
            for (String val : entry.getValue()) {
                if (counter++ == 10) {
                    return; // stop once the top 10 words have been written
                }
                context.write(new Text(val), new IntWritable(entry.getKey()));
            }
        }
    }
}
  4. Configure the job

The driver configures the Hadoop job: it registers WordCountMapper and WordCountReducer, sets the output key/value types, and uses LazyOutputFormat so that output files are only created once the first record is actually written.

Set up the job and pass in the input/output paths (this snippet belongs in the driver's main() or run() method; a full class skeleton is sketched after it):

Job job = Job.getInstance(getConf(), "word count");
job.setJarByClass(WordCount.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
// A combiner is deliberately not set: this reducer only emits the top 10 words in
// cleanup(), so reusing it as a combiner would silently drop counts and change the result.

// One reducer is enough here, since the final output (10 words) is very small.
job.setNumReduceTasks(1);

// Create output files lazily, so that no empty part files are written.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

System.exit(job.waitForCompletion(true) ? 0 : 1);
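
Because the snippet calls getConf(), it assumes a driver class that extends Configured and implements Tool. The skeleton below is a sketch of how such a driver is usually wired together; the class name WordCount matches setJarByClass above, while the ToolRunner plumbing and the imports follow the standard Hadoop Tool pattern:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "word count");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setNumReduceTasks(1);
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic options (-D, -files, ...) and then calls run().
        System.exit(ToolRunner.run(new WordCount(), args));
    }
}
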
  5. Run the job

Upload the compiled jar to the Hadoop cluster and run the MapReduce job with the following command (if the jar's manifest does not declare a main class, add the fully qualified driver class name after the jar name):

hadoop jar wordcount.jar /input /output

This simple example reads a text file from HDFS and writes out the 10 most frequently occurring words.

Origin blog.csdn.net/weixin_43760048/article/details/130005711