Basics of MapReduce Programming

(1) Basic MapReduce programming to implement word frequency statistics.

① In the "/user/hadoop/input" folder in HDFS (the folder is initially empty), you need two input files. Create the files wordfile1.txt and wordfile2.txt and upload them to the HDFS "input" folder.
The content of wordfile1.txt is as follows:
I love Spark
I love Hadoop
The content of wordfile2.txt is as follows:
Hadoop is good
Spark is fast
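The tutorial does not show how the two files are created. Assuming they are created in the "/usr/local/hadoop" directory (where the run steps in section (2) expect them), one way to create them is shown below; the upload to HDFS itself is covered in the run steps later.
cd /usr/local/hadoop
echo "I love Spark" > wordfile1.txt
echo "I love Hadoop" >> wordfile1.txt
echo "Hadoop is good" > wordfile2.txt
echo "Spark is fast" >> wordfile2.txt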
② Start Eclipse. After it starts, a dialog pops up prompting you to set the workspace. You can accept the default "/home/hadoop/workspace" and click the "OK" button. Since the hadoop user is currently logged in to the Linux system, the default workspace directory is located under that user's home directory "/home/hadoop".
③ After Eclipse starts, select the "File–>New–>Java Project" menu to create a Java project.
④ Enter the project name "WordCount" in the "Project name" field and check "Use default location", so that all files of this project are saved in the "/home/hadoop/workspace/WordCount" directory. In the "JRE" tab you can choose the JDK already installed in the current Linux system, such as jdk1.8.0_162. Then click the "Next>" button at the bottom of the interface to go to the next step of the setup.
⑤ In the next step you need to load the JAR packages required by the project, including the JAR packages that provide the Hadoop Java API. These packages are located in the Hadoop installation directory of the Linux system, which for this tutorial is "/usr/local/hadoop/share/hadoop". Click the "Libraries" tab, and then click the "Add External JARs…" button on the right side of the interface; a file-selection dialog pops up.
⑥ In this dialog there is a row of directory buttons (namely "usr", "local", "hadoop", "share", "hadoop", "mapreduce" and "lib"); clicking one of them lists the contents of that directory below.
To write a MapReduce program, you generally need to add the following JAR packages to the Java project:
a. hadoop-common-3.1.3.jar and hadoop-nfs-3.1.3.jar under the "/usr/local/hadoop/share/hadoop/common" directory;
b. all JAR packages under the "/usr/local/hadoop/share/hadoop/common/lib" directory;
c. all JAR packages under the "/usr/local/hadoop/share/hadoop/mapreduce" directory, excluding the jdiff, lib, lib-examples and sources subdirectories.
⑦ Write the Java application, i.e. WordCount.java. In the "Package Explorer" panel on the left side of the Eclipse working interface, find the project "WordCount" created just now, right-click the project name, and select the "New–>Class" menu in the pop-up menu.
⑧ After selecting the "New–>Class" menu, a dialog appears. In this dialog you only need to enter the name of the new Java class file in the "Name" field; the name "WordCount" is used here, and the default settings can be kept for everything else. Then click the "Finish" button in the lower right corner of the dialog.
⑨ You can see that Eclipse automatically creates a source code file named "WordCount.java" containing the stub code "public class WordCount {}". Clear the code in the file, and then enter the complete word frequency statistics program code in the file.
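The tutorial does not reproduce the program text at this point. A minimal sketch of the word frequency statistics code, based on the standard Hadoop WordCount example (the author's exact code may differ in details), is:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Split each input line into words and emit a (word, 1) pair per word
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Sum the counts received for each word
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // Counting is associative, so the reducer can safely double as a combiner
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The program takes its input and output directories from the command line, which matches the way the JAR package is run on Hadoop in section (2) below.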

(2) Configure the Eclipse environment and run the word frequency statistics program.

(1) Compile and package the program
① To compile and run the code written above, directly click the "Run As" shortcut button in the upper part of the Eclipse working interface and select "Java Application".

② A dialog then pops up; click the "OK" button in its lower right corner to start running the program.
③ After the program finishes running, the result information is displayed in the "Console" panel at the bottom of the working interface.
④ Next, you can package the Java application into a JAR package and deploy it to the Hadoop platform to run. This tutorial puts the word frequency statistics program in the "/usr/local/hadoop/myapp" directory. If the directory does not exist, create it with the following commands.
cd /usr/local/hadoop
mkdir myapp
⑤ In the "Package Explorer" panel on the left side of the Eclipse working interface, right-click the project name "WordCount" and select "Export" in the pop-up menu.
⑥ In the dialog that pops up, select "Runnable JAR file".
⑦ Then click the "Next>" button, and the next dialog appears. In this dialog, "Launch configuration" sets the main class to run when the generated JAR package is deployed and started; select the configuration "WordCount - WordCount" from the drop-down list. "Export destination" sets the directory where the JAR package will be output and saved; set it to "/usr/local/hadoop/myapp/WordCount.jar" here. Under "Library handling", select "Extract required libraries into generated JAR".
⑧ Then click the "Finish" button.
⑨ You can ignore the information in the next dialog and directly click the "OK" button in its lower right corner to start the packaging process. After packaging completes, a warning message dialog appears.
⑩ You can also ignore this warning and directly click the "OK" button in its lower right corner. At this point the WordCount project has been successfully packaged into WordCount.jar. You can check the generated WordCount.jar in the Linux system: execute a directory listing in a Linux terminal and you will see that a WordCount.jar file now exists in the "/usr/local/hadoop/myapp" directory.
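The original shows this check only as a screenshot; a simple listing command such as the following would serve:
ls /usr/local/hadoop/myapp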
(2) Running the program
① Before running the program, Hadoop needs to be started.
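The start command appears only as a screenshot in the original. For the pseudo-distributed installation under "/usr/local/hadoop" that this tutorial assumes, starting HDFS would typically be:
cd /usr/local/hadoop
./sbin/start-dfs.sh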
② After starting Hadoop, delete the input and output directories corresponding to the current Linux user hadoop in HDFS (that is, the "/user/hadoop/input" and "/user/hadoop/output" directories in HDFS), to ensure that the subsequent run does not hit problems.
③ Then create a new input directory corresponding to the current user in HDFS, that is, the "/user/hadoop/input" directory.
④ Then upload the two newly created files wordfile1.txt and wordfile2.txt in the local Linux file system (the two files are located in the "/usr/local/hadoop" directory and contain some English sentences) to the "/user/hadoop/input" directory in HDFS.
⑤ If the directory "/user/hadoop/output" already exists in HDFS, delete that directory first.
⑥ Now you can use the hadoop jar command in the Linux system to run the program. When the run ends successfully, job-completion information will be displayed on the screen.
⑦ At this point the word frequency statistics results have been written into the "/user/hadoop/output" directory in HDFS; printing that directory displays the word frequency statistics results on the screen. The commands for steps ② through ⑦ are sketched together below.
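The commands for steps ② through ⑦ appear only as screenshots in the original. Under the paths used in this tutorial, a plausible command sequence is:
cd /usr/local/hadoop
./bin/hdfs dfs -rm -r input
./bin/hdfs dfs -rm -r output
./bin/hdfs dfs -mkdir input
./bin/hdfs dfs -put ./wordfile1.txt input
./bin/hdfs dfs -put ./wordfile2.txt input
./bin/hadoop jar ./myapp/WordCount.jar input output
./bin/hdfs dfs -cat output/*
The relative paths input and output resolve to "/user/hadoop/input" and "/user/hadoop/output" because the commands run as the hadoop user; the -rm -r commands simply report an error if the directories do not exist yet.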
So far, the word frequency statistics program has run successfully. Note that if you want to run WordCount.jar again, you need to delete the output directory in HDFS first, otherwise an error will be reported.

(3) Write a MapReduce program that calculates the average score.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Score {

    public static class Map extends
            Mapper<LongWritable, Text, Text, IntWritable> {

        // Implement the map function
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Convert the input text data to a String
            String line = value.toString();
            // First split the input data into lines
            StringTokenizer tokenizerArticle = new StringTokenizer(line, "\n");
            // Process each line separately
            while (tokenizerArticle.hasMoreElements()) {
                // Split each line on whitespace
                StringTokenizer tokenizerLine = new StringTokenizer(tokenizerArticle.nextToken());
                String strName = tokenizerLine.nextToken();  // the student name part
                String strScore = tokenizerLine.nextToken(); // the score part
                Text name = new Text(strName);
                int scoreInt = Integer.parseInt(strScore);
                // Emit the name and the score
                context.write(name, new IntWritable(scoreInt));
            }
        }
    }

    public static class Reduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        // Implement the reduce function
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            int count = 0;
            Iterator<IntWritable> iterator = values.iterator();
            while (iterator.hasNext()) {
                sum += iterator.next().get(); // accumulate the total score
                count++;                      // count the number of subjects
            }
            int average = sum / count; // compute the (integer) average score
            context.write(key, new IntWritable(average));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "localhost:9000" needs to be adjusted for the actual environment
        conf.set("mapred.job.tracker", "localhost:9000");
        // Input and output directories in the HDFS file system
        String[] ioArgs = new String[] { "input/score", "output" };
        String[] otherArgs = new GenericOptionsParser(conf, ioArgs).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: Score Average <in> <out>");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "Score Average");
        job.setJarByClass(Score.class);
        // Set the Map and Reduce classes. Note: the Reduce class must NOT be
        // used as a combiner here, because averaging is not associative --
        // averaging partial averages would give a wrong result.
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // Set the output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // The input format splits the input dataset into splits and provides a RecordReader
        job.setInputFormatClass(TextInputFormat.class);
        // The output format provides a RecordWriter responsible for writing the output
        job.setOutputFormatClass(TextOutputFormat.class);
        // Set the input and output paths
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
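The program expects one "name score" pair per line of input. As a purely hypothetical example, given an input file containing:
Alice 80
Alice 90
Bob 70
Bob 75
the job would emit the integer average per student (TextOutputFormat separates key and value with a tab, and 145/2 truncates to 72):
Alice	85
Bob	72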

(4) What is the working principle of MapReduce?

Analyze the working principle of MapReduce from the perspectives of the Client, the JobTracker, and the TaskTracker.


First, the client writes the MapReduce program, configures the job, and then submits it. Submitting the job means notifying the JobTracker to run the job; at this point the JobTracker returns a new job ID to the client and performs some checks. It checks whether the output directory already exists: if it does, the job cannot run normally and the JobTracker throws an error back to the client. It then checks whether the input directory exists; if it does not, an error is thrown as well. If the input exists, the JobTracker computes the input splits (InputSplit) based on the input, and throws an error if the splits cannot be computed. Once all this is done, the JobTracker allocates the resources the job needs. After obtaining the job ID, the client copies the resource files needed to run the job to HDFS, including the JAR file of the packaged MapReduce program, the configuration files, and the computed input split information. These files are stored in a folder the JobTracker creates specifically for the job, named after the job ID. The JAR file has 10 replicas by default (controlled by the mapred.submit.replication property); the input split information tells the JobTracker how many map tasks should be started for this job, among other things. When the resource folder is populated, the client submits the job, informing the JobTracker that the required resources have been written to HDFS and asking it to execute the job.

After the resources are allocated and the JobTracker receives the submission request, it initializes the job. Initialization mainly means putting the job into an internal queue to wait for the job scheduler to schedule it. When the job scheduler schedules the job according to its own scheduling algorithm, it creates a running job object (encapsulating the tasks and their bookkeeping information) so that the JobTracker can track the job's status and progress. When creating the job object, the job scheduler fetches the input split information from the job's folder in HDFS, creates one map task for each input split, and assigns the map tasks to TaskTrackers for execution. For map and reduce tasks, each TaskTracker has a fixed number of map slots and reduce slots determined by the number of host cores and the size of memory. It should be emphasized that map tasks are not assigned to TaskTrackers at random; this involves data locality, discussed later.

The next step is task assignment. Each TaskTracker runs a simple loop that periodically sends a heartbeat to the JobTracker; the heartbeat interval is 5 seconds by default, and the programmer can configure this time. The heartbeat is the bridge between the JobTracker and the TaskTracker: through it the JobTracker learns the TaskTracker's processing status and problems, and through the heartbeat's return value the TaskTracker obtains operation instructions from the JobTracker. After a task is assigned, the TaskTracker fetches the resources needed to run it, such as the job's code, in preparation for actual execution. Then comes task execution. During execution, the JobTracker can monitor the TaskTracker's status and progress through the heartbeat mechanism and compute the status and progress of the whole job at the same time, while the TaskTracker also monitors its own status and progress locally. Every so often the TaskTracker sends a heartbeat to the JobTracker to tell it that it is still running; the heartbeat also carries much other information, such as the completion progress of the current map task. When the JobTracker receives notification from the last TaskTracker that its assigned task has completed successfully, it sets the status of the whole job to successful. When the client then queries the job's running status (note: this is an asynchronous operation), it learns of the job's completion notification. If the job fails midway, MapReduce has corresponding mechanisms to handle it; generally speaking, unless the failure is a bug in the programmer's own program, MapReduce's error-handling mechanism can ensure that the submitted job completes normally.
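The heartbeat mechanism above can be made concrete with a toy sketch. The code below is purely illustrative (all class and method names are invented for this example; it is not Hadoop's actual implementation): a stand-in JobTracker holds a queue of pending map tasks and piggybacks at most one assignment on each heartbeat response, as long as the reporting TaskTracker has a free map slot.

import java.util.ArrayDeque;
import java.util.Queue;

public class HeartbeatSketch {

    // Stands in for the JobTracker: keeps a queue of pending map tasks.
    static class JobTrackerStub {
        private final Queue<String> pendingMapTasks = new ArrayDeque<>();

        JobTrackerStub() {
            pendingMapTasks.add("map task for input split 0");
            pendingMapTasks.add("map task for input split 1");
        }

        // A heartbeat carries the tracker's status (here just its free map
        // slot count); the response carries at most one task assignment.
        String heartbeat(int freeMapSlots) {
            return freeMapSlots > 0 ? pendingMapTasks.poll() : null;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        JobTrackerStub jobTracker = new JobTrackerStub();
        int freeMapSlots = 2; // each TaskTracker has a fixed number of slots

        for (int beat = 1; beat <= 3; beat++) {
            String task = jobTracker.heartbeat(freeMapSlots);
            if (task != null) {
                freeMapSlots--; // the slot stays occupied while the task runs
                System.out.println("heartbeat " + beat + ": assigned " + task);
            } else {
                System.out.println("heartbeat " + beat + ": nothing to do");
            }
            Thread.sleep(100); // the real default interval is about 5 seconds
        }
    }
}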

(5) How does Hadoop run a MapReduce program?

① Link the development environment with Hadoop (for example, connect Eclipse to Hadoop) and run the program directly.
② Package the MapReduce program into a JAR file and run it with the hadoop jar command.
