(1) Basic MapReduce programming to realize word frequency statistics.
① Create two files, wordfile1.txt and wordfile2.txt, and upload them to the input folder "/user/hadoop/input" in HDFS (the folder is initially empty).
The content of wordfile1.txt is as follows:
I love Spark
I love Hadoop
The content of wordfile2.txt is as follows:
Hadoop is good
Spark is fast
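The two input files can be created from the command line. This is a minimal sketch that writes them to the current directory; the tutorial later assumes they live under "/usr/local/hadoop", so change into that directory first if you follow the tutorial's layout.

```shell
# Create the two input files with the sentences from the tutorial.
printf 'I love Spark\nI love Hadoop\n' > wordfile1.txt
printf 'Hadoop is good\nSpark is fast\n' > wordfile2.txt
```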
② Start Eclipse. After it starts, the interface shown in the figure below pops up, prompting you to set the workspace. You can directly accept the default setting "/home/hadoop/workspace" and click the "OK" button. Since the hadoop user is used to log in to the Linux system, the default workspace directory is located under that user's home directory, "/home/hadoop".
③ After Eclipse starts, select the "File–>New–>Java Project" menu to start creating a Java project.
④ Enter the project name "WordCount" in the "Project name" field and select "Use default location", so that all files of this project are saved in the "/home/hadoop/workspace/WordCount" directory. In the "JRE" tab, you can choose the JDK already installed in the current Linux system, such as jdk1.8.0_162. Then, click the "Next>" button at the bottom of the interface to enter the next step of the setup.
⑤ In the next step, you need to load the JAR packages required by the project in this interface; these include the JAR packages related to the Hadoop Java API. These packages are located in the Hadoop installation directory of the Linux system, which for this tutorial is the "/usr/local/hadoop/share/hadoop" directory. Click the "Libraries" tab in the interface, and then click the "Add External JARs…" button on the right side of the interface to pop up the interface shown in the figure below.
⑥ In this interface there is a row of directory buttons (namely "usr", "local", "hadoop", "share", "hadoop", "mapreduce" and "lib"); when you click one of them, the contents of that directory are listed below.
In order to write a MapReduce program, it is generally necessary to add the following JAR packages to the Java project:
a. hadoop-common-3.1.3.jar and hadoop-nfs-3.1.3.jar under the "/usr/local/hadoop/share/hadoop/common" directory;
b. all JAR packages under the "/usr/local/hadoop/share/hadoop/common/lib" directory;
c. all JAR packages under the "/usr/local/hadoop/share/hadoop/mapreduce" directory, excluding the jdiff, lib, lib-examples and sources directories.
⑦ Write the Java application, i.e., WordCount.java. In the "Package Explorer" panel on the left side of the Eclipse working interface (as shown in the figure below), find the project "WordCount" created just now, right-click the project name, and select the "New–>Class" menu in the pop-up menu.
⑧ After selecting the "New–>Class" menu, an interface as shown in the figure below appears. In this interface, you only need to enter the name of the new Java class file in the "Name" field; the name "WordCount" is used here, and the default settings can be kept for everything else. Then, click the "Finish" button in the lower right corner of the interface.
⑨ It can be seen that Eclipse automatically creates a source code file named "WordCount.java" containing the code "public class WordCount{}". Clear the code in the file, and then enter the complete word frequency statistics program code.
(2) Configure the Eclipse environment and run the word frequency statistics program.
(1) Compile and package the program
① To compile and run the code written above, directly click the shortcut button for running programs in the upper part of the Eclipse working interface, then choose "Run as" and select "Java Application", as shown in the figure below.
② Then, the interface shown in the figure below pops up; click the "OK" button in its lower right corner to start running the program.
③ After the program finishes running, the result information is displayed in the "Console" panel at the bottom (as shown in the figure below).
④ Next, you can package the Java application into a JAR package and deploy it to the Hadoop platform to run. Here the word frequency statistics program is placed in the "/usr/local/hadoop/myapp" directory. If the directory does not exist, you can use the following commands to create it.
cd /usr/local/hadoop
mkdir myapp
⑤ In the "Package Explorer" panel on the left side of the Eclipse working interface, right-click the project name "WordCount" and select "Export" in the pop-up menu, as shown in the figure below.
⑥ Then the interface shown in the figure below pops up; select "Runnable JAR file" in this interface.
⑦ Then, click the "Next>" button, and the interface shown in the figure below pops up. In this interface, "Launch configuration" is used to set the main class to run when the generated JAR package is deployed and started; you need to select the class "WordCount-WordCount" configured just now in the drop-down list. In "Export destination", you need to set the directory where the JAR package will be output and saved; for example, set it to "/usr/local/hadoop/myapp/WordCount.jar" here. Under "Library handling", select "Extract required libraries into generated JAR".
⑧ Then click the "Finish" button, and the interface shown in the figure below appears.
⑨ You can ignore the information in this interface and directly click the "OK" button in the lower right corner to start the packaging process. After packaging completes, a warning message interface appears, as shown in the figure below.
⑩ You can ignore this information as well and directly click the "OK" button in the lower right corner. At this point, the WordCount project has been successfully packaged into WordCount.jar. You can check the generated WordCount.jar file in the Linux system: execute the following command in a Linux terminal, and you can see that there is already a WordCount.jar file in the "/usr/local/hadoop/myapp" directory.
(2) Running the program
① Before running the program, Hadoop needs to be started.
② After starting Hadoop, you need to delete the input and output directories in HDFS corresponding to the current Linux user hadoop (namely the "/user/hadoop/input" and "/user/hadoop/output" directories in HDFS), so as to ensure that no problems occur in the subsequent program run.
③ Then, create in HDFS a new input directory corresponding to the current Linux user hadoop, that is, the "/user/hadoop/input" directory.
④ Then, upload the two newly created files wordfile1.txt and wordfile2.txt in the local Linux file system (the two files are located in the "/usr/local/hadoop" directory and contain some English sentences) to the "/user/hadoop/input" directory in HDFS.
⑤ If the directory "/user/hadoop/output" already exists in HDFS, use the following command to delete it.
⑥ Now you can run the program in the Linux system with the hadoop jar command. After the command is executed and the run ends successfully, information similar to the following will be displayed on the screen.
⑦ At this time, the word frequency statistics results have been written into the "/user/hadoop/output" directory in HDFS. Execute the following command, and the word frequency statistics results will be displayed on the screen.
So far, the word frequency statistics program has run successfully. Note that if you want to run WordCount.jar again, you need to delete the output directory in HDFS first, otherwise an error will be reported.
(3) Write a MapReduce program to calculate the average score.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Score {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        // Implement the map function
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Convert the input plain-text data to a String
            String line = value.toString();
            // First split the input by line
            StringTokenizer tokenizerArticle = new StringTokenizer(line, "\n");
            // Process each line separately
            while (tokenizerArticle.hasMoreElements()) {
                // Split each line by whitespace
                StringTokenizer tokenizerLine = new StringTokenizer(tokenizerArticle.nextToken());
                String strName = tokenizerLine.nextToken();  // student name field
                String strScore = tokenizerLine.nextToken(); // score field
                Text name = new Text(strName);
                int scoreInt = Integer.parseInt(strScore);
                // Emit the name and the score
                context.write(name, new IntWritable(scoreInt));
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        // Implement the reduce function
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            int count = 0;
            Iterator<IntWritable> iterator = values.iterator();
            while (iterator.hasNext()) {
                sum += iterator.next().get(); // accumulate the total score
                count++;                      // count the number of subjects
            }
            int average = sum / count; // compute the average score (integer division)
            context.write(key, new IntWritable(average));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "localhost:9000" needs to be adjusted to the actual environment
        conf.set("mapred.job.tracker", "localhost:9000");
        // Input and output directories in the HDFS file system
        String[] ioArgs = new String[] { "input/score", "output" };
        String[] otherArgs = new GenericOptionsParser(conf, ioArgs).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: Score Average <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "Score Average");
        job.setJarByClass(Score.class);
        // Set the Map and Reduce classes. Note: Reduce is NOT registered as a
        // combiner here, because averaging partial averages would give wrong results.
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // Set the output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Split the input data set into splits and provide a RecordReader implementation
        job.setInputFormatClass(TextInputFormat.class);
        // Provide a RecordWriter implementation responsible for writing the output
        job.setOutputFormatClass(TextOutputFormat.class);
        // Set the input and output directories
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
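The reducer's averaging logic can be sanity-checked without a Hadoop cluster. This standalone sketch (class name and sample scores are made up for illustration) mirrors the `sum / count` computation above, including its integer division:

```java
// Hypothetical standalone sketch of the reducer's averaging logic.
public class ScoreAverageSketch {
    // Returns the integer average of the given scores, as the reducer does.
    public static int average(int[] scores) {
        int sum = 0;
        for (int s : scores) {
            sum += s; // accumulate the total score
        }
        return sum / scores.length; // integer division truncates the fraction
    }

    public static void main(String[] args) {
        // Sample scores for one student (made-up data): 92 + 87 + 75 = 254
        int[] scores = {92, 87, 75};
        System.out.println(average(scores)); // 254 / 3 truncates to 84
    }
}
```

This also shows why the reduce class must not be reused as a combiner for averages: avg(avg(92, 87), 75) is not avg(92, 87, 75).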
(4) What is the working principle of MapReduce? Analyze the working principle from the perspectives of the Client, the JobTracker and the TaskTracker.
First of all, the client needs to write the MapReduce program and configure the job; the next step is to submit the job. Submitting the job means informing the JobTracker to run the job, and at this point the JobTracker returns a new job ID to the client. A check operation is then performed: whether the output directory exists (if it already exists, the job cannot run normally and an error is thrown to the client), and whether the input directory exists (if it does not exist, an error is also thrown). If the input exists, the input splits (Input Split) are computed from it, and an error is thrown if the splits cannot be computed. After all of this is done, the resources needed by the job are prepared. Having obtained the job ID, the client copies the resource files needed to run the job to HDFS, including the JAR file containing the packaged MapReduce program, the configuration files, and the computed input split information. These files are stored in a folder that the JobTracker creates specifically for the job, named after the job ID. The JAR file has 10 replicas by default (controlled by the mapred.submit.replication property); the input split information tells the JobTracker how many map tasks should be started for this job, among other things. When the resource folder has been created, the client submits the job, informing the JobTracker that the required resources have been written to HDFS and asking it to execute the job.
After the resources are allocated and the JobTracker receives the submission request, the job is initialized. Initialization mainly means putting the job into an internal queue, where it waits for the job scheduler to schedule it. When the job scheduler schedules the job according to its own scheduling algorithm, it creates a running job object (encapsulating the tasks and bookkeeping information) so that the JobTracker can track the status and progress of the job. When creating the job object, the job scheduler obtains the input split information from the HDFS folder, creates one map task for each input split, and assigns the map tasks to TaskTrackers for execution. For map and reduce tasks, a TaskTracker has a fixed number of map slots and reduce slots determined by the number of host cores and the size of its memory. It should be emphasized here that map tasks are not randomly assigned to a TaskTracker; this involves data locality, which is discussed later.
The next step is task assignment. The TaskTracker runs a simple loop that periodically sends heartbeats to the JobTracker; the heartbeat interval is 5 seconds by default, and the programmer can configure this time. The heartbeat is the bridge between the JobTracker and the TaskTracker: through it, the JobTracker can learn the processing status and problems of the TaskTracker, and the TaskTracker can obtain operation instructions from the return value of the heartbeat. After a task is assigned to it, the TaskTracker obtains the resources needed to run the task, such as the code, and prepares for the actual execution; then the task is executed. During task execution, the JobTracker can monitor the status and progress of the TaskTrackers through the heartbeat mechanism and at the same time compute the status and progress of the entire job, while each TaskTracker also monitors its own status and progress locally.
Every so often, the TaskTracker sends a heartbeat to the JobTracker to tell it that it is still running; the heartbeat also carries a lot of information, such as the completion progress of the current map task. When the JobTracker receives the notification from the last TaskTracker that the specified task has completed successfully, it sets the status of the entire job to successful. When the client then queries the job's running status (note: this is an asynchronous operation), it learns that the job has completed. If the job fails halfway, MapReduce has corresponding mechanisms to deal with it. Generally speaking, unless the bug is in the programmer's own program, the MapReduce error-handling mechanism can ensure that the submitted job completes normally.
(5)
How does Hadoop run a MapReduce program?
① Link the development software with Hadoop (for example, link Eclipse with Hadoop) and run the program directly.
② Package the MapReduce program into a JAR file and run it with the hadoop jar command.