Hadoop编程实践

以下是我云计算实验的作业，完成作业的过程中碰到了许多问题，但是最后都一一解决了，这个过程蛮痛苦的，但是完成的一瞬间如释重负，有问题欢迎大家与我交流！

（1）每人在自己本地电脑上正确安装和运行伪分布式Hadoop系统。

（2）安装完成后，自己寻找一组英文网页数据，在本机上运行Hadoop系统自带的WordCount可执行程序文件，并产生输出结果。

（3）实现并测试矩阵相乘程序（选做）

1、安装虚拟机软件VMware Workstation，并进入网址https://ubuntu.com/download/desktop下载Ubuntu镜像。

2、使用指令sudo useradd -m lyd40213 -s /bin/bash创建用户lyd40213，使用指令sudo passwd lyd40213 为用户设置密码。使用指令sudo adduser lyd40213 sudo增加管理员权限。之后注销用户，使用lyd40213登录。

3、根据CSDN博客学习安装VMware Tools后，点击VMware Workstation窗口左上方的“虚拟机”、“设置”，弹出的对话框中依次点击“选项”、“共享文件夹”、“下一步”，新建共享文件夹，使Windows系统与虚拟机内的Ubuntu可以进行交互。

4、将在以下网址提前下载好的jdk和hadoop的安装包放到本机的共享文件夹中，使用指令ls我们可以看到两个安装包，如图所示。

https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

https://mirrors.cnnic.cn/apache/hadoop/common/hadoop-2.10.1/

5、使用指令sudo apt-get update更新apt-get。

6、使用指令apt-get install openssh-server，安装ssh服务，之后使用指令apt-get install vim ，安装vim编辑器。

7、依次使用指令ufw disable 、ufw status来禁用防火墙。之后cd ~/.ssh、ssh-keygen -t rsa两条语句，生成免密密钥对。使用ssh-copy-id localhost指令将公钥复制到localhost，未成功，之后使用cat ./id_rsa.pub >> ./authorized_keys，完成复制。使用ssh localhost完成测试，不需要登录密码，如图。之后退出，进行jdk和hadoop的配置。

8、首先配置jdk，使用mkdir -/app创建/app文件夹，作为jdk和hadoop的安装文件夹。之后对共享文件夹内的压缩包进行解压，使用tar -zxvf /mnt/hgfs/vmshare/jdk-8u271-linux-x64.tar.gz -C ~/app指令，之后使用mv ~/app/jdk1.8.0_271 ~/app/jdk指令进行更名。

9、配置文件：vim编辑器

指令vim ~/.bashrc进行编辑，将以下代码添加到文件尾。

export JAVA_HOME=/home/lyd40213/app/jdk
　　　　　export JRE_HOME=$JAVA_HOME/jre
　　　　　export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
　　　　　export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin

指令source ~/.bashrc，文件编译成功，测试jdk，如图1.8.0.271

10、配置hadoop，使用tar -zxvf /mnt/hgfs/vmshare/hadoop-2.10.1.tar.gz -C ~/app 进行解压，之后使用指令mv ~/app/hadoop-2.10.1 ~/app/hadoop进行改名。之后换到超级模式使用指令sudo chown -R lyd40213 /home/lyd40213/app/hadoop修改权限。

11、使用指令vim ~/.bashrc，修改文件，在文件尾添加如下两条指令。使用指令source ~/.bashrc 进行编译生效。

export HADOOP_HOME=/home/lyd40213/app/hadoop

export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

12.、使用指令vim ~/app/hadoop/etc/hadoop/hadoop-env.sh，找到export JAVA_HOME=${JAVA_HOME}这一行，将其修改为：export JAVA_HOME=/home/lyd40213/app/jdk。

13、配置core-site.xml，直接在目录/app/hadoop/etc/hadoop/中找到它，进行修改，

将

<configuration>

</configuration>

更改为

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/lyd40213/app/hadoop/tmp</value> 
</property>
</configuration>

配置hdfs-site.xml

将

<configuration>

</configuration>

更改为

<configuration>
<property>


<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

配置mapred-site.xml

将

<configuration>

</configuration>

更改为

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

配置yarn-site.xml

将

<configuration>

</configuration>

更改为

<configuration>

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

14、运行hadoop，初始化：先cd ~/app/hadoop/bin，之后./hadoop namenode -format。使用指令cd ~/app/hadoop/sbin和./start-all.sh运行hadoop。jps查看6个进程。如图：

15、测试wordcount程序，使用以下指令

mkdir ~/tmp

echo'In the physical sciences, progress in understanding large complex systems has often come by approximating their constituents with random variables; for example, statistical physics and thermodynamics are based in this paradigm. Since modern neural networks are undeniably large complex systems, it is natural to consider what insights can be gained by approximating their parameters with random variables. Moreover, such random configurations play at least two privileged roles in neural networks: they define the initial loss surface for optimization, and they are closely related to random feature and kernel methods. Therefore it is not surprising that random neural networks have attracted significant attention in the literature over the years' > ~/tmp/word1.txt

echo'Throughout this work we will be relying on a number of basic concepts from random matrix theory. Here we provide a lightning overview of the essentials, but refer the reader to the more pedagogical literature for background' > ~/tmp/word2.txt

之后，指令./hdfs dfs -mkdir /input，在hdfs上新建目录，指令./hdfs dfs -put ~/tmp/word*.txt /input上传文件，指令./hadoop jar ~/app/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount /input output运行wordcount程序。跑通程序如图：

16、指令 ./hdfs dfs -cat /user/lyd40213/output/part-r-00000查看结果。

17、在hadoop根目录下新建一个叫做local_matrix的文件夹，之后将MartrixMultiplication.java放在里面，MartrixMultiplication.java内容为：

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MartrixMultiplication{
  public static class MartrixMapper extends Mapper<Object, Text, Text, Text>{
    private Text map_key = new Text();
    private Text map_value = new Text();
    int rNumber = 300;
    int cNumber = 500;
    String fileTarget;
    String i, j, k, ij, jk;
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

      
        String eachterm[] = value.toString().split("#");
        fileTarget = eachterm[0];
        if(fileTarget.equals("M")){
          i = eachterm[1];
          j = eachterm[2];
          ij = eachterm[3];
          for(int c = 1; c<=cNumber; c++){
              map_key.set(i + "#" + String.valueOf(c));
              map_value.set("M" + "#" + j + "#" + ij);
              context.write(map_key, map_value);
          } 
        }else if(fileTarget.equals("N")){
          j = eachterm[1];
          k = eachterm[2];
          jk = eachterm[3];
          for(int r = 1; r<=rNumber; r++){
              map_key.set(String.valueOf(r) + "#" +k);
              map_value.set("N" + "#" + j + "#" + jk);
              context.write(map_key, map_value);
          }
        }
    }
  } 
  public static class MartrixReducer extends Reducer<Text,Text,Text,Text> {
    private Text reduce_value = new Text();
    int jNumber = 150;
    int M_ij[] = new int[jNumber+1];
    int N_jk[] = new int[jNumber+1];
    int j, ij, jk;
    String fileTarget;
    int jsum = 0; 
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
      jsum = 0; 
      for (Text val : values) {
        String eachterm[] = val.toString().split("#");
        fileTarget = eachterm[0];
        j = Integer.parseInt(eachterm[1]);
        if(fileTarget.equals("M")){
      	  ij = Integer.parseInt(eachterm[2]);
      	  M_ij[j] = ij;
        }else if(fileTarget.equals("N")){
      	  jk = Integer.parseInt(eachterm[2]);

      	  N_jk[j] = jk;
        } 
      } 
      for(int d = 1; d<=jNumber; d++){
    	 jsum +=  M_ij[d] * N_jk[d];
      } 
      reduce_value.set(String.valueOf(jsum));
      context.write(key, reduce_value); 
    }
  }
  public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
      if (otherArgs.length != 2) {
	  System.err.println("Usage: MartrixMultiplication <in> <out>");
	  System.exit(2);
      } 
      Job job = new Job(conf, "martrixmultiplication");
      job.setJarByClass(MartrixMultiplication.class);
      job.setMapperClass(MartrixMapper.class);
      job.setReducerClass(MartrixReducer.class); 
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class); 
      FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
      FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); 
      System.exit(job.waitForCompletion(true) ? 0 : 1); 
  } 
}

生成数据M.data放在本地一个叫做input2的文件夹里面，M.data为50*50矩阵，太大了，就不放在这里了，可以私信我或加我QQ1320496612要。

使用指令(两行为同一句)

javac -classpath share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.10.1.jar:share/hadoop/common/hadoop-common

2.10.1.jar:share/hadoop/common/lib/commons-cli-1.2.jar -d local_matrix -Xlint:deprecation local_matrix/MartrixMultiplication.java

生成class，如图：

18、指令jar -cvf local_matrix/MartrixMultiplication.jar -C local_matrix/ . 打包jar。

19、之后执行以下三条指令，上传运行，并下载：

./hadoop fs -copyFromLocal ../input2 /in2

./hadoop jar ../local_matrix/MartrixMultiplication.jar MartrixMultiplication /in2 out2

./hadoop fs -get out2 output2

20、指令cat output2/*查看结果

三、遇到的问题及解决方法

1.采用Oracel VM VirtalBox虚拟机出现问题，换用VMware Workstation虚拟机。开始是11.0的VMware Workstation，安装系统时总会卡在一个界面，之后查询资料后，更换了16.0的版本，解决了这个问题。

2.安装VM Tools出现“没有足够的空间以提取…”的情况。首先，右键使用归档管理器打开，之后，右键文件，选择提取，最后，提取到桌面。然后解决了这个问题。

3.root账号无法登录SSH问题- Permission denied, please try again. 此时，系统默认禁止root用户登录ssh。解决办法：输入su - 后，输入 vi /etc/ssh/sshd_config 编辑sshd_config文件，找到：

# Authentication:
#LoginGraceTime 120
#PermitRootLogin without-password
#PermitRootLogin yes
#StrictModes yes

改为：

# Authentication:
LoginGraceTime 120
#PermitRootLogin without-password
PermitRootLogin yes
StrictModes yes

保存退出。PermitRootLogin意思为允许免密登录。

4.无法正确启动Hadoop，发现四个文件core-site.xml、hdfs-site.xml、mapred-site.xml、yarn-site.xml缩进存在问题，前面的不是空格，为非法字符，修改缩进后，成功运行Hadoop。

5.jps查看进程缺少Namenode节点，通过stop-all.sh、hadoop namenode -format、start-all.sh重新格式化Namenode节点完美解决。

6.jps查看进程缺少Datanode节点，原因为多次的namenode格式化，使name和data的ID号变化，造成启动失败。解决办法：清除dfs下所有文件，cd /user/local/hadoop/tmp/dfs指令后，rm -r *。再之后cd /usr/local/hadoop指令后，hdfs namenode -format指令重新格式化namenode。重启./sbin/start-dfs.sh后，jps发现Datanode出现。

7.出现问题Name node is in safe mode，Namenode处于安全模式，采用指令hadoop dfsadmin -safemode leave解决问题，出现Safe mode is OFF，成功解除安全模式。

8.运行mapreduce，出现卡在map 100% reduce 0%这个位置的情况，通过虚拟机->设置->增大系统的内存完美解决问题。

9.在执行生成class指令时，出现*.java使用或覆盖了已过时的API错误，添加-Xlint:deprecation改为javac -Xlint:deprecation *.java 重新编译，解决问题。

10.服务器的in2文件夹和out2文件夹总是出错，通过多次增删文件夹解决了问题。

猜你喜欢