Why Use Hadoop

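When a large volume of data has to go through a CPU-intensive computation and the result is needed quickly, the traditional approach (e.g. running concurrent threads on a single node to make full use of its CPUs) is not fast enough. At that point the work has to be split into multiple processes spread across multiple nodes, so that many CPUs execute it in parallel and the CPU-intensive computation finishes quickly.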

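What problems it solves:

HDFS (Hadoop Distributed File System): stores large volumes of data across the nodes of the cluster.

MapReduce (a distributed data analysis model): you write a program that follows the model and hand it to yarn, which schedules it onto the nodes.

yarn (resource management and scheduling): distributes the jar to the nodes and requests a certain amount of resources (called a container) in which the jar runs.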
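To make the MapReduce model above more concrete, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to log lines. It uses no Hadoop classes at all; the class name `LogLevelCount` and the assumption that each log line starts with its level (e.g. `ERROR`) are made up for illustration. In a real job the map and reduce steps would run in separate containers on different nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process illustration of the map -> shuffle -> reduce flow (no Hadoop classes involved). */
final class LogLevelCount {

    /** Map stage: emit one (level, 1) pair per non-blank line; the level is assumed to be the first token. */
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String level = trimmed.split("\\s+")[0];
            pairs.add(new AbstractMap.SimpleEntry<>(level, 1));
        }
        return pairs;
    }

    /** Shuffle stage: group all emitted values by key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce stage: sum the grouped values of each key. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        grouped.forEach((level, ones) ->
                totals.put(level, ones.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    /** Runs the three stages in order. */
    static Map<String, Integer> run(List<String> lines) {
        return reduce(shuffle(map(lines)));
    }
}
```

Calling `LogLevelCount.run(...)` with a list of log lines returns the number of lines per level, sorted by level name.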

A concrete application scenario:

Analyzing massive log files.

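Diagram of the HDFS data write process:

NameNode: the management node; it stores the information about where each file is located on the DataNodes.

DataNode: a worker node; it stores the pieces that files are split into.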
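The division of labor between the two roles can be sketched with a toy model: the "NameNode" only keeps a table from file path to block locations, while the block contents would live on the "DataNodes". Everything below is hypothetical and is not HDFS code: the class `MiniNameNode`, its methods, and the round-robin replica placement are assumptions made for illustration (real HDFS uses its own placement policy).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy NameNode: records which DataNodes hold each block of a file. It stores no file data itself. */
final class MiniNameNode {
    private final List<String> dataNodes;
    private final long blockSize;
    private final int replication;
    // file path -> one list of DataNode names per block
    private final Map<String, List<List<String>>> blockLocations = new LinkedHashMap<>();

    MiniNameNode(List<String> dataNodes, long blockSize, int replication) {
        if (blockSize <= 0 || replication < 1 || replication > dataNodes.size()) {
            throw new IllegalArgumentException("invalid block size or replication factor");
        }
        this.dataNodes = new ArrayList<>(dataNodes);
        this.blockSize = blockSize;
        this.replication = replication;
    }

    /** Splits a file of the given length into blocks and assigns replicas round-robin (an assumed policy). */
    List<List<String>> addFile(String path, long fileLength) {
        int blockCount = (int) ((fileLength + blockSize - 1) / blockSize); // ceiling division
        List<List<String>> locations = new ArrayList<>();
        for (int block = 0; block < blockCount; block++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(dataNodes.get((block + r) % dataNodes.size()));
            }
            locations.add(replicas);
        }
        blockLocations.put(path, locations);
        return locations;
    }

    /** Looks up where the blocks of a file are stored; returns an empty list for unknown files. */
    List<List<String>> locate(String path) {
        return blockLocations.getOrDefault(path, new ArrayList<>());
    }
}
```

A client that wants to read a file would first ask `locate(path)` and then fetch each block from one of the DataNodes listed for it.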

Spring Boot utility class for working with hdfs (source code: https://gitee.com/SnailPu/springBootDemo):

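import org.apache.commons.lang3.StringUtils; // assumption: the original does not show which StringUtils it imports
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

/**
 * Operating on hdfs from Windows can fail with org.apache.hadoop.security.AccessControlException
 * because of the local Windows user; in that case the hdfs permissions need to be configured.
 * Reference article: https://blog.csdn.net/wang7807564/article/details/74627138
 */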
@Component
public class HdfsUtils {

    @Value("${hdfs.path}")
    private String hdfsPath;
    @Value("${hdfs.username}")
    private String hdfsUsername;
    private static final int bufferSize = 1024 * 1024 * 64;

    /**
     * Builds the HDFS client configuration.
     */
    private Configuration getConfiguration() {
        Configuration configuration = new Configuration();
        // Set fs.defaultFS (the parameter from Hadoop's core-site) to prevent the ...file///... error
        configuration.set("fs.defaultFS", hdfsPath);
        return configuration;
    }

    /**
     * Returns the HDFS FileSystem object.
     */
    public FileSystem getFileSystem() throws Exception {
        // The client acts on hdfs under a user identity. By default the hdfs client API
        // takes that identity from a JVM parameter, e.g.:
        // -DHADOOP_USER_NAME=hadoop
        // It can also be passed in as an argument when constructing the client fs object:
//        FileSystem fileSystem = FileSystem.get(new URI(hdfsPath), getConfiguration(), hdfsUsername);
        return FileSystem.get(getConfiguration());
    }

    /**
     * Prefixes a path with the configured HDFS root to form a full HDFS path.
     *
     * @param path path to append to the HDFS root
     */
    public String pathInHdfs(String path) {
        return hdfsPath + path;
    }

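    /**
     * Creates a directory.
     *
     * @param path directory path, relative to the configured HDFS root
     * @return the result reported by FileSystem#mkdirs
     * @throws Exception if the HDFS call fails
     */
    public boolean mkdir(String path) throws Exception {
        FileSystem fs = getFileSystem();
        try {
            return fs.mkdirs(new Path(pathInHdfs(path)));
        } finally {
            // Close the fs here, as the other methods that obtain their own fs do
            fs.close();
        }
    }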

    /**
     * Checks whether an HDFS file or directory exists, using a newly obtained fs.
     *
     * @param path path relative to the configured HDFS root
     * @return true if the path exists; false if it does not or if path is empty
     * @throws Exception if the HDFS call fails
     */
    public boolean exits(String path) throws Exception {
        if (StringUtils.isEmpty(path)) {
            return false;
        }
        FileSystem fs = getFileSystem();
        try {
            return fs.exists(new Path(pathInHdfs(path)));
        } finally {
            fs.close();
        }
    }

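    /**
     * Checks whether an HDFS file or directory exists, using an fs passed in by the caller.
     * The fs is not closed here; the calling method is responsible for closing it.
     * Overload of exits.
     *
     * @param path path relative to the configured HDFS root
     * @param fs   an open FileSystem owned by the caller
     * @return true if the path exists; false if it does not or if path is empty
     * @throws Exception if the HDFS call fails
     */
    public boolean exits(String path, FileSystem fs) throws Exception {
        if (StringUtils.isEmpty(path)) {
            return false;
        }
        return fs.exists(new Path(pathInHdfs(path)));
    }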

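    /**
     * Deletes an HDFS file or directory.
     *
     * @param path path relative to the configured HDFS root
     * @return true if the path was deleted; false if path is empty or does not exist
     * @throws Exception if the HDFS call fails
     */
    public Boolean deleteFile(String path) throws Exception {
        if (StringUtils.isEmpty(path)) {
            return false;
        }
        FileSystem fs = getFileSystem();
        try {
            // Check inside the try block so the fs is closed even when the path does not exist
            if (!exits(path, fs)) {
                return false;
            }
            Path srcPath = new Path(pathInHdfs(path));
            // delete(path, true) removes the path right away (recursively for directories);
            // deleteOnExit would only mark it for deletion when the fs is closed
            return fs.delete(srcPath, true);
        } finally {
            fs.close();
        }
    }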
}

Rough flow through the source code when obtaining the fileSystem:
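One part of that flow that matters for the utility class above is caching: unless caching is disabled, `FileSystem.get` hands back a shared instance looked up by URI scheme, authority, and user, so closing it affects every caller holding the same object. The sketch below is a toy model of that idea only. `FsCacheSketch` and `FakeFs` are hypothetical names, and the code is a simplification written for illustration, not Hadoop's actual implementation.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a client cache keyed by (scheme, authority, user); not Hadoop's real code. */
final class FsCacheSketch {

    /** Hypothetical stand-in for a FileSystem client. */
    final class FakeFs {
        private final List<String> key;
        private boolean closed;

        FakeFs(List<String> key) {
            this.key = key;
        }

        boolean isClosed() {
            return closed;
        }

        /** Closing marks the client unusable and evicts it from the cache. */
        void close() {
            closed = true;
            cache.remove(key, this);
        }
    }

    private final Map<List<String>, FakeFs> cache = new HashMap<>();

    /** Returns the cached client for this URI and user, creating it on first use. */
    FakeFs get(URI uri, String user) {
        // The path is deliberately not part of the key: /a and /b on the same cluster share a client
        List<String> key = Arrays.asList(uri.getScheme(), uri.getAuthority(), user);
        return cache.computeIfAbsent(key, FakeFs::new);
    }
}
```

In this model, two lookups for the same cluster and user return the same object, so a `close()` in one method also closes the instance another thread may still be using. That is worth keeping in mind when a helper closes the fs in a `finally` block.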

Workflow of submitting a Job in MapReduce:

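ResourceManager: responsible for managing cluster resources and for scheduling and registering Jobs.
NodeManager: monitors the resource usage of the containers that execute a Job and reports it to the ResourceManager.
A yarn cluster runs ResourceManager and NodeManager processes, which together schedule and allocate resources (containers: hardware resources and file resources). yarn is designed this way so that it can host more kinds of computation, such as MapReduce, Spark, and Storm.
MapReduce is responsible for actually running the program; the MRAppMaster decides which machines run the map or reduce tasks.
While a job is being submitted and run, the RunJar, MRAppMaster, and YarnChild processes appear in that order.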
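The resource side of this flow can be illustrated with a toy allocator: NodeManagers register their memory, and each container request is placed on a node that still has room. This is only a sketch under made-up assumptions. `MiniResourceManager`, its methods, and the first-fit placement are hypothetical, and real yarn schedulers work with queues and richer resource types.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Toy ResourceManager: tracks free memory per node and places containers first-fit. */
final class MiniResourceManager {
    // node name -> free memory in MB, in registration order
    private final Map<String, Integer> freeMemoryMb = new LinkedHashMap<>();

    /** A NodeManager reports how much memory it can offer. */
    void registerNode(String nodeName, int memoryMb) {
        freeMemoryMb.put(nodeName, memoryMb);
    }

    /** Places a container on the first node with enough free memory; empty if no node fits. */
    Optional<String> allocate(int memoryMb) {
        for (Map.Entry<String, Integer> node : freeMemoryMb.entrySet()) {
            if (node.getValue() >= memoryMb) {
                node.setValue(node.getValue() - memoryMb); // reserve the memory for the container
                return Optional.of(node.getKey());
            }
        }
        return Optional.empty();
    }

    /** Free memory left on a node, or 0 for an unknown node. */
    int freeMemoryOn(String nodeName) {
        return freeMemoryMb.getOrDefault(nodeName, 0);
    }
}
```

In terms of the process order described above, the first `allocate` call would correspond to the container for the MRAppMaster, and later calls to the containers in which YarnChild processes run the map or reduce tasks.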

For an introduction to the yarn resource scheduler queues and their configuration, see: http://itxw.net/article/376.html