Understanding Hadoop's Architecture and What Each Part Does

Hadoop

Official introduction
http://hadoop.apache.org/

The Apache™ Hadoop® project develops open-source software for
reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver
high-availability, the library itself is designed to detect and handle
failures at the application layer, so delivering a highly-available
service on top of a cluster of computers, each of which may be prone
to failures.

In short, Hadoop is an open-source software library that is:

  • Reliable
  • Scalable
  • Distributed

Main modules

  • HDFS (distributed storage)
  • MapReduce (distributed computation)
  • YARN (resource management and job scheduling)

HDFS

Introduction
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/.

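Key points

  • Highly fault-tolerant
  • Runs on low-cost commodity hardware
  • High-throughput access to data
  • Suited to applications with large data sets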

HDFS architecture
[Figure: HDFS architecture diagram]

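  • NameNode
    • In the basic architecture an HDFS cluster has a single NameNode (high-availability and federation setups add more).
    • It manages the file system namespace and regulates client access to files.
  • DataNode
    • A cluster has many DataNodes.
    • There is usually one DataNode per node in the cluster.
    • DataNodes serve client requests to read, write, and delete file data, and create and delete blocks as instructed by the NameNode.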
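
The split between NameNode metadata and DataNode storage can be illustrated with a small in-memory model. This is only a sketch, not HDFS code: the class names `NameNode` and `DataNode`, the 4-byte block size, the replication factor of 2, and the round-robin placement are all assumptions made for the example. Real HDFS uses much larger blocks and rack-aware placement, and clients stream data to DataNodes directly rather than through the NameNode.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Toy DataNode: stores block contents keyed by block ID."""
    name: str
    blocks: dict = field(default_factory=dict)


class NameNode:
    """Toy NameNode: keeps only metadata (file -> blocks -> DataNodes)."""

    def __init__(self, datanodes, block_size=4, replication=2):
        self.datanodes = datanodes
        self.block_size = block_size    # bytes per block (tiny, for illustration)
        self.replication = replication  # number of copies kept of each block
        self.namespace = {}             # path -> list of block IDs
        self.locations = {}             # block ID -> list of DataNode names
        self._next_id = 0

    def write(self, path, data):
        """Split data into blocks and place each block on several DataNodes."""
        block_ids = []
        for offset in range(0, len(data), self.block_size):
            block_id = self._next_id
            self._next_id += 1
            chunk = data[offset:offset + self.block_size]
            # Round-robin placement (an assumption of this sketch).
            targets = [
                self.datanodes[(block_id + i) % len(self.datanodes)]
                for i in range(self.replication)
            ]
            for node in targets:
                node.blocks[block_id] = chunk
            self.locations[block_id] = [node.name for node in targets]
            block_ids.append(block_id)
        self.namespace[path] = block_ids

    def read(self, path, failed=()):
        """Reassemble a file, skipping DataNodes listed as failed."""
        by_name = {node.name: node for node in self.datanodes}
        parts = []
        for block_id in self.namespace[path]:
            alive = [n for n in self.locations[block_id] if n not in failed]
            if not alive:
                raise IOError(f"block {block_id} has no live replica")
            parts.append(by_name[alive[0]].blocks[block_id])
        return b"".join(parts)


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
namenode = NameNode(nodes)
namenode.write("/demo.txt", b"hello hadoop")
print(namenode.locations)                          # which DataNodes hold each block
print(namenode.read("/demo.txt", failed={"dn1"}))  # still readable without dn1
```

Because every block has a second copy on another DataNode, the file can still be read when one DataNode is marked as failed, which is the idea behind the fault tolerance listed above.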

YARN

[Figure: YARN architecture diagram]
Introduction
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

  • YARN separates resource management from job scheduling/monitoring and runs them as separate daemons.

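Components

  • ResourceManager: has two main components, the Scheduler and the ApplicationsManager
    • Scheduler: allocates resources to the running applications, subject to constraints such as capacities and queues.
    • ApplicationsManager: accepts job submissions, negotiates the first container for running each application's ApplicationMaster, and restarts that container if it fails.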
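
The division of labour inside the ResourceManager can be sketched in a few lines. This is a toy model under stated assumptions, not the YARN API: the class and method names, the memory-only resource model, and the 512 MB size of the first container are all invented for illustration.

```python
class Scheduler:
    """Toy Scheduler: hands out memory from a fixed cluster-wide pool."""

    def __init__(self, total_memory_mb):
        self.free_mb = total_memory_mb
        self.granted = {}  # application ID -> total MB granted

    def allocate(self, app_id, memory_mb):
        """Grant a container if enough memory is free, otherwise refuse."""
        if memory_mb > self.free_mb:
            return False
        self.free_mb -= memory_mb
        self.granted[app_id] = self.granted.get(app_id, 0) + memory_mb
        return True

    def release(self, app_id):
        """Return everything granted to an application to the pool."""
        self.free_mb += self.granted.pop(app_id, 0)


class ApplicationsManager:
    """Toy ApplicationsManager: accepts submissions and asks the Scheduler
    for the first container, the one that would host the ApplicationMaster."""

    def __init__(self, scheduler, am_memory_mb=512):
        self.scheduler = scheduler
        self.am_memory_mb = am_memory_mb  # hypothetical size of the first container
        self.running = {}                 # application ID -> application name
        self._submitted = 0

    def submit(self, name):
        self._submitted += 1
        app_id = f"app_{self._submitted:04d}"
        if not self.scheduler.allocate(app_id, self.am_memory_mb):
            return None  # rejected: not enough free memory
        self.running[app_id] = name
        return app_id

    def finish(self, app_id):
        self.running.pop(app_id, None)
        self.scheduler.release(app_id)


scheduler = Scheduler(total_memory_mb=1024)
manager = ApplicationsManager(scheduler)
first = manager.submit("wordcount")
second = manager.submit("sort")
third = manager.submit("join")  # no memory left, so this one is refused
print(first, second, third)
manager.finish(first)           # frees 512 MB for later submissions
print(scheduler.free_mb)
```

The Scheduler here only decides who gets resources; it never runs or monitors anything. Tracking the work of a running application is left to that application's own ApplicationMaster, which this sketch does not model.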

MapReduce

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

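  • A computation framework for processing data across a cluster.
  • In Hadoop 1.x, MapReduce also handled resource management; Hadoop 2.x split that responsibility out into YARN to decouple the two.
  • In many workloads Spark has since replaced MapReduce as the processing engine.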
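
The programming model behind MapReduce can be shown with a single-process word count. This is a minimal sketch that runs in plain Python and does not use the Hadoop API; the function names `map_phase`, `shuffle`, and `reduce_phase` are chosen for this example. On a real cluster the map and reduce steps run in parallel on many nodes and the framework performs the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["Hadoop stores data", "Hadoop processes data"]))
# {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Each call to `map_phase` depends only on its own line and each call to `reduce_phase` only on its own key, which is what lets a framework spread both steps across many machines.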

Reposted from blog.csdn.net/qq_36325121/article/details/108761575