Big Data (4) Mainstream Big Data Technology
1. A few words up front
To the good girls and boys out there who are being tormented and beaten down:
There are some things we cannot choose, and we cannot always avoid being hurt.
But please remember, at all times:
You may seem worthless to some people, and they may hurt you badly,
but you will always be a priceless treasure in the hearts of others, who will hold you more dear than they hold themselves.
Just be yourself; there is no need to deliberately change anything. The people who love you will love you as you are.
Remember your worth! You live in your own world, and in the hearts of those who love you!
Sober in adversity
2023.8.27
2. Big data technology
Mainstream big data technologies can be divided into two categories.
The first type targets non-real-time, batch-processing business scenarios. It focuses on the storage, processing, analysis, and application of TB- and PB-scale data sets that traditional data processing technologies cannot handle within limited time and space budgets, for example user behavior analysis, order anti-fraud analysis, user churn analysis, and data warehousing. These scenarios do not require real-time responses: typically, an organization extracts the day's data into the big data analysis platform after the close of business, obtains the computed results within a few hours, and uses them for the next day's operations. The mainstream supporting technologies are HDFS, MapReduce, Hive, and others.
The second type targets real-time processing scenarios, such as Weibo-style applications, real-time social networking, and real-time order processing. These scenarios demand fast responses: when a user issues a business request, the system must respond within seconds while guaranteeing data integrity. The mainstream supporting technologies here are HBase, Kafka, Storm, and others.
(1)HDFS
HDFS, the Hadoop Distributed File System, is a core component of the Apache Hadoop project. It is a distributed file system with high availability, high reliability, high scalability, and high fault tolerance. It can store and process large-scale data sets on a cluster of inexpensive commodity machines, achieving parallel processing by distributing both data and computing tasks across multiple nodes.
HDFS is the core subproject of Hadoop and the foundation for data storage and access across the entire Hadoop platform. It is layered on top of the native Linux file system and, in turn, hosts other subprojects such as MapReduce and HBase. It is a distributed file system that is easy to use and manage.
HDFS is a highly fault-tolerant system designed to be deployed on inexpensive machines. It provides high-throughput data access and is well suited to applications with large data sets. HDFS relaxes some POSIX constraints to enable streaming access to file system data.
HDFS technology has the following characteristics:
1. Large-scale storage: HDFS can handle large-scale data sets at the PB level, and supports distributed storage and management of data files.
2. High reliability: HDFS stores data redundantly, distributing replicas across different nodes, so the failure of any single server does not affect data integrity or availability.
3. High scalability: HDFS can run on hundreds of machines and supports dynamic expansion, which is convenient for users to expand with the growth of data volume.
4. High fault tolerance: HDFS tolerates failures by keeping multiple replicas of each data block; if a node goes down, other nodes continue to serve the data.
5. Efficiency: HDFS supports batch reads and writes and provides an efficient data transmission mechanism, enabling fast data transfer and processing within the cluster.
6. High throughput: HDFS is optimized for large sequential reads and writes rather than low-latency random access.
7. Streaming access: HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream.
In short, HDFS technology is one of the necessary technologies to realize big data storage, management, and processing. It can provide efficient and reliable data storage solutions for enterprises in different industries.
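To make the block-and-replica idea above concrete, here is a minimal Python sketch (not the real HDFS implementation or API; block size, replication factor, and node names are illustrative, and real HDFS defaults are 128 MB blocks with 3 replicas placed rack-aware, not round-robin):

```python
BLOCK_SIZE = 4    # bytes; tiny on purpose for demonstration
REPLICATION = 3   # copies of each block, like HDFS's default of 3

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks, as HDFS splits files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list, replication: int = REPLICATION):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello world, hdfs!")
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(blocks), nodes)
# Losing any single node still leaves at least 2 replicas of every block.
```

Because every block lives on three distinct nodes, the toy cluster survives any single-node failure without data loss, which is exactly the redundancy property described in points 2 and 4 above.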
(2)MapReduce
MapReduce is a software framework for processing massive amounts of data in parallel on clusters built from thousands of commodity machines, with high stability and fault tolerance. MapReduce greatly simplifies complex processing logic by abstracting it into Mapper and Reducer classes: once a problem is understood, its logic is recast into a pattern that the MapReduce functions can process.
A MapReduce job divides the input data set into independent splits, which are processed by map tasks in a fully parallel, independent fashion. The framework sorts the outputs of the map tasks, and the sorted data becomes the input of the reduce tasks. Both the input and output of a job are stored in HDFS. The framework handles job scheduling, monitors jobs, and re-executes failed tasks.
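The map/sort/reduce flow just described can be sketched in plain Python. This is a toy word count, the canonical MapReduce example, run in a single process rather than on a cluster; the function names are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in an input split."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort phase: group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reduce phase: aggregate all values emitted for one key."""
    return (key, sum(values))

splits = ["big data is big", "data is valuable"]           # two input splits
mapped = [pair for line in splits for pair in mapper(line)]  # map tasks, independently
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
# result == {'big': 2, 'data': 2, 'is': 2, 'valuable': 1}
```

On a real cluster each split would be mapped on a different node and the shuffle would move data across the network, but the dataflow is the same.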
MapReduce is a distributed computing framework for large-scale data processing.
The MapReduce software architecture can be divided into the following three levels :
♦ Application layer: developers use the MapReduce API to write applications, decomposing a problem into tasks that can be processed in parallel. These tasks fall into two stages: the map stage and the reduce stage. In the map stage, the data is split into small pieces and processed as key-value pairs; the processed data is then grouped and merged by key, and finally the desired result is produced.
♦ Computing layer: MapReduce computing cluster consists of two types of nodes: a Master node and a group of Worker nodes. The Master node is responsible for coordinating the entire computing process, including division of tasks, monitoring of Worker nodes, and data transmission. Worker nodes execute the tasks assigned to them and return the results to the Master node. Each node in the compute layer is either a physical computer or a virtual machine, and all can communicate system-wide.
♦ Storage layer: The MapReduce storage layer uses Hadoop Distributed File System (HDFS) to store large amounts of data. HDFS is a scalable and fault-tolerant file system that can replicate data to different nodes to ensure data reliability. HDFS provides efficient data storage and management methods for MapReduce.
The above are the three levels of MapReduce software architecture. MapReduce enables efficient processing of large-scale data by breaking data into small pieces and distributing tasks among computing nodes.
(3)YARN
Apache Hadoop YARN (Yet Another Resource Negotiator) is a resource management and application scheduling framework introduced in Hadoop 0.23. On top of YARN, many types of applications can run, such as MapReduce, Spark, and Storm. YARN itself no longer manages applications directly: resource management and application management are two loosely coupled modules.
In a sense, YARN is a cloud operating system (Cloud OS). On top of it, programmers can develop a variety of applications, such as batch MapReduce programs, Spark programs, and streaming Storm jobs, all of which can share the data and computing resources of the Hadoop cluster.
YARN (Yet Another Resource Negotiator) is the resource manager of the Hadoop ecosystem. Its main role is to manage and schedule cluster resources in a unified way, allocating them among multiple applications and improving the resource utilization of Hadoop clusters. As a key component of Hadoop 2.0, YARN greatly expanded Hadoop's application scenarios, supporting multiple computing models including MapReduce, Spark, and Storm.
The main functions of YARN include:
1. Resource management: YARN can manage the resources of different nodes in the cluster and allocate resources to different applications to ensure the normal operation of the applications.
2. Scheduling management: YARN schedules cluster resources according to configured policies and the requirements of different applications, ensuring that applications share resources fairly.
3. Application management: YARN can automatically manage the lifecycle of applications, including operations such as application startup, monitoring, restart and shutdown.
4. Security management: YARN can provide powerful security management functions, including user authentication, authorization, and data encryption, to ensure the security and stability of the cluster.
In short, as an important component in the Hadoop ecosystem, YARN provides reliable resource management and scheduling functions for the operation of multiple applications, and is widely used in various industries such as the Internet, finance, and medical care.
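As a rough illustration of the resource-management role described above, here is a toy, heavily simplified sketch of a ResourceManager handing out containers from a fixed cluster capacity. It is hypothetical throughout (real YARN uses pluggable schedulers with queues, NodeManagers, and an ApplicationMaster per job; the class and application names here are invented):

```python
class ResourceManager:
    """Toy FIFO allocator: grant a request if capacity remains, else reject."""

    def __init__(self, total_memory_mb, total_vcores):
        self.free_memory = total_memory_mb
        self.free_vcores = total_vcores
        self.allocations = {}

    def submit(self, app_id, memory_mb, vcores):
        """Grant the container request if it fits; real YARN would queue it instead."""
        if memory_mb <= self.free_memory and vcores <= self.free_vcores:
            self.free_memory -= memory_mb
            self.free_vcores -= vcores
            self.allocations[app_id] = (memory_mb, vcores)
            return True
        return False

    def release(self, app_id):
        """Return a finished application's resources to the pool."""
        memory_mb, vcores = self.allocations.pop(app_id)
        self.free_memory += memory_mb
        self.free_vcores += vcores

rm = ResourceManager(total_memory_mb=8192, total_vcores=8)
rm.submit("mapreduce-job", 4096, 4)   # granted
rm.submit("spark-job", 4096, 4)       # granted: cluster is now full
rm.submit("storm-topology", 1024, 1)  # rejected: no capacity left
rm.release("mapreduce-job")
rm.submit("storm-topology", 1024, 1)  # granted after resources are freed
```

The point of the sketch is the contract, not the policy: different application types (MapReduce, Spark, Storm) all negotiate with one shared resource pool, which is what lets YARN raise cluster utilization.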
(4)HBase
HBase is an important non-relational database in the Hadoop platform. It can support PB-level data storage and processing capabilities through linearly scalable deployment.
As a non-relational database, HBase is suitable for storing unstructured data, and its storage model is column-oriented.
HBase is an open source distributed NoSQL database and, together with MapReduce and HDFS, one of the core components of the Hadoop ecosystem. Modeled on Google's BigTable paper, HBase is a highly reliable, highly scalable, high-performance database designed to run on large-scale data sets. HBase has the following characteristics:
It uses column families as the basic storage unit and supports dynamic columns.
It supports automatic partitioning (regions), automatic load balancing, and automatic failover.
It supports semi-structured and unstructured data, with no fixed table schema.
It supports highly concurrent reads and writes, multi-version data, data compression, and data caching.
HBase is highly scalable: it can hold hundreds of billions of rows, and each row can have tens of thousands of columns.
HBase is commonly used to store semi-structured data such as logs, social media data, sensor data, network data, images and audio, etc. It can dynamically adjust storage and processing capacity according to needs, and can carry the query and analysis of large-scale data sets, which is an ideal choice for processing big data.
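The data model behind those characteristics can be sketched in a few lines of Python. This is a toy in-memory stand-in, not the real HBase client API: a cell is addressed by (row key, column family:qualifier), and every write keeps a timestamped version, newest first. Class and column names are illustrative:

```python
import time
from collections import defaultdict

class ToyHTable:
    """Minimal sketch of HBase's versioned, column-family data model."""

    def __init__(self):
        # (row key, "family:qualifier") -> [(timestamp, value), ...] newest first
        self._cells = defaultdict(list)

    def put(self, row, column, value, ts=None):
        """Write a new version of a cell; old versions are kept, not overwritten."""
        versions = self._cells[(row, column)]
        versions.append((ts if ts is not None else time.time(), value))
        versions.sort(reverse=True)

    def get(self, row, column):
        """Return the latest version of a cell, like a default HBase Get."""
        versions = self._cells.get((row, column))
        return versions[0][1] if versions else None

    def get_versions(self, row, column):
        """Return all stored versions, newest first (multi-version data)."""
        return [v for _, v in self._cells.get((row, column), [])]

t = ToyHTable()
t.put("user1", "info:city", "Beijing", ts=1)
t.put("user1", "info:city", "Shanghai", ts=2)
# A read returns the newest version; both versions remain stored.
```

Note how there is no fixed schema: any row can add a new `family:qualifier` column at write time, which is the "dynamic columns" property listed above.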
(5)Hive
Apache Hive is a data warehouse built on top of the Hadoop architecture. It provides data summarization, query, and analysis. Hive was originally developed by Facebook, and other companies, such as Netflix, now use and develop it.
Hive is an open source framework under the Apache Foundation and a Hadoop-based data warehouse tool. It can map structured data files onto database tables and provides simple SQL (Structured Query Language) query capabilities; the SQL statements are converted into MapReduce tasks for execution.
Hive meets the needs of database administrators who understand SQL but not MapReduce, letting them use the big data analysis platform smoothly.
Hive is a Hadoop-based data warehouse tool and infrastructure that maps structured data files onto database tables and provides an SQL-like query language, HiveQL (HQL), for querying data. Hive is designed to let SQL developers process large data sets: it translates HQL into MapReduce tasks for execution, using the Hadoop cluster to process massive data.
Hive supports multiple data sources, including HDFS, HBase, and local file systems. It stores massive data via built-in storage formats such as text, sequence files, and ORC, and provides features such as data compression, partitioning, and bucketing to optimize performance.
Hive has good extensibility and a rich ecosystem. Its functionality can be extended through UDFs (user-defined functions) and UDAFs (user-defined aggregate functions), and it integrates with many third-party tools, such as JDBC, ODBC, and Tableau.
In short, Hive is a powerful data warehouse tool, which has great practical value for scenarios that need to process large amounts of data.
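To illustrate what "HQL is translated into MapReduce" means, here is a toy Python simulation of the plan Hive would conceptually generate for a simple GROUP BY query. The query, table, and data are all hypothetical; real Hive produces an optimized physical plan, not this literal code:

```python
from collections import defaultdict

# Hypothetical HiveQL:  SELECT level, COUNT(*) FROM logs GROUP BY level;
# Conceptually this compiles to one MapReduce job:
#   map    -> emit the GROUP BY key with a count of 1
#   shuffle -> group the emitted pairs by key
#   reduce -> apply the aggregate (COUNT) per group

rows = [("error",), ("info",), ("error",), ("warn",), ("error",)]  # toy "logs" table

# Map: emit (group-by key, 1) for each row.
mapped = [(row[0], 1) for row in rows]

# Shuffle: group the pairs by key.
grouped = defaultdict(list)
for key, one in mapped:
    grouped[key].append(one)

# Reduce: COUNT(*) is the sum of the ones in each group.
counts = {key: sum(ones) for key, ones in grouped.items()}
# counts == {'error': 3, 'info': 1, 'warn': 1}
```

This is why Hive queries have batch latency rather than interactive latency: even a one-line GROUP BY becomes a full map/shuffle/reduce pass over the data.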
(6)Kafka
Apache Kafka is a distributed publish-subscribe messaging system, originally developed at LinkedIn and later donated to become an Apache project. By design, Kafka is a fast, scalable, inherently distributed, partitioned, and replicated commit log service.
Kafka is a distributed system that is easy to scale out. It provides high throughput for both publishing and subscribing, supports multiple subscribers, and automatically rebalances consumers when one fails. It is also well suited to real-time workloads.
Apache Kafka is a distributed stream processing platform developed under the Apache Software Foundation, characterized by high reliability, high scalability, and high throughput. Built on the publish/subscribe model, Kafka is mainly used to record streaming data, such as logs, events, and metrics.
Kafka's architecture includes the following components:
Broker: each node in a Kafka cluster is called a Broker; it is responsible for storing and processing data.
Topic: data records are stored in one or more Topics. Each Topic is divided into multiple Partitions, and each Partition can reside on a different Broker.
Producer: a Producer sends data to Topics in the Kafka cluster and can specify which Partition the data goes to.
Consumer: a Consumer subscribes to data from Topics in the Kafka cluster and processes it. Consumers can form Consumer Groups, and the Consumers in each Group jointly consume the data in one or more Partitions.
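The Topic/Partition/offset relationship can be shown with a toy in-memory model. This is not the real Kafka client API (which involves brokers, replication, and network protocols); it is a sketch of the data model only, with invented names:

```python
class ToyTopic:
    """Toy Kafka topic: an append-only log per partition, read by offset."""

    def __init__(self, name, num_partitions=2):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        """Append to the partition chosen by the key's hash; return (partition, offset).
        Records with the same key always land in the same partition, in order."""
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1

    def consume(self, partition, offset):
        """Read one record at a given offset; the log is never mutated by reads."""
        log = self.partitions[partition]
        return log[offset] if offset < len(log) else None

topic = ToyTopic("orders", num_partitions=2)
p, off = topic.produce("user-42", "order created")
topic.produce("user-42", "order paid")   # same key -> same partition, next offset
assert topic.consume(p, off) == "order created"
```

Two properties of real Kafka show through even in this toy: ordering is guaranteed only within a partition, and consuming does not delete data, so multiple Consumer Groups can each read the same log at their own offsets.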
Kafka is widely used in various scenarios, such as log collection, real-time data stream processing, event-driven architecture, etc. It is closely integrated with open source technologies such as Hadoop and Spark, and has become an indispensable part of the big data ecosystem.
(7)Storm
Storm is a free, open source, distributed, highly fault-tolerant real-time computing system. It handles continuous stream computing tasks and is widely used in real-time analytics, online machine learning, ETL, and other fields.
Storm is an open source distributed real-time computing system, mainly used to process a large amount of streaming data. It can acquire data in real time, process it, and send the processed data to other systems. Storm is highly scalable, fault-tolerant, and reliable, and can run in distributed clusters.
The core concept of Storm is the Topology, a description of a stream-processing flow made up of Spouts and Bolts. A Spout is a data-source component that feeds data into the topology. A Bolt is a processing and forwarding component: it receives data from a Spout (or another Bolt), processes it, and passes the result on to the next Bolt or to a sink. Each Bolt in a topology can run in parallel, making data processing more efficient.
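A Spout-to-Bolt pipeline can be sketched as chained Python generators. This is a single-process toy, not the real Storm API (real topologies run Bolts in parallel across a cluster and ship tuples over the network); the component names are illustrative:

```python
def sentence_spout():
    """Spout: the data source that emits tuples into the topology."""
    for sentence in ["storm processes streams", "streams never stop"]:
        yield sentence

def split_bolt(stream):
    """Bolt: split each incoming sentence tuple into word tuples."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Terminal bolt (sink): keep a running count per word."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the topology: spout -> split bolt -> count bolt.
counts = count_bolt(split_bolt(sentence_spout()))
# counts == {'storm': 1, 'processes': 1, 'streams': 2, 'never': 1, 'stop': 1}
```

Unlike the MapReduce word count earlier, nothing here waits for a complete input data set: each tuple flows through the Bolts as soon as the Spout emits it, which is the essence of stream processing.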
Storm also has a built-in fault-tolerance mechanism: when a cluster node fails, it automatically restarts tasks or moves them to other nodes, achieving highly reliable distributed computing. Storm also supports multiple data sources (such as Kafka and RabbitMQ) and data stores (such as HDFS, Cassandra, and Redis), so it can process different types of data and write results to different stores.
In short, Storm is a powerful real-time computing framework that is widely used in real-time data processing in various industries.
Comparison of Storm and Hadoop
| Role | Hadoop | Storm |
| --- | --- | --- |
| Master node | JobTracker | Nimbus |
| Slave node | TaskTracker | Supervisor |
| Application | Job | Topology |
| Worker process | Child | Worker |
| Computational model | Map / Reduce | Spout / Bolt |
Big data articles:
- Big data (1) Definition and characteristics
- Big data (2) Statistics related to big data industry
- Big data (3) Jobs related to big data
- Building a big data visualization dashboard with ECharts
- Big Data (4) Mainstream Big Data Technology