Learn big data technology and applications again (Lin Ziyu, Xiamen University)

4V concept

Big data covers both structured and unstructured data (Variety)

  1. Large volume of data (Volume)
  2. Fast processing speed, enabling second-level decision-making (Velocity)
  3. Low value density but high commercial value (Value)

Big data concepts and impacts

Use data as the driving force to discover and solve problems, subverting traditional methods:
All samples rather than sampling
Efficiency rather than absolute accuracy
Correlation rather than causation

Big data applications

The TV series House of Cards: production decisions guided by viewer data, a classic big data application
Google Flu Trends: predicting influenza outbreaks from search-query data

Key technologies of big data

1. Data storage
Distributed storage (originating from Google's GFS technology)
2. Data processing
Distributed processing; different needs call for different computing models:

Batch processing

   MapReduce / Spark

Real-time computing

   Stream computing (real-time): S4

Graph computing

   Pregel / GraphX

Interactive computing (query computing)

   Google Dremel, Hive
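
To make the batch model concrete, here is a minimal sketch of the classic Hadoop MapReduce word count in Java (the standard tutorial example; input and output paths come from the command line):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner cuts shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The map phase emits (word, 1) pairs, the shuffle groups them by key, and the reduce phase sums the counts; Spark expresses the same kind of computation in memory, which is why it reads and calculates faster.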

Big data and cloud computing

 **Virtualization and on-demand service**
 Public cloud
 Private cloud
 Hybrid cloud
 **Three layers:**
 IaaS, PaaS, SaaS

Two major data processing architectures: Hadoop

An Apache project
Developed in the Java language
Its two cores originated from technology published by Google:
HDFS + MapReduce
High reliability
High efficiency
High scalability
High fault tolerance
Low cost
Typical Hadoop applications: high-performance computing, data analysis, real-time query, data mining

2.2 Hadoop project structure

  1. HDFS: distributed file storage

  2. YARN: resource management and scheduling

  3. MapReduce: offline batch processing

  4. Tez: a DAG (directed acyclic graph) computation and query processing framework that runs on YARN
    Tez organizes a job as a directed acyclic graph of tasks

  5. Spark: performs computation in memory to speed up data reading and calculation; a MapReduce-like parallel framework

  6. Hive: a data warehouse on the Hadoop platform, used for enterprise decision analysis over large amounts of historical data
    Converts SQL statements into MapReduce jobs (a JDBC query sketch follows this list)

  7. Pig: lightweight stream data processing and analysis
    A single Pig Latin statement can replace multiple MapReduce programs, simplifying processing

  8. Oozie: workflow and job scheduling system

  9. ZooKeeper: distributed coordination service
    Provides distributed coordination and consistency services
    Distributed locks
    Cluster management

  10. HBase: a non-relational distributed database on Hadoop, supporting random reads over very large tables

  11. Flume: log collection
    Streaming log processing and analysis

  12. Sqoop: transfers data between Hadoop and traditional databases; HDFS, HBase, and Hive can import from and export to each other

  13. Ambari: deployment tool for installing a complete Hadoop suite
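
Since Hive compiles SQL into MapReduce jobs (item 6 above), a common way to query it from Java is through the HiveServer2 JDBC driver. A minimal sketch, assuming a HiveServer2 instance on localhost:10000 and a hypothetical word_log table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver; host, port, user, and table are assumptions.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hadoop", "");
         Statement stmt = conn.createStatement();
         // Hive turns this SQL into MapReduce (or Tez) jobs behind the scenes.
         ResultSet rs = stmt.executeQuery(
             "SELECT word, COUNT(*) AS cnt FROM word_log GROUP BY word")) {
      while (rs.next()) {
        System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}
```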

2.3 Installation of Linux and Hadoop

Environment used:
VMware Workstation 12 (or above) + Ubuntu 16.04 desktop amd64 (don't use a higher Ubuntu version: there is a high probability that MySQL and many other software versions will be incompatible and throw too many errors; don't ask me how I know)
Workstation installation tutorials can be found on Baidu

Find version 16.04 on the Ubuntu Kylin official website (portal link)
Download the corresponding version
After downloading, use Workstation to create a new virtual machine

  1. Just choose the Typical installation type

  2. Use the installer disc image file, the ubuntukylin-16.04 version (please ignore the version shown in the screenshot)

  3. Set the name, account, and password

  4. Set the virtual machine name

  5. It is recommended to set the disk size to 40 GB or larger, stored as a single file for easy deletion

  6. Complete the installation, start the virtual machine, and confirm the hardware-related settings; the Ubuntu Kylin installation takes a long time

2.3.2 Install Hadoop

Steps to install Hadoop 3.1.3 (portal link)

2.4 Deployment and use of Hadoop cluster

A cluster is deployed so that its machines cooperate to complete jobs
Cluster hardware configuration: NameNode and DataNode
The NameNode is equivalent to the directory: it holds the file system metadata
DataNodes store the actual data
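
A minimal sketch of this division of labor from the client side, using the HDFS Java API (the NameNode URI and file path below are assumptions): the client asks the NameNode for block locations, then streams the bytes directly from DataNodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address for a pseudo-distributed setup.
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
    // Opening the file contacts the NameNode only for metadata (block locations);
    // the data itself is read from the DataNodes that hold the blocks.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/user/hadoop/input/data.txt"))))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}
```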
MapReduce jobs:
The JobTracker splits the entire job into multiple small tasks and coordinates their processing
TaskTrackers are deployed on different machines to execute and track the small tasks assigned by the JobTracker
The Secondary NameNode serves as a cold backup

Most machines run a DataNode and a TaskTracker for data processing and are configured accordingly.
The NameNode is the general manager: it manages all kinds of metadata and provides the service, keeping a large amount of data in memory,
so it requires a higher hardware configuration.
Hadoop cluster working status
Cluster construction principles
Network topology of the cluster: connections within a rack and connections between racks

Deploy services on cloud platforms

Origin: blog.csdn.net/huangdxian/article/details/120734446