Relearning Big Data Technology and Applications (Lin Ziyu, Xiamen University)
4V concept
- Volume: huge amount of data
- Velocity: fast processing speed (second-level decision-making)
- Variety: structured and unstructured data
- Value: low value density but high commercial value
Big data concepts and impacts
Use data as the driving force to discover and solve problems, subverting traditional methods:
- Full samples rather than sampling
- Efficiency rather than precision
- Correlation rather than causation
Big data applications
- The TV series House of Cards: Netflix used viewing data to guide production decisions
- Google Flu Trends: predicting influenza outbreaks from search queries
Key technologies of big data
1. Data storage
Distributed storage (originating from Google technologies such as GFS and Bigtable)
2. Data processing
Distributed processing, with different frameworks for different needs:
- Batch processing: MapReduce / Spark (a word-count sketch follows this list)
- Stream computing (real-time): S4
- Graph computing: Pregel, GraphX
- Interactive (query) computing: Google Dremel, Hive
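To make "batch processing" concrete, below is the classic MapReduce word count in Java against Hadoop's standard `org.apache.hadoop.mapreduce` API; the input and output paths are supplied on the command line and are illustrative only:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run it with something like `hadoop jar wordcount.jar WordCount /input /output`; the framework splits the input, shuffles the (word, count) pairs to reducers, and writes the totals.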
Big data and cloud computing
**Virtualization and on-demand services**
- Public cloud
- Private cloud
- Hybrid cloud
**Three service layers:**
IaaS, PaaS, SaaS (infrastructure, platform, and software as a service)
2. Big data processing architecture: Hadoop
An Apache project, developed in Java.
Its two cores originated from technologies published by Google (see the HDFS sketch below):
HDFS + MapReduce
- High reliability
- High efficiency
- High scalability
- High fault tolerance
- Low cost
Application scenarios: high-performance computing, data analysis, real-time queries, data mining.
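Since HDFS is one of the two cores, here is a minimal sketch of a client writing and then reading a file through HDFS's Java `FileSystem` API. The `localhost:9000` address and the file path are assumptions matching a pseudo-distributed setup:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumption: a pseudo-distributed HDFS listening on localhost:9000.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/hadoop/hello.txt"); // hypothetical path
        // Write: the client asks the NameNode for metadata, then streams
        // the bytes to DataNodes.
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }
        // Read the file back.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[32];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```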
2.2 Hadoop project structure
- HDFS: distributed file storage (see the sketch above)
- YARN: resource management and scheduling
- MapReduce: offline batch processing
- Tez: DAG (directed acyclic graph) computing framework that runs on YARN, used for query processing
- Spark: performs computation in memory to speed up data reads and calculation
- Hive: a data warehouse for enterprise decision analysis over large amounts of historical data; converts SQL statements into MapReduce jobs (a JDBC sketch follows this list)
- Pig: data-flow processing; one Pig Latin statement can replace many MapReduce statements, simplifying processing
- Oozie: workflow scheduling system
- ZooKeeper: distributed coordination service (distributed locks, cluster management)
- HBase: non-relational distributed database on Hadoop
- Flume: log collection and analysis
- Sqoop: transfers data between Hadoop and traditional relational databases
- Ambari: deployment tool
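To see the Hive point in action, here is a minimal Java JDBC sketch that ships SQL off to Hive. The HiveServer2 address, the `hadoop` user, and the `sales` table are all assumptions for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (ships with hive-jdbc).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumption: HiveServer2 running locally on its default port 10000.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hadoop", "");
        Statement stmt = conn.createStatement();
        // Hypothetical table `sales`: Hive compiles this SQL into
        // MapReduce (or Tez) jobs behind the scenes.
        ResultSet rs = stmt.executeQuery(
                "SELECT region, SUM(amount) FROM sales GROUP BY region");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}
```

This is exactly why one query line can replace a hand-written MapReduce job: the compilation to jobs is Hive's work, not the user's.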
2.3 Installation of Linux and Hadoop
Use VMware Workstation 12 (or above) + the Ubuntu Kylin 16.04 desktop amd64 image. (Stick with 16.04: with a higher Ubuntu version there is a high probability that MySQL and other software versions will be incompatible and throw endless errors. Don't ask me how I know.)
Workstation installation tutorials can be found on Baidu.
Download the 16.04 image from the Ubuntu Kylin official website.
After downloading, use Workstation to create a new virtual machine:
- Choose the "Typical" installation.
- Use the installer disc image file: the Ubuntu Kylin 16.04 ISO (ignore the version shown in the screenshot).
- Set the name, account, and password.
- Set the virtual machine name.
- Disk size: 40 GB or larger is recommended; store the disk as a single file so it is easy to delete later.
- Confirm the hardware settings and finish the wizard; the Ubuntu Kylin installation itself takes quite a long time.
2.3.2 Install Hadoop
Steps to install Hadoop 3.1.3 (tutorial link).
2.4 Deployment and use of Hadoop cluster
Cluster deployment should take the completion of jobs (the expected workload) into account.
Cluster hardware roles: NameNode and DataNode
- The NameNode is like the directory: it holds the file system's metadata.
- DataNodes store the actual data.
MapReduce jobs:
- The JobTracker splits a job into many small tasks and coordinates their processing.
- TaskTrackers, deployed on different machines, execute the small tasks assigned by the JobTracker.
- The SecondaryNameNode serves as a cold backup for the NameNode.
Most machines run a DataNode plus a TaskTracker and do the data processing; they only need modest configurations.
The NameNode is the "general manager": it manages all the metadata and serves requests, keeping a lot of data in memory, so it requires a higher hardware configuration.
Hadoop cluster working status
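Besides the NameNode web UI (port 9870 in Hadoop 3.x), the working status can be queried programmatically through the live-DataNode report that the NameNode builds from DataNode heartbeats. A minimal sketch, assuming the NameNode at `localhost:9000`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ClusterStatus {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        // DistributedFileSystem exposes the DataNode report the NameNode
        // maintains from heartbeats.
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        for (DatanodeInfo dn : dfs.getDataNodeStats()) {
            System.out.printf("%s: capacity=%d bytes, remaining=%d bytes%n",
                    dn.getHostName(), dn.getCapacity(), dn.getRemaining());
        }
        fs.close();
    }
}
```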
Cluster construction principles
Network topology of the cluster: connections within a rack and connections between racks.
Services can also be deployed on cloud platforms.