Introduction to Zhixing Education Project

Project 1: Big Data Data Warehouse Project

Project Name: Zhixing Education Data Warehouse Project

Project structure:

Data source: data from OLTP systems such as telephone consulting, offline teaching, and online education, mostly stored in MySQL.
Data extraction: Sqoop provides two-way synchronization between the relational databases and the big data cluster.
Data storage: HDFS
Data cleaning: cleaning, transformation, and statistical analysis are all performed with Hive, managed through CM (Cloudera Manager).
Data analysis: also performed with Hive under CM management.
Data synchronization: Sqoop, as in the extraction step, synchronizes data in both directions between the relational databases and the big data cluster.
OLAP data service: the commonly used MySQL database.
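As a rough sketch of the extraction step, the helper below assembles a `sqoop import` command line for one MySQL table. The host, database, user, password file, table, and HDFS path are all hypothetical placeholders, not values taken from this project; only the Sqoop 1.4.x options themselves are standard.

```python
import shlex


def build_sqoop_import(jdbc_url, user, password_file, table, target_dir, mappers=1):
    """Return the argv list for a basic `sqoop import` of one table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", user,
        "--password-file", password_file,   # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,
        "--fields-terminated-by", "\\t",    # Sqoop interprets the literal string \t as a tab
        "--num-mappers", str(mappers),
    ]


# All values below are illustrative assumptions.
cmd = build_sqoop_import(
    jdbc_url="jdbc:mysql://mysql-host:3306/consult_db",
    user="etl_user",
    password_file="/user/etl/.mysql_password",
    table="web_chat_ems",
    target_dir="/user/hive/warehouse/ods/web_chat_ems",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting mistakes when it is later passed to a scheduler or to `subprocess.run`.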


Development environment:
JDK: 1.8
Scala: 2.11.8
CDH 6.2.1: zookeeper-3.4.5-cdh6.2.1, hadoop-3.0.0-cdh6.2.1, hive-2.1.1-cdh6.2.1, hue-4.3.0-cdh6.2.1
Sqoop: sqoop-1.4.7-cdh6.2.1
MySQL: 5.7
Zeppelin: 0.8.0

Project description:

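Catalyzed by the "Internet Plus" concept, the education market is booming. More and more educational institutions and platforms keep emerging, covering online learning and offline training, K12 education and vocational education. The platforms that focus on user service and teaching quality will win out in the end. Current pain points for the company:
1. Data volumes are large, and reading directly from the existing MySQL business databases cannot meet the performance and efficiency needs of business statistics.
2. There are many systems and the data is scattered; there is no end-to-end data linkage across the full business chain of marketing, consulting, registration, teaching, and so on.
3. Statistical analysis is difficult and labor-intensive. Metadata and data sets are not stored in a standardized way, so whenever a business department needs data analyzed from a new angle, programmers and DBAs have to rush to query data and build reports. At year end in particular, departments queue up waiting for the DBA to help produce data.

How to improve user service and teaching quality is a problem every institution faces. Because information is not fully shared and used, years of IT adoption have left schools with large amounts of data, yet the information silos have never been broken down. The data cannot be further mined, analyzed, processed, or organized, and so cannot provide scientific, effective data support for management decisions in education, teaching, R&D, general affairs, and other areas.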

Big data technology makes it possible to mine and analyze massive volumes of user behavior data, improve the platform's service quality based on the results, and ultimately meet users' needs. The education big data analysis platform project applies big data technology to education and training in order to provide data support for business operations:
1. Establish a group-wide data warehouse that unifies the group's data center, then pre-process and store the scattered business data.
2. Based on business analysis needs, mine and analyze the massive user behavior data and build customized multi-dimensional data sets, forming data marts that serve different scenarios and subjects.
3. Select and manage the front-end data presentation: choose suitable tools for displaying statistics and analysis results.

Project requirements:

4. Online Education Business Requirements
4.1 Visit and Consultation User Data Kanban
4.2 Intended User Kanban
4.3 Valid Lead Kanban
4.4 Registered User Kanban
4.5 Student Attendance Kanban

Responsibility description:
1. Participate in the preliminary project analysis, design the overall architecture of the system
2. Data acquisition design, real-time processing part design
3. Storm write path and HBase batch-write design
4. HBase incremental docking scheme design, HBase secondary index and paging scheme design
5. Hive data warehouse design and maintenance, data subject extraction, data dimension analysis

Data warehouse introduction:
Snowflake model:
When one or more dimension tables are not connected to the fact table directly but reach it through other dimension tables, the diagram looks like several snowflakes joined together, hence the name snowflake model.
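A minimal sketch of a snowflake layout, using an in-memory SQLite database only so that it runs anywhere. The table and column names (`fact_visit`, `dim_city`, `dim_province`) are hypothetical and not taken from the project. The fact table references the city dimension, and the province dimension is reachable only through the city dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_province (province_id INTEGER PRIMARY KEY, province_name TEXT);
CREATE TABLE dim_city (
    city_id INTEGER PRIMARY KEY,
    city_name TEXT,
    province_id INTEGER REFERENCES dim_province(province_id)
);
CREATE TABLE fact_visit (
    visit_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    city_id INTEGER REFERENCES dim_city(city_id)
);
INSERT INTO dim_province VALUES (1, 'Hebei'), (2, 'Guangdong');
INSERT INTO dim_city VALUES (10, 'Shijiazhuang', 1), (11, 'Tangshan', 1), (20, 'Shenzhen', 2);
INSERT INTO fact_visit VALUES (1, 100, 10), (2, 101, 11), (3, 102, 20), (4, 100, 10);
""")

# The fact table joins dim_city directly; dim_province is reached via dim_city.
rows = conn.execute("""
SELECT p.province_name, COUNT(DISTINCT f.customer_id) AS visitors
FROM fact_visit f
JOIN dim_city c ON f.city_id = c.city_id
JOIN dim_province p ON c.province_id = p.province_id
GROUP BY p.province_name
ORDER BY p.province_name
""").fetchall()
print(rows)
```

In a star model the province name would instead be stored redundantly on the city dimension, which saves the second join at the cost of duplicated data.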
Kanban 1 introduction:

Visit and consultation user data Kanban
The customer visit and consultation subject, as the name suggests, mainly analyzes customers' visit data and consultation data. Demand research showed, however, that the visit data here actually means the number of visiting customers, not the number of visits. The raw data comes from the MySQL business database of the consulting system.

There are two core indicators: the number of visiting customers and the number of consulting customers

Dimensions include: year, quarter, month, day, hour (hourly segments within a day), region, source channel, search source, session source page, and total visits.
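The difference between counting visiting customers and counting visits can be shown with a small sketch. The record layout (customer id, visit timestamp, consultation flag) is an assumption made for illustration. Customers are collected into sets per time key, so a customer who visits twice in the same hour is counted once.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (customer_id, visit_time, had_consultation)
records = [
    ("c1", "2019-07-01 09:05:00", True),
    ("c1", "2019-07-01 09:40:00", False),   # second visit by c1 in the same hour
    ("c2", "2019-07-01 09:50:00", False),
    ("c3", "2019-07-01 10:15:00", True),
]


def hourly_indicators(records):
    """Return {(year, quarter, month, day, hour): (visiting_customers, consulting_customers)}."""
    visitors = defaultdict(set)
    consulters = defaultdict(set)
    for customer_id, ts, consulted in records:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        quarter = (t.month - 1) // 3 + 1
        key = (t.year, quarter, t.month, t.day, t.hour)
        visitors[key].add(customer_id)       # sets deduplicate repeat visits
        if consulted:
            consulters[key].add(customer_id)
    return {k: (len(v), len(consulters[k])) for k, v in sorted(visitors.items())}


result = hourly_indicators(records)
print(result)
```

In Hive the same deduplication would typically be expressed with `COUNT(DISTINCT ...)` grouped by the time dimensions.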

The overall process:


Advantages and disadvantages of incremental schemes:

Problem:
Our DWS layer holds result data aggregated by year, quarter, month, and other dimensions.
When a new day's data is added, the existing results for the current year, current quarter, and current month become stale and need to be recalculated.
The question is how to handle the stale data already in the DWS table.
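To make the problem concrete, the sketch below lists which aggregate periods are affected when one day of data is loaded. The key formats (`2019`, `2019-Q3`, `2019-07`) are assumptions chosen for illustration, not the project's actual partition keys.

```python
from datetime import date


def stale_periods(load_date):
    """Return the year, quarter, and month keys whose aggregates must be recalculated."""
    quarter = (load_date.month - 1) // 3 + 1
    return {
        "year": f"{load_date.year}",
        "quarter": f"{load_date.year}-Q{quarter}",
        "month": f"{load_date:%Y-%m}",
    }


periods = stale_periods(date(2019, 7, 2))
print(periods)
```

Day-level and hour-level results for earlier days are unaffected, since a new day only adds rows to periods that are still open.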
Method 1:
Delete the stale data
Advantages:
BI-friendly: there is no confusing historical data, and BI reads the latest results directly.
The data in the table is clear.
Disadvantages:
The implementation is complex.
It breaks the data warehouse design principle of avoiding deletes wherever possible.
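Method 1 can be sketched as a delete-then-insert refresh. SQLite stands in for the warehouse here only to keep the example runnable; the table `dws_visit` and its columns are hypothetical. In Hive, non-transactional tables do not support row-level `DELETE`, so the same effect is usually achieved by overwriting the affected partition instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dws_visit (time_type TEXT, time_key TEXT, visitors INTEGER)")
conn.executemany("INSERT INTO dws_visit VALUES (?, ?, ?)", [
    ("month", "2019-06", 300),    # closed period, must survive the refresh
    ("month", "2019-07", 120),    # current month, stale once a new day is loaded
    ("year", "2019", 2000),       # current year, stale as well
])


def refresh(conn, recalculated):
    """Delete the stale row for each recalculated period, then insert the new result."""
    with conn:  # commits if every statement succeeds, rolls back on an exception
        for time_type, time_key, visitors in recalculated:
            conn.execute(
                "DELETE FROM dws_visit WHERE time_type = ? AND time_key = ?",
                (time_type, time_key),
            )
            conn.execute(
                "INSERT INTO dws_visit VALUES (?, ?, ?)",
                (time_type, time_key, visitors),
            )


refresh(conn, [("month", "2019-07", 135), ("year", "2019", 2015)])
rows = conn.execute("SELECT * FROM dws_visit ORDER BY time_type, time_key").fetchall()
print(rows)
```

After the refresh the table holds exactly one row per period, which is why BI tools can read it without any filtering.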
Method 2:
Add a new column that records when each row was calculated; when querying, take only the rows with the latest calculation time.
Advantages:
Changes to historical results are kept in the table.
Nothing is deleted, so the data warehouse principle is not violated.
Disadvantages:
BI analysis has to filter for the latest data (slightly unfriendly).
The table structure must be modified (the full load has to be rerun).
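A minimal sketch of Method 2, again using SQLite as a stand-in. The column name `calc_time` and the table `dws_visit` are assumptions. Every recalculation appends a row, and readers select the row with the latest `calc_time` for each period.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dws_visit (time_type TEXT, time_key TEXT, visitors INTEGER, calc_time TEXT)"
)
conn.executemany("INSERT INTO dws_visit VALUES (?, ?, ?, ?)", [
    ("month", "2019-06", 300, "2019-07-01 01:00:00"),
    ("month", "2019-07", 120, "2019-07-02 01:00:00"),   # older result, kept as history
    ("month", "2019-07", 135, "2019-07-03 01:00:00"),   # latest result for July
])

# Keep only the most recently calculated row per (time_type, time_key).
latest = conn.execute("""
SELECT d.time_type, d.time_key, d.visitors
FROM dws_visit d
JOIN (
    SELECT time_type, time_key, MAX(calc_time) AS calc_time
    FROM dws_visit
    GROUP BY time_type, time_key
) m
  ON d.time_type = m.time_type
 AND d.time_key = m.time_key
 AND d.calc_time = m.calc_time
ORDER BY d.time_key
""").fetchall()
print(latest)
```

The extra join is the filtering cost that the disadvantages list refers to; wrapping it in a view is one way to hide it from BI users.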
Method 3:
Add a new table
A separate table is generated for each day's results (one table per day).
Advantages:
Each table is clear and holds the results of one specific day.
Changes to historical results are also recorded, across multiple tables.
Disadvantages:
There is a lot of data redundancy (if the business needs it, redundancy is not a problem).
It is unfriendly to BI: the table changes every day, so if the BI tool does not support dynamic rules for switching tables automatically, the table must be changed manually.

Kanban 2:

Advantages and disadvantages of incremental schemes:

Zipper table


Origin blog.csdn.net/xianyu120/article/details/111870894