One year of SQL database experience, and you say the data model is worthless? Then you don't understand the data warehouse

In my work, I usually see Excel used directly for analysis and reporting. In more recent Internet analytics work, I fetch the data with SQL and then analyze it in Excel.

 

Analysis is inseparable from BI, data warehousing, and data modeling. Big data platforms such as Spark and Hadoop are also things people in this industry are expected to know.

However, compared with those architectures and algorithms, data structures and models gave me more trouble.

Looking back, I am still in awe of data structures and algorithms in the broad sense. At the same time, I am glad to have mastered the data structures that solve problems in the information field, namely the data models of relational databases.

If general-purpose data structures such as linked lists, balanced trees, and graphs are the basis of all programming, then the "data structures" of an RDBMS, such as normal forms, star schemas, snowflake schemas, and wide tables, are the foundation of the information world. Even if you never fully master them, they let you solve countless practical problems and bring a great sense of accomplishment and satisfaction.

To make the data model easier to grasp intuitively, I will use one example here: comparing the changes in sales caused by price fluctuations before and after Double 11 and Double 12. I will share how to design the table structure to meet this analysis need.

Modeling for this kind of analysis cannot be discussed without the data models of Kimball and Inmon. The two models are completely different, and so are the conveniences and challenges they bring to a project.

Of course, there are other models as well, such as Data Vault and Anchor modeling.

Let’s start with the architecture

[Figure: Inmon's hub architecture]

The figure above shows Inmon's hub architecture. In Inmon's theory the data warehouse is not the final deliverable; it is a store that integrates the data of all key entities and business processes in the enterprise. To serve each department's analysis needs, the data warehouse branches out into the data marts each business requires, and every business extracts its data from the data mart assigned to it.

From this diagram it is easy to see that the data warehouse is only responsible for collecting data, like a hub; in the end the data must still be distributed to the data marts.

Kimball's architecture is different. As shown in the figure below, it also has a large data warehouse, but it does not treat the data mart as a separate deliverable.

[Figure: Kimball's bus architecture, with the data warehouse schema annotated as a star model]

In Kimball's theoretical model, the data mart is never a formal deliverable but a natural by-product of the ETL process. When ETL collects business data into staging, it organizes the data by entity and business process into an ODS (Operational Data Store) layer. Any single business department can query data from the ODS, which makes it functionally similar to Inmon's data mart.

Once the data is finally aggregated into the data warehouse, it naturally carries an enterprise-wide view, which resolves the problem of seeing the trees but not the forest. For example, when corporate profits decline, we can analyze cost, order volume, and unit price together instead of focusing only on order volume.
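The profit example can be made concrete with a small piece of arithmetic. The sketch below assumes the simplest possible model, profit = order volume × (unit price − unit cost), and uses invented numbers; neither the formula nor the figures come from a real dataset. It changes one factor at a time to show how much each one contributes on its own.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Period:
    """Hypothetical summary of one period (all numbers are made up)."""
    order_volume: int
    unit_price: int
    unit_cost: int

    @property
    def profit(self) -> int:
        # Assumed model: profit = volume * (price - cost)
        return self.order_volume * (self.unit_price - self.unit_cost)


before = Period(order_volume=1000, unit_price=50, unit_cost=30)
after = Period(order_volume=1100, unit_price=45, unit_cost=32)

# Change one factor at a time, holding the other two at their old values.
effects = {
    field: replace(before, **{field: getattr(after, field)}).profit - before.profit
    for field in ("order_volume", "unit_price", "unit_cost")
}
total_change = after.profit - before.profit

print(effects)
print(total_change)
```

In this made-up case order volume actually rose, so looking at order volume alone would hide the fact that price and cost drove the decline. The single-factor effects do not add up exactly to the total change, because the factors also interact with each other.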

Kimball's theory is therefore more of a strategy in which data flows from the parts to the whole. The final deliverable, the data warehouse, acts like an enterprise data bus: whoever needs data can get it there without switching between multiple databases.

Comparing how the data models are implemented

A colleague once asked me why our tables are designed with many redundant fields instead of strictly following third normal form (3NF). The answer lies in Kimball's dimensional model. In the Kimball bus architecture diagram, I deliberately annotated the schema of the data warehouse as a star model.

The idea is easy to understand: the fact table in the center is connected directly to each surrounding dimension table, with only one level of joins. This is the essence of the Kimball data model and its biggest difference from Inmon, whose data models are all ER models that take normalization to the extreme.

Let's look at Kimball's star model dimensional modeling:

[Figure: star model around SalesOrder with the Employee, Time, and Components dimensions]

It is very intuitive. Around the SalesOrder (sales order) business process, suppose there are three dimensions, that is, three factors that affect an order: Employee, Time, and Components, namely people, time, and goods. In practice there are far more than 3; there may be 300, and Internet businesses can even have 3,000.

The people dimension also includes each person's department, address, and rank. The time dimension here is a simple one; in practice there are multiple accounting periods, which makes time somewhat more complicated. The goods dimension is the product: it includes the manufacturer, its address, and the factory director, as well as attributes of the product itself such as size and color.
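A minimal sketch of such a star model, using Python's built-in sqlite3 module. Only the SalesOrder fact and the three dimensions come from the text; the table names (fact_sales_order, dim_employee, dim_date, dim_component), the columns, the dates, and all sample rows are hypothetical and chosen only for illustration. It also runs the Double 11 price-versus-sales comparison mentioned earlier.

```python
import sqlite3

star = sqlite3.connect(":memory:")
star.executescript("""
-- Each dimension is one wide, denormalized table (hypothetical columns).
CREATE TABLE dim_employee (
    employee_key INTEGER PRIMARY KEY,
    name TEXT, department TEXT, address TEXT, job_rank TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    full_date TEXT, period TEXT
);
CREATE TABLE dim_component (
    component_key INTEGER PRIMARY KEY,
    name TEXT, size TEXT, color TEXT,
    manufacturer TEXT, manufacturer_address TEXT
);
-- The fact table sits in the center and points at every dimension.
CREATE TABLE fact_sales_order (
    employee_key INTEGER REFERENCES dim_employee(employee_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    component_key INTEGER REFERENCES dim_component(component_key),
    unit_price REAL,
    quantity INTEGER
);
""")
star.executemany("INSERT INTO dim_employee VALUES (?, ?, ?, ?, ?)", [
    (1, "Ann", "Online", "Hangzhou", "Senior"),
    (2, "Bob", "Retail", "Shanghai", "Junior"),
])
star.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [
    (20201110, "2020-11-10", "before"),
    (20201111, "2020-11-11", "double11"),
])
star.execute(
    "INSERT INTO dim_component VALUES (1, 'Keyboard', 'M', 'black', 'Acme', 'Shenzhen')"
)
star.executemany("INSERT INTO fact_sales_order VALUES (?, ?, ?, ?, ?)", [
    (1, 20201110, 1, 100.0, 10),
    (2, 20201110, 1, 100.0, 5),
    (1, 20201111, 1, 80.0, 30),
    (2, 20201111, 1, 80.0, 20),
])

# Price versus sales, before and on Double 11: one join to the date dimension.
by_period = star.execute("""
    SELECT d.period, AVG(f.unit_price), SUM(f.quantity),
           SUM(f.unit_price * f.quantity)
    FROM fact_sales_order AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY d.period
    ORDER BY MIN(d.date_key)
""").fetchall()

# Sales by department: still only one join, because department lives
# directly in dim_employee.
by_department = star.execute("""
    SELECT e.department, SUM(f.quantity)
    FROM fact_sales_order AS f
    JOIN dim_employee AS e ON e.employee_key = f.employee_key
    GROUP BY e.department
    ORDER BY e.department
""").fetchall()

print(by_period)
print(by_department)
```

Whatever attribute is used for grouping, the query needs a single join from the fact table to one dimension table. The cost is that values such as department and address are repeated on every employee row.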

Many beginners are confused by this design: why does one table hold so much seemingly redundant data? Why not split it up according to 3NF? An especially important principle applies here: trade space for time.

When all of these attributes are used for dimensional analysis, they are usually computed and stored in advance to save join time. Ad hoc queries that use GROUP BY to aggregate over arbitrary combinations of attributes are very slow without a suitable index.

To improve efficiency, the statistics and aggregates for these combinations have to be pre-computed and stored. Many OLAP engines are based on this principle, for example SQL Server Analysis Services cubes and Apache Kylin.
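The principle of trading space for time can be sketched in a few lines of plain Python. This is a toy illustration only, not how SQL Server or Kylin are actually implemented; the `build_cube` function, the dimension names, and the sample rows are all hypothetical. It pre-computes the total for every combination of dimensions, so that a later query becomes a dictionary lookup instead of a scan.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical dimension names and order rows, for illustration only.
DIMENSIONS = ("department", "product", "date")
orders = [
    {"department": "Online", "product": "Keyboard", "date": "2020-11-10", "amount": 1000.0},
    {"department": "Retail", "product": "Keyboard", "date": "2020-11-10", "amount": 500.0},
    {"department": "Online", "product": "Keyboard", "date": "2020-11-11", "amount": 2400.0},
    {"department": "Retail", "product": "Mouse", "date": "2020-11-11", "amount": 600.0},
]


def build_cube(rows, dimensions):
    """Pre-aggregate `amount` for every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for group in combinations(dimensions, size):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in group)
                totals[key] += row["amount"]
            cube[group] = dict(totals)
    return cube


cube = build_cube(orders, DIMENSIONS)

# Queries are now lookups; no scan or GROUP BY happens at query time.
grand_total = cube[()][()]
online_total = cube[("department",)][("Online",)]
keyboard_on_1110 = cube[("product", "date")][("Keyboard", "2020-11-10")]

print(len(cube), grand_total, online_total, keyboard_on_1110)
```

With n dimensions there are 2^n groupings to store, which is the "space" side of the trade. Real engines limit this growth by choosing which combinations to materialize.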

Kimball gave this data model a name: the "star model". As the final deliverable, it is the soul of the data warehouse.

Kimball's theory does not give up on the data mart; it implements the data mart in the ETL stage, using another model called the "snowflake model". Its function is similar to that of Inmon's data mart, and the data model is essentially the same: a standard ER model, that is, a 3NF structure.

[Figure: snowflake model]

In the people dimension, only the attributes of the person themselves are kept, such as gender, height, and age. Subsidiary attributes such as address, department, and rank are stored in separate sub-tables. The same is true for the other two dimensions: intrinsic attributes and subsidiary attributes are stored separately. The disadvantage is that queries need more joins, which easily leads to slow performance.
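A minimal sketch of the same idea in snowflake form, again with Python's built-in sqlite3 module. The table and column names and the sample rows are hypothetical. The employee dimension keeps only the person's own attributes, while department and address move into sub-tables.

```python
import sqlite3

snow = sqlite3.connect(":memory:")
snow.executescript("""
-- Subsidiary attributes live in their own sub-tables (hypothetical schema).
CREATE TABLE dim_address (address_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_department (
    department_key INTEGER PRIMARY KEY, department_name TEXT
);
-- The employee dimension keeps only the person's own attributes plus keys.
CREATE TABLE dim_employee (
    employee_key INTEGER PRIMARY KEY,
    name TEXT, gender TEXT,
    department_key INTEGER REFERENCES dim_department(department_key),
    address_key INTEGER REFERENCES dim_address(address_key)
);
CREATE TABLE fact_sales_order (
    employee_key INTEGER REFERENCES dim_employee(employee_key),
    quantity INTEGER
);
INSERT INTO dim_address VALUES (1, 'Hangzhou'), (2, 'Shanghai');
INSERT INTO dim_department VALUES (1, 'Online'), (2, 'Retail');
INSERT INTO dim_employee VALUES
    (1, 'Ann', 'F', 1, 1), (2, 'Bob', 'M', 2, 2), (3, 'Cid', 'M', 1, 2);
INSERT INTO fact_sales_order VALUES (1, 10), (2, 5), (3, 7), (1, 3);
""")

# Grouping by department and city now takes three joins instead of one.
by_department_city = snow.execute("""
    SELECT dep.department_name, a.city, SUM(f.quantity)
    FROM fact_sales_order AS f
    JOIN dim_employee AS e ON e.employee_key = f.employee_key
    JOIN dim_department AS dep ON dep.department_key = e.department_key
    JOIN dim_address AS a ON a.address_key = e.address_key
    GROUP BY dep.department_name, a.city
    ORDER BY dep.department_name, a.city
""").fetchall()

print(by_department_city)
```

Each department name and city is stored exactly once, which helps consistency, but every additional level of normalization adds another join to the analysis query.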

So in practice, which theory should we use to design the data warehouse architecture, and which data model should we adopt?

There is no silver bullet in the real world; everything depends on the complexity of the business. Kimball's approach is clearly a better fit for BI suites, but it leaves you with the complexity of handling redundant data. Inmon's approach solves the data consistency problem, but performance remains a long-standing difficulty.

Origin blog.csdn.net/yuanziok/article/details/109093470