Analysis of Data Warehouse and OLAP

1. What is a data warehouse?

  • Official explanation
    A data warehouse (Data Warehouse) is a subject-oriented (Subject Oriented), integrated (Integrated), relatively stable (Non-Volatile) collection of data that reflects historical changes (Time Variant), used to support management decision making (Decision Making Support).
    (Bill Inmon, Building the Data Warehouse, 1991)

(1) Subject-oriented

The data in the data warehouse is organized according to certain subject areas.

A subject is an abstract concept: it refers to the key aspects that users care about when using the data warehouse to make decisions, and a single subject usually relates to several operational information systems. By contrast, the data organization of an operational database is oriented to transaction-processing tasks, with each business system separate from the others.

(2) Integrated

The data in the data warehouse is obtained by extracting and cleaning the original, scattered source data and then systematically processing, summarizing, and consolidating it. Inconsistencies in the source data must be eliminated to ensure that the information in the data warehouse is consistent, enterprise-wide global information. Transaction-oriented operational databases, by contrast, are usually tied to specific applications; they are independent of one another and often heterogeneous.

(3) Relatively stable

Data warehouse data is used mainly for enterprise decision analysis, so the data operations involved are mostly queries. Once a piece of data enters the warehouse it is generally retained for a long time: the warehouse sees a large number of query operations but very few modifications or deletions, usually only periodic loading and refreshing. The data in an operational database, by contrast, is updated in real time, changing as the business requires.

(4) Reflecting historical changes

The data in the data warehouse usually contains historical information. The system records the enterprise's information from some point in the past (for example, the time the data warehouse was deployed) up to the present; from this information, the enterprise's development and future trends can be analyzed quantitatively and predicted. An operational database, by contrast, is mainly concerned with the current data within a given period.

2. Data warehouse architecture

  • Data warehouse construction process
    [Figure: data warehouse construction process (DB → ETL → ODS → DW → DM)]

(1) DB (abbreviation for Database)

Before talking about databases, let's clarify a concept: the relationship between a database and a database management system.
We often casually refer to MySQL, SQL Server, Oracle, and so on as "databases". Strictly speaking, that name is wrong: these are all software products, and the correct term for them is RDBMS (Relational Database Management System). The "relational" qualifier is there because these systems manage relational data; correspondingly, there are also non-relational database management systems, such as HBase, MongoDB, and Redis.

We can refer to all of these database management systems collectively as DBMS (Database Management System).

Let's take a look at the structure of these database management systems.
[Figure: structure of a database management system]

The following table compares a traditional operational database with a data warehouse:

| Comparison item | Traditional database | Data warehouse |
| --- | --- | --- |
| Data content | Current values | Historical, archived, integrated, derived (processed) data |
| Data purpose | Business operations, repetitive processing | Subject-oriented, analytical applications |
| Data characteristics | Dynamic, updated continuously | Static; not updated directly, only loaded and refreshed periodically |
| Data structure | Highly structured, complex, suited to operational computing | Simple, suited to analysis |
| Usage frequency | High | Low |
| Data access | Each transaction accesses a few records | Each query typically accesses a large number of records |
| Response time | Small time units (seconds or even milliseconds) | Larger time units (minutes, hours) |

(2) ETL (Extract, Transform, Load)

As the name suggests, this process extracts data from source systems and loads it into the target, much like an ordinary read-and-write process but with an additional transformation step in the middle. Why is that step necessary? In practice, different source systems often store the same information inconsistently (different encodings, types, or units), and these inconsistencies must be resolved before loading. Consider the case below:
[Figure: ETL example]
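
As a minimal sketch of such a transformation (the table and column names here are hypothetical, not from the original article), an ETL step might read rows from a source table, unify an inconsistent encoding, and load the result into the ODS layer:

```sql
-- Hypothetical ETL step: extract from a source CRM table, transform
-- inconsistent encodings, and load into the ODS layer.
INSERT INTO ods_customer (customer_id, gender, signup_date)
SELECT
    c.id,
    -- unify the gender encoding: this source uses 1/0, the target uses M/F
    CASE c.sex WHEN '1' THEN 'M' WHEN '0' THEN 'F' ELSE 'U' END,
    -- unify the date type
    CAST(c.created_at AS DATE)
FROM src_crm_customer c
WHERE c.created_at >= DATE '2020-01-01';   -- incremental extraction window
```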

(3) ODS (Operational Data Store, the operational data layer)

As the name implies, and as the data warehouse construction flow chart above shows, the data in this layer is the clean data produced by ETL. Compared with the original business databases (the DB level), the data in this layer has unified definitions, and missing values have been filled in. The three stages we have seen so far (DB, ETL, ODS) are essentially what the data middle platform layer is responsible for.

(4) DW (Data Warehouse)

Through the previous steps we have obtained relatively clean data. Now we need to organize it, which is really modeling: the attributes that describe an entity are extracted into dimension tables (dim), while the values that quantify business output (sales amount, order count, and so on) are treated as measures and placed in fact tables (fact). Two common models are shown below.

  • Star model
    [Figure: star schema]

  • Snowflake model
    [Figure: snowflake schema]
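
As a concrete sketch of the star model (all table and column names here are hypothetical), a sales fact table references each of its dimension tables directly:

```sql
-- Hypothetical star schema: one central fact table joined directly
-- to its dimension tables.
CREATE TABLE dim_date    (date_key INT PRIMARY KEY, full_date DATE,
                          year INT, quarter INT, month INT, day INT);
CREATE TABLE dim_product (product_key INT PRIMARY KEY,
                          product_name VARCHAR(100), category VARCHAR(50));
CREATE TABLE dim_store   (store_key INT PRIMARY KEY,
                          store_name VARCHAR(100), region VARCHAR(50));

CREATE TABLE fact_sales (
    date_key     INT REFERENCES dim_date(date_key),
    product_key  INT REFERENCES dim_product(product_key),
    store_key    INT REFERENCES dim_store(store_key),
    sales_amount DECIMAL(12,2),   -- additive measure
    quantity     INT              -- additive measure
);
```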

A data warehouse will include one or more fact tables. The "measures" contained in a fact table are of two types: additive (cumulative) measures and non-additive measures.

From the perspective of different purposes, fact tables can be divided into three categories, namely transaction facts, periodic snapshots, and cumulative snapshots.

Dimension : the specific angle from which people observe the data; it is a kind of attribute considered when analyzing a problem, and a set of such attributes constitutes a dimension. Typical dimensions include date and region.

Slicing : a technique that restricts the analysis space to a subset of the warehouse's data along a single dimension.

Dicing : a technique that restricts the analysis space to a subset of the warehouse's data along multiple dimensions at once.
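
In SQL terms (reusing the hypothetical schema sketched above), a slice fixes a value on one dimension, while a dice restricts several dimensions at once:

```sql
-- Slice: restrict a single dimension (one quarter).
SELECT p.category, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
WHERE d.year = 2020 AND d.quarter = 3
GROUP BY p.category;

-- Dice: restrict several dimensions at once (time, region, and category).
SELECT d.month, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_store s   ON f.store_key = s.store_key
JOIN dim_product p ON f.product_key = p.product_key
WHERE d.year = 2020 AND d.quarter = 3
  AND s.region = 'East'
  AND p.category IN ('Books', 'Electronics')
GROUP BY d.month;
```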

Star pattern : generally considered the best design pattern for data warehouse applications. It is so named because its physical representation has a central entity, which typically holds the indicator (measure) data, surrounded by radiating entities, which are usually the dimensions that help browse and aggregate the indicator data. The star model yields a query-oriented data structure that can respond quickly to user queries, and it typically produces a two-layer model consisting of dimension data and indicator data.

Snowflake mode : an extension of the star model. A star model usually produces a two-layer structure, i.e. only dimensions and indicators, whereas a snowflake model produces additional layers. In practice, data warehouse construction usually expands to at most three layers: dimensions (dimension entities), indicators (indicator entities), and related descriptive data (category-detail entities). Snowflake models with more than three layers should be avoided in a data warehouse system, because they tend toward the normalized structures that suit OLTP applications rather than the denormalized structures optimized for data warehouse and OLAP applications.
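
Continuing the hypothetical schema above, a snowflake variant normalizes a dimension into sub-dimension tables, at the cost of an extra join in every query that needs the category details:

```sql
-- Hypothetical snowflake variant: category details are moved out of
-- dim_product into their own table, adding one more join hop.
CREATE TABLE dim_category (
    category_key  INT PRIMARY KEY,
    category_name VARCHAR(50),
    department    VARCHAR(50)
);

CREATE TABLE dim_product_snowflaked (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INT REFERENCES dim_category(category_key)
);
```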

Granularity : granularity directly determines the level of detail at which the warehouse can support decision making. The larger the granularity, the coarser the data in the warehouse; the smaller, the finer. Granularity is tied to specific indicators, and it shows up concretely in the dimension values of the hierarchical dimensions that describe an indicator. For example, the time dimension can be divided into year, quarter, month, week, day, and so on.

The granularity of the data stored in the warehouse model affects many aspects of the information system. The finest level recorded for each dimension of the fact table determines whether the stored data can satisfy the analytical requirements, while the granularity levels chosen for aggregate tables directly affect query response time.
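
For example (hypothetical names again), a fact table stored at day grain can be rolled up to any coarser time grain, and a pre-aggregated table at a coarser grain trades detail for response time:

```sql
-- Day-grain facts roll up to coarser grains on demand.
SELECT d.year, d.month, SUM(f.sales_amount) AS monthly_sales
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.year, d.month;

-- A pre-aggregated, coarser-grain table answers monthly questions faster,
-- but can never answer a daily question.
CREATE TABLE agg_sales_monthly AS
SELECT d.year, d.month, f.product_key, SUM(f.sales_amount) AS sales_amount
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.year, d.month, f.product_key;
```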

Measures : in a cube, a measure is a set of values based on a column in the cube's fact table, and it is usually numeric. Measures are the central values of the cube being analyzed, that is, the numeric data (such as sales, gross profit, cost) that end users look at when browsing the cube.

(5) DM (Data Mart)

A data mart is a subset of the DW built around a specific business application; it cares only about the data it needs and does not consider the enterprise's overall data architecture and applications.
To be honest, the definition of this layer is fairly vague; the name itself is the best guide. The mart is the data closest to ordinary users. For example, when you shop in a store, you can only see the products displayed on the shelves; you do not fetch them from the supermarket's back warehouse (corresponding to our data warehouse) yourself. Ordinary users cannot walk into the warehouse to take things, and the same reasoning applies here.

Fact : an information unit in the data warehouse, a cell in the multidimensional space delimited by the analysis units. Facts are stored in a table (when a relational database is used) or as cells of a multidimensional database. Each fact includes basic information about itself (sales amount, sales volume, cost, gross profit, gross margin, etc.) and is related to the dimensions. In some cases, when all the necessary information is stored in the dimensions, the mere occurrence of the fact is enough information for the data warehouse.

OLAP

  • Official explanation:
    OLAP (Online Analytical Processing)
    enables analysts to observe information quickly, consistently, and interactively from many angles, in order to gain a deep understanding of the data. It is characterized by FASMI (Fast Analysis of Shared Multidimensional Information), i.e. fast analysis of shared multidimensional information. F stands for Fast, meaning the system can respond to most user analysis requests within a few seconds; A stands for Analysis, meaning users can define new ad-hoc calculations without programming and receive reports in the form they want; S stands for Shared, meaning the data can be shared securely among many users; M stands for Multidimensional, meaning the system provides multidimensional views and analysis of the data; I stands for Information, meaning the system can deliver information in a timely manner and manage large volumes of information.
  • The difference between it and OLTP:
    OLTP (Online Transaction Processing), also called transaction-oriented processing, has as its basic characteristic that user data received at the front end can be transmitted immediately to the computing center for processing, with the result returned in a very short time; it is one of the ways to respond quickly to user operations. The table below contrasts the two:

| Comparison item | OLTP | OLAP |
| --- | --- | --- |
| Users | Operators, front-line managers | Decision makers, senior managers |
| Characteristics | Simple transactions, day-to-day operational processing | Complex queries, analysis and decision support |
| DB design | Application-oriented, operational database | Subject-oriented, analytical data warehouse |
| Data accessed | Reads/writes dozens of records | Reads millions of records |
| Timeliness | Real-time | No strict real-time requirement |
| Application | Database | Data warehouse |
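
To make the contrast concrete (hypothetical tables again), an OLTP statement touches a handful of rows in real time, while an OLAP query scans and aggregates a large volume of history:

```sql
-- OLTP: a short transaction that touches a few records and must finish fast.
UPDATE account SET balance = balance - 100.00 WHERE account_id = 42;

-- OLAP: an analytical query that aggregates millions of historical records.
SELECT d.year, s.region, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d  ON f.date_key = d.date_key
JOIN dim_store s ON f.store_key = s.store_key
GROUP BY d.year, s.region
ORDER BY d.year, s.region;
```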

Apache Kylin

Apache Kylin™ is an open-source distributed analysis engine that provides a SQL query interface and multidimensional analysis (OLAP) capability on Hadoop, supporting extremely large-scale data. Originally developed by eBay Inc. and contributed to the open-source community, it can query huge Hive tables with sub-second latency. The secret of this low latency is pre-computation: for a data cube with a star topology, Kylin pre-computes the measures for the various combinations of dimensions, saves the results in HBase, and exposes JDBC, ODBC, and REST API query interfaces, enabling real-time queries.
[Figure: Apache Kylin architecture]

As shown in the figure above, Kylin obtains its data from Hadoop Hive, then uses the Cube Build Engine to build that data into an OLAP cube and save it in HBase. When a user executes a SQL query, the Query Engine parses the SQL statement into an OLAP cube query and returns the result to the user.
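
The queries Kylin serves are ordinary aggregate SQL. A hypothetical example over the schema sketched earlier, which Kylin would answer from the pre-computed cube instead of scanning Hive:

```sql
-- A typical aggregate query that a Kylin cube can answer in sub-second time,
-- provided the cube was built with these dimensions and this measure.
SELECT d.year, p.category, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.year, p.category;
```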

Cube

After the corresponding model is built in Kylin, Kylin analyzes the model, pulls the corresponding data from the underlying data warehouse, enumerates the combinations of the dimensions, pre-computes the aggregate value of the measures for each of these dimension combinations, and builds a data cube like the one below.
[Figure: example data cube]

Once the measures have been pre-aggregated for the different dimension combinations, the results are stored in HBase (a column-family database). A column-family store is used because the data volume is very large: typically we take the SQL submitted by a user, analyze the logic and filter conditions in it, and use those conditions as a key to quickly locate the pre-computed values in HBase and read them out directly. With a row-oriented database, lookups would incur a great deal of unnecessary query overhead. The ideal state of a cube is that every user SQL query can be answered directly from the pre-computed results rather than recomputed. That is the ideal; in practice we only need to analyze the frequently used queries and fold their conditions into the pre-computation. When a SQL query can be answered from the cube without computation, it is called a hit, and the higher a cube's hit rate, the better. This is the function of Kylin's cube.
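
Conceptually, the pre-computation enumerates the same aggregates that a SQL CUBE expression produces (Hive spells it GROUP BY ... WITH CUBE); a hypothetical sketch over the earlier schema:

```sql
-- What cube pre-computation enumerates, expressed in standard SQL: every
-- combination (cuboid) of the chosen dimensions with its aggregated measure.
-- Kylin materializes these results ahead of time instead of computing them
-- at query time.
SELECT d.year, p.category, s.region, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_store s   ON f.store_key = s.store_key
GROUP BY CUBE (d.year, p.category, s.region);
-- With 3 dimensions this yields 2^3 = 8 cuboids, from
-- (year, category, region) down to the grand total.
```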

Origin blog.csdn.net/qq_42359956/article/details/109256831