With so many OLAP engines available, why did Madai Wealth choose Kylin for self-service analytics?

Background of the project

Madai Wealth (formerly Madai Licai) was founded at the end of December 2014 as an information-intermediary lending platform controlled by CITIC Industrial Fund. After four years of steady and rapid development, its cumulative transaction volume has reached 75 billion, making it one of the leading platforms in the industry. The traffic growth brought exponential data growth, and the original approach to data analysis could no longer meet the needs of the business:

Time-consuming process: requirements with complex logic may involve developers, product managers, BI analysts and other stakeholders, and are finalized only after repeated rounds of communication. Too many participants raises communication costs and lengthens the project cycle

Wasted resources: to drive sales growth on the platform, the operations team designs various short-term product promotions and user-activation campaigns. Each campaign's analysis has to be set up again from scratch, and analyses that are not productized often waste analysts' effort

Cluster pressure: some complex metrics require long-term monitoring and must be queried repeatedly every day. Hundreds of ad-hoc SQL jobs submitted to the cluster daily create heavy computational pressure and degrade cluster performance

Slow queries: as data volume grows, a single aggregation SQL often takes several minutes, which cannot satisfy data analysts' need for fast responses

In response to these pain points, we looked for a tool that could give users efficient, stable and convenient data analysis capability.

Why Kylin

We surveyed the mainstream OLAP engines on the market and compared them in detail.

Our company's business requirements were:

Data is mainly offline, updated T+1

Must integrate with Tableau to enable self-service analysis

About 30 commonly used dimensions and about 100 metrics, combinable arbitrarily, covering 80%+ of fixed and ad-hoc requirements

Business users need to observe user lifecycle characteristics from multiple perspectives, which involves large data volumes but still requires fast responses

We chose Kylin as our OLAP analysis engine for the following reasons:

Kylin uses precomputation, trading space for time, enabling sub-second responses to user queries

It integrates with our existing BI tool, Tableau, to enable self-service analysis

Requirements that used to take a week now produce results in minutes, improving development efficiency more than 10x
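The "space for time" trade-off can be illustrated with a toy example: aggregates are computed once at build time, so a query becomes a dictionary lookup instead of a scan over raw rows (a minimal sketch, not Kylin's actual storage format; the data is made up):

```python
from collections import defaultdict

# Raw fact rows (city, investment amount) -- in Kylin these would live in Hive.
facts = [("Beijing", 100), ("Shanghai", 250), ("Beijing", 50), ("Shenzhen", 80)]

# Build time: precompute the (city) aggregate once, trading storage for speed.
cuboid_city = defaultdict(int)
for city, amount in facts:
    cuboid_city[city] += amount

# Query time: "total investment per city" is now a constant-time lookup,
# not a scan over the raw rows.
print(cuboid_city["Beijing"])   # 150
print(cuboid_city["Shanghai"])  # 250
```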

This article describes Madai Wealth's self-service analytics project built on its CDH big data platform: how Apache Kylin was applied to real scenarios, the current state of its use, and the work we plan to do on Kylin in the future.

Technology Architecture

The deployment is split into a production environment and a pre-production environment. The production environment serves analysis: precomputation runs on the production Hive cluster, and the precomputed results are stored in HBase. To add a new Cube, an analyst first works in the pre-production environment, and the manually tuned Cube is then migrated to production.

Madai Wealth's self-service analytics architecture is shown below:

Data synchronization: Sqoop (offline), Kafka (near real time)

Data source: Hive (offline), Streaming Table (near real time)

Compute engine: MapReduce / Spark

Precomputed result storage: HBase

Self-service analysis tool: Tableau

Scheduling system: Azkaban

The Apache Kylin Solution

The company's business is very complex. The data team abstracts the business requirements at a high level, determines the dimensions and measures, and builds a single Cube. On top of that Cube we form a common data-product platform, freeing data analysts from repetitive work.

Offline builds in Kylin

(1) Data Modeling

Data modeling is the most important part of implementing Kylin. A star schema is the usual model for relational data, but in practice, because of high-cardinality and diverse, changing dimension tables, we generally flatten the data into a wide table and build the OLAP model on top of it. Modeling on a wide table not only solves the data-granularity problem, it also avoids the performance cost of multi-table joins and sidesteps problems with changing dimensions and ultra-high-cardinality dimensions.

The data and business characteristics of each line of business determine the usage scenarios, model design and optimization approach for Kylin:

Data scale and model characteristics. In terms of scale, the wide table holds nearly a billion rows, with daily increments of more than ten million. Through OLAP modeling we abstract the business metrics at a high level, defining the dimensions, the measures, their relationships and the granularity of the underlying data.

Dimension cardinality. Ideally dimension cardinality is small, but some of our dimensions have cardinalities from one million to nearly ten million, which is inherent to the business and the industry. In addition, some metrics are used across Cartesian-product combinations of dimensions, making the OLAP model hard to simplify and the corresponding data-generation overhead relatively large.

Dimension structure. Some dimensions are hierarchical: the region dimension contains province and city, and the time dimension must be broken down to the day, increasing the complexity of the dimensions.

Metrics that are also dimensions. Some fields serve as both measure and dimension. For example, to analyze the behavior of users whose investment amount is 0, we must both filter on the investment amount and aggregate it, so the investment amount acts as a dimension and a measure at the same time.
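A field used both ways, such as an investment amount, can be sketched in plain Python (the rows and field names are illustrative):

```python
# Toy rows: 'invest' acts as both a dimension (filter) and a measure (sum).
rows = [
    {"user": "u1", "invest": 0},
    {"user": "u2", "invest": 500},
    {"user": "u3", "invest": 0},
    {"user": "u4", "invest": 1200},
]

# As a dimension: segment users by whether they have invested at all.
non_investors = [r["user"] for r in rows if r["invest"] == 0]

# As a measure: total investment amount over all users.
total_invest = sum(r["invest"] for r in rows)

print(non_investors)  # ['u1', 'u3']
print(total_invest)   # 1700
```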

(2) Kylin Cube Design

Starting from the Kylin Cube model, and because the Cube must serve many scenarios with flexible combinations of dimensions, we confirmed the Cube's dimensions and measures through repeated communication with business analysts.

Cube model overview:

19 dimensions: including province, operating system, device model, gender, card-binding status, investment grade, etc.

10 measures: including data volume, number of visits, number of registered users, page views, investment amount, annualized amount, etc.

Incremental builds: the Cube's source data grows by three million rows per day; after one day's build the Cube data size is 87.79 GB

(3) Cube Design Optimization

Common performance problems during Cube builds include slow SQL queries, builds that take too long or even fail, and excessive Cube expansion rates. Most of these problems trace back to improper Cube design, so reasonable Cube optimization is particularly important.

Optimization:

Dimension reduction: remove dimensions that never appear in queries, such as the data-creation date

Mandatory dimensions: mark dimensions required by every query as mandatory (Mandatory Dimensions)

Hierarchy dimensions: set hierarchical dimensions (province/city, or dates) as hierarchy dimensions (Hierarchy Dimensions)

Joint dimensions: set combinations of dimensions that users query together as joint dimensions (Joint Dimensions)

Aggregation-group adjustment: use multiple aggregation groups, each with its own set of joint dimensions, and place dimensions that never appear in the same query into different aggregation groups

Rowkey ordering adjustment: put high-cardinality dimensions that appear in filter conditions at the front, and put frequently used dimensions before the others
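To see why this pruning matters, we can count cuboids. Without pruning, an n-dimension Cube has 2^n cuboids; a mandatory dimension contributes 1 choice instead of 2, a k-level hierarchy contributes k+1 choices instead of 2^k, and a joint group contributes 2 choices instead of 2^k. A rough sketch of the arithmetic (the dimension split below is illustrative, not our exact Cube):

```python
def cuboid_count(free, mandatory, hierarchies, joints):
    """Estimate cuboid count after Kylin-style pruning.

    free        -- ordinary dimensions (present or absent -> 2 choices each)
    mandatory   -- mandatory dimensions (always present -> 1 choice each)
    hierarchies -- list of hierarchy sizes (a k-level hierarchy -> k+1 choices)
    joints      -- list of joint-group sizes (a joint group -> 2 choices)
    """
    count = 2 ** free
    for k in hierarchies:
        count *= k + 1
    for _ in joints:
        count *= 2
    return count  # mandatory dims multiply by 1, so they drop out

# Unpruned 19-dimension Cube:
print(cuboid_count(19, 0, [], []))    # 524288

# Illustrative pruning: 1 mandatory date, a 2-level province/city hierarchy,
# one 3-dimension joint group, 13 ordinary dimensions remaining.
print(cuboid_count(13, 1, [2], [3]))  # 8192 * 3 * 2 = 49152
```

Even this modest pruning cuts the cuboid count by an order of magnitude, which is where the build-time and storage savings come from.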

(4) Cube Optimization Results

Following the optimization approach above, we set assist_date as a mandatory dimension, set province and city as a hierarchy dimension, and then ordered the rowkey by usage frequency and cardinality. The final optimization results are as follows:

Query performance: sub-second response

Build time: reduced by 31%

Cube size: reduced by 42%

Query performance detail: business table of 1,000,000,000 rows

Example SQL: total investment amount per city

Near-real-time incremental builds in Kylin

To reduce OLAP analysis latency, we added Streaming Tables to achieve near-real-time analysis in Kylin: Kylin uses the Streaming Table as its data source, and the Streaming Table consumes data from Kafka. Each such model adds a timestamp field as its time series. In practice, we tuned the following model parameters:

kylin.cube.algorithm=inmem

kylin.cube.algorithm.inmem-concurrent-threads=8

kylin.cube.max-building-segments=600
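For reference, the records a Streaming Table consumes from Kafka are flat JSON carrying the timestamp column Kylin uses to place each event in a time segment. A minimal sketch of building one such record (field names are illustrative, and the actual Kafka producer call is omitted):

```python
import json
import time

# A Streaming Table row: flat JSON with an event timestamp in milliseconds,
# which Kylin uses to assign the record to a time segment.
event = {
    "event_time": int(time.time() * 1000),  # timestamp column for the time series
    "city": "Shanghai",
    "os": "Android",
    "invest_amount": 500,
}

message = json.dumps(event).encode("utf-8")  # bytes ready for a Kafka producer
print(json.loads(message)["city"])  # Shanghai
```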

Integrating Kylin with Tableau

We adopted Kylin 2.4.0 and Tableau 9.0. With precomputed results already available, we combined Kylin with Tableau in the hope of giving data analysts a more convenient and faster way to analyze data on their own.

Install the Kylin ODBC Driver matching the Tableau version on the Tableau machine, select the Kylin ODBC Driver as Tableau's connection to Kylin, then choose the data source and join the fact table and dimension tables exactly as they are joined in the Kylin Cube model. Analysts can then do drag-and-drop ad-hoc queries, drill down, roll up, pivot and so on. Freed from writing lengthy SQL and waiting through long-running jobs, analysts can explore the data as they need. One usage scenario, showing the number of active users per region, is shown below:

Implementation Experience

1) Dragging dimensions in Tableau returns results slowly

Solution: checking kylin.log, the slowest query was select * from fact. To make this kind of SQL fail fast, modify the following parameters in kylin.properties:

kylin.query.max-scan-bytes: set to a smaller value

kylin.storage.partition.max-scan-bytes: set to a smaller value

2) Calculated fields created in Tableau must be included in the Cube. If the Cube does not include a calculated field, Tableau shows a communication error, because the Cube contains no precomputed value for it.

3) Errors when using near-real-time incremental builds:

Solution: this is caused by a version mismatch between Kylin 2.4.0 and Kafka 3.0.0; downgrading Kylin to version 2.3.2 resolves it.

4) Field type conversion: when converting double-type data to String, one decimal place is kept automatically, e.g. 112 becomes 112.0, so the join fails to match.

Solution: when the value to convert is integer-valued, first cast it to bigint and then convert to a string; values stored as bigint keep no decimal part and are not displayed in scientific notation.
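The pitfall and the fix can be reproduced in a few lines of Python (Hive's CAST behaves analogously for double vs. bigint):

```python
# A double rendered as a string keeps its decimal part,
# so it no longer matches the plain integer string on the other side.
user_id_double = 112.0
assert str(user_id_double) == "112.0"     # join key becomes '112.0'
assert str(user_id_double) != "112"       # ...and never matches '112'

# Fix: cast to an integer type (bigint in Hive) before stringifying.
assert str(int(user_id_double)) == "112"  # join key matches again
```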

5) Data skew caused by null values: user_id in the visitor-behavior table can be null, so joining it with user_id in the user table runs into a data-skew problem.

Solution: replace null user_id values with strings containing a random number. The skewed rows are then distributed across different reducers, and since the substituted values match nothing in the other table, the final result is unaffected.
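A sketch of the idea in Python, imitating how the join key would be rewritten before a reduce-side join (names and data are illustrative):

```python
import random

def salt_null_key(user_id):
    """Replace a null join key with a unique random string so the skewed
    null rows spread across reducers; salted keys match nothing in the join."""
    if user_id is None:
        return "null_" + str(random.randint(0, 10**9))
    return str(user_id)

behavior = [None, None, 42, None, 7]  # visitor rows, many with null user_id
keys = [salt_null_key(u) for u in behavior]

# Non-null keys survive unchanged; null keys become distinct salted strings.
assert keys[2] == "42" and keys[4] == "7"
assert all(k.startswith("null_") for i, k in enumerate(keys) if behavior[i] is None)
```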

6) The Kylin ODBC Driver to install must match the Tableau version's bitness, not the operating system's. For example, on 64-bit Windows running 32-bit Tableau, you must install the 32-bit ODBC driver.

Future Plans

Kylin has brought us great convenience, saving time and effort on queries. As the technology evolves there are still problems to solve and optimizations to explore. For example, Kylin's support for detail-level queries is not ideal, yet we sometimes need to query detail data; and when a Cube is deleted, its HBase tables are not deleted automatically, which affects query performance and requires manual cleanup.

Origin blog.csdn.net/oZuoLuo123/article/details/88962554