From offline to real-time: Wuxi Xishang Bank’s data warehouse evolution practice based on Apache Doris


Author: Wu Jipeng, big data technology manager of Wuxi Xishang Bank

Edited by: SelectDB technical team

Introduction: To turn its data assets into value and to achieve comprehensive digital and intelligent risk management, the big data platform of Wuxi Xishang Bank has evolved from a Hive offline data warehouse to an Apache Doris real-time data warehouse. It now serves hundreds of real-time tables and hundreds of data service interfaces, with interface QPS reaching the millions. The new architecture solves the insufficient timeliness, high cost, and low efficiency of the offline data warehouse, speeds up queries by more than 10 times, and provides users with timely, effective, and secure data services and a better user experience.

Faced with the changes that emerging technologies such as big data, the Internet of Things, and artificial intelligence bring to the financial industry, Wuxi Xishang Bank places strong emphasis on developing its technology and big data capabilities. To turn its data assets into value and to achieve comprehensive digital and intelligent risk management, the bank established a big data platform based on its three-wing integrated technology layout of "online business, data-based risk control, and platform-based architecture". The platform manages the massive daily inflow of transaction records and credit application data and, through applications such as user portraits, real-time reports, and real-time risk control, provides users with more timely, effective, and secure data services and a better user experience.

The big data platform of Wuxi Xishang Bank has evolved from an offline data warehouse based on Hive to a real-time data warehouse based on Apache Doris. The architecture upgrade solved the insufficient timeliness, high cost, and low efficiency of the offline data warehouse and made queries 10 times faster, enabling the bank to perceive customer behavior sooner, gain timely insight into abnormal transaction behavior, and identify and prevent potential risks. This article describes in detail the evolution of Wuxi Xishang Bank's big data platform and the use of Apache Doris in real-time query, marketing service, risk control service, and other scenarios.

Big data offline data warehouse based on Hive

01 Demand scenario

Wuxi Xishang Bank built a big data offline data warehouse in the early stage, which mainly serves scenarios such as data reporting, data risk control, data operations, ad hoc queries and daily data retrieval. Demand scenarios include but are not limited to:

Data reporting: customer risk, EAST reporting, 1104, large concentration, credit reporting, interest rate reporting, anti-money laundering, basic financial data reporting, etc.
Data risk control: loan risk control indicators, user behavior indicators, anti-fraud, post-loan early warning, post-loan management, and other risk control work.
Data operations: provide regular batch data for BI business reports, the management cockpit, external channels, and various systems within the bank.
Ad hoc query and daily data retrieval: perform data analysis, data development and data extraction according to business needs.

02 Architecture and pain points

In the early offline data warehouse, data came mainly from Oracle, MySQL, MongoDB, Elasticsearch, and files. Tools such as Sqoop, Spark, external data sources, and Shell scripts extracted the data offline into the Hive offline data warehouse, where it was processed layer by layer through ODS, DWD, DWS, and ADS. The final output supported the application service layer.

Figure: Architecture and pain points

In recent years, with the development and expansion of Wuxi Xishang Bank's business, relevant business departments have increasingly higher requirements for data processing. The offline data warehouse can no longer meet the new needs, which is mainly reflected in:

Insufficient data timeliness: The offline data warehouse relies on offline extraction, so data timeliness is T+1. Reports, data dashboards, marketing indicators, and risk control variables, however, require real-time data updates, which this architecture cannot provide.
Low query efficiency: The business requires query responses at the second or even millisecond level. The execution engines of the offline data warehouse are mainly Hive and Spark. Hive decomposes a query into multiple MapReduce tasks that read and write data in HDFS, so execution time is generally at the minute level, which seriously hurts query efficiency.
High maintenance costs: The offline data warehouse sits on many technology stacks, including LDAP, Ranger, ZooKeeper, HDFS, YARN, Hive, Spark, and other systems, which makes the system expensive to maintain. HBase + Phoenix was also running online for real-time storage and services, but it could not fully solve the problem: its components are relatively "heavy", the community is not active, and some features cannot meet the needs of real-time scenarios.

Technology selection

Faced with the pain points of insufficient timeliness of offline data warehouses, low query efficiency, and high maintenance costs caused by multiple technology stacks, the construction of real-time data warehouses is imperative. After conducting in-depth research on multiple MPP databases, Wuxi Xishang Bank decided to build a real-time data warehouse platform with Apache Doris as the core. This technology selection aims to ensure that the platform can meet the high requirements of real-time business analysis at the data writing, query and service levels. The reasons for choosing Apache Doris are as follows:

Efficient data update: The Apache Doris Unique Key model supports large-batch data updates, real-time writing of small batches, and lightweight table schema changes. Especially when a large amount of data and many partitions are involved, it avoids huge and inaccurate modifications and makes data updates more convenient and closer to real time.
Low-latency real-time writing: Doris supports writing, updating, and deleting data in real time at the second level. The primary key model supports merge-on-write, which enables high-frequency real-time writing in micro-batches, and it supports a Sequence column to keep data ordered during import.
Excellent query performance: Apache Doris has powerful multi-table Join capabilities. Relying on its vectorized execution engine, CBO query optimizer, MPP architecture, intelligent materialized views, and other features, it can answer queries over massive data in milliseconds, which meets the requirement for instant data queries. In addition, Apache Doris 2.0 supports hybrid row-column storage and can sustain tens of thousands of concurrent point queries with millisecond-level responses.
Ease of use: Doris is compatible with the MySQL protocol and provides rich API interfaces, which makes it easier for upper-layer applications to use. It also has a streamlined architecture with only two processes, FE and BE, so scaling nodes out and in is simple, and cluster management and replica management are automated. It is simple to deploy and has low usage, operation, and maintenance costs.
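The effect of the Sequence column on a Unique Key table can be illustrated with a small simulation. This is a minimal Python sketch of the semantics only, not Doris code: the names `Row`, `upsert`, and `load` and the sample values are hypothetical, and the sketch assumes that for each key the row with the larger sequence value wins, with ties going to the later write.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Row:
    key: str      # primary key (hypothetical account id)
    seq: int      # sequence column value, e.g. a source-side update timestamp
    balance: int  # payload column


def upsert(table: Dict[str, Row], row: Row) -> None:
    """Keep one row per key: the one with the largest sequence value."""
    current = table.get(row.key)
    if current is None or row.seq >= current.seq:
        table[row.key] = row


def load(rows: Iterable[Row]) -> Dict[str, Row]:
    """Apply a batch of writes to an empty table in arrival order."""
    table: Dict[str, Row] = {}
    for row in rows:
        upsert(table, row)
    return table


# The same two updates, arriving in order and out of order.
in_order = [Row("acct-1", 1, 100), Row("acct-1", 2, 250)]
out_of_order = list(reversed(in_order))

# Both arrival orders converge to the newest version of the row.
print(load(in_order) == load(out_of_order))
```

Without the sequence comparison, the out-of-order batch would leave the stale balance in place, which is the situation the Sequence column is meant to prevent.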

Introducing Apache Doris to build a big data real-time data warehouse

In April 2022, Wuxi Xishang Bank introduced Apache Doris to build a real-time data warehouse platform. Because bank data is very large in scale, it is difficult to synchronize the full historical data from the business databases while also ingesting real-time data. The initial build of the real-time data therefore relied mainly on offline data.

First, the HDFS Broker method efficiently initializes the historical data of the real-time tables. At the same time, the collection tool DataPipeline collects data into the Kafka cluster in real time, and a hand-coded Flink job writes that data into Apache Doris in real time. Finally, through the interface service capabilities of the Feiliu platform, Apache Doris serves every business line as a unified storage and query engine.

The Feiliu platform is a unified, comprehensive platform that Wuxi Xishang Bank built to cope with future real-time business scenarios. It mainly comprises real-time collection, real-time synchronization tools, the real-time data warehouse, real-time computing, and data services.

Figure: Introducing Apache Doris to build a big data real-time data warehouse

01 Improve data flow links

Starting from the characteristics of bank data and combining the functional advantages of Apache Doris, Wuxi Xishang Bank has rethought and improved the data flow link:

Synchronize historical data from the offline data warehouse to minimize risk: As mentioned above, bank data is huge in scale. If the full historical data were synchronized directly from Oracle and MySQL, a large volume of data would flow through firewalls and switches, blocking other business requests and causing problems such as service timeouts. To avoid these risks, the Doris table structures are first created in batches based on Oracle and MySQL, and HDFS Broker is then used to synchronize the full T-1 data from the ODS layer of the offline Hive data warehouse to Doris.
Real-time incremental extraction with a safer extraction mode: Real-time extraction consumes a small amount of disk I/O, memory, and CPU. To avoid affecting the primary business database, data is extracted by default from a business replica database or the same-city disaster recovery database. For business needs with high timeliness requirements, a full evaluation is required before data may be extracted from the primary business database.
Build a Kafka layer to ensure data consistency: A Kafka layer serves as the intermediate data transmission layer to ensure data ordering and consistency. The key of each message sent by DataPipeline is configured as Database-Table-PK, so messages with the same key are sent in order to the same partition of the Kafka topic. Because each partition of a Kafka topic stores its messages in order, downstream consumers can process them in order, which prevents out-of-order data from affecting the accuracy of the real-time data warehouse. The Kafka layer also acts as a shared data layer for marketing, risk control, and other business scenarios.
Write data in real time without loss or duplication: In practice, the offline link runs its batch jobs from 11 pm to 6 am on the night of day T-1 and uses the HDFS Broker method at 10 o'clock on day T to initialize the tables' historical data. The real-time link uses Flink to start consuming the Kafka topic from 10 pm on day T-1. As a result, some data overlaps during real-time consumption. To handle this, the Unique Key model of Apache Doris is used; it makes writes idempotent, so overlapping data is simply overwritten. The Flink-Doris-Connector completes the real-time data warehouse link and keeps real-time synchronization consistent, so that data is neither duplicated nor lost.
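The last two points can be illustrated together with a small simulation. This is a minimal Python sketch rather than the bank's pipeline: the partition count, the sample keys, and the helper names are hypothetical, and `zlib.crc32` is only a stand-in hash (Kafka's default partitioner uses murmur2). It shows that keying messages by Database-Table-PK keeps each record's changes in order, and that replaying an overlapping stretch of the stream onto a snapshot gives the same final state when writes are upserts by key.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count for the topic


def message_key(database: str, table: str, pk: str) -> str:
    """Build the Database-Table-PK key described above."""
    return f"{database}-{table}-{pk}"


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in hash for illustration; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def apply_changes(state: dict, changes: list) -> dict:
    """Upsert each change by key, the way a Unique Key table keeps one row per key."""
    result = dict(state)
    for key, value in changes:
        result[key] = value
    return result


# Change stream in source order: (key, new loan status). All values are made up.
k1 = message_key("core", "loan", "1001")
k2 = message_key("core", "loan", "1002")
stream = [(k1, "applied"), (k2, "applied"), (k1, "approved"), (k1, "disbursed")]

# Every change for one key lands in one partition, so its order is preserved.
partitions = defaultdict(list)
for key, value in stream:
    partitions[partition_for(key)].append((key, value))
k1_history = [v for k, v in partitions[partition_for(k1)] if k == k1]

# The batch snapshot already contains the first three changes ...
snapshot = apply_changes({}, stream[:3])
# ... but the stream is replayed from an earlier offset, so changes overlap.
replayed = apply_changes(snapshot, stream[1:])
exactly_once = apply_changes(snapshot, stream[3:])

print(k1_history)
print(replayed == exactly_once)
```

The equality only holds because per-key order is preserved; if the overlapping changes for one key could arrive shuffled, an older value might overwrite a newer one, which is why the ordered Kafka layer and the upsert model are used together.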

02 Flexible data services

In order to provide accurate and efficient query responses, Wuxi Xishang Bank has adopted the following three methods to implement data services:

Offline data query: Offline requirements still need fast queries. Wuxi Xishang Bank regularly imports data from the offline data warehouse into Doris tables in the real-time data warehouse, so that offline data analysis and decision-making can use the fast queries of the real-time data warehouse.
Simple real-time requirements: For uncomplicated real-time requirements, Wuxi Xishang Bank relies on the efficient query capabilities of Apache Doris and lets users configure data service interfaces directly on the Feiliu platform. Users configure an interface manually by writing SQL against the ODS layer of the real-time data warehouse, so simple real-time data queries can be served quickly.
Complex real-time requirements: For complex real-time requirements, Wuxi Xishang Bank uses the real-time Kafka data stream and lightweight Flink computation to write the stream into DWD-layer tables of the real-time data warehouse. On the Feiliu platform, SQL over these detail tables then aggregates the data again, and a data service interface is configured manually to serve complex real-time data queries.

Facing more diverse service scenarios

01 BI report query response within seconds

Based on Apache Doris, Wuxi Xishang Bank meets the needs of multiple scenarios such as daily data analysis, daily data retrieval, and real-time BI reports. Query response time is greatly shortened, and query results can be returned within 1 second, which greatly reduces both the waiting time of data analysts and the consumption of server resources.

For example, for real-time BI reports, Wuxi Xishang Bank has built real-time loan data tables, real-time deposit data tables, account balance tables, and other reports. These reports average 253 lines of SQL code and have an average response time of 1.5 seconds. In addition, by optimizing query performance and data model design, Wuxi Xishang Bank can generate accurate real-time reports in a short time to provide timely data support for business decisions.

02 Support personalized marketing plans

For marketing data services, Wuxi Xishang Bank uses Apache Doris to enrich customer tags and refine customer portraits, and has run marketing campaigns such as the net asset increase campaign and the artist blind box campaign. By analyzing real-time data, the bank can observe the conversion of active users in a timely manner and promptly adjust its audience selection strategy, moving its marketing from "one face for a thousand people" to "a thousand faces for a thousand people".

For example, in campaigns such as the net asset increase campaign and the artist blind box campaign, Wuxi Xishang Bank uses the Apache Doris real-time data warehouse to continuously collect, analyze, and feed back campaign data. By observing user conversions in real time, the bank can promptly adjust its audience selection strategy so that the right people are matched with the right campaigns. This personalized marketing strategy allows the bank to better meet customer needs and to increase engagement, response rates, and user stickiness.

03 Efficient risk identification and control

The introduction of Apache Doris enables Wuxi Xishang Bank to compute risk control feature variables and detect abnormal transaction behavior faster. Taking new user registration as an example, while a user fills in information, the system can quickly determine the result of the approval strategy based on real-time risk control feature variables, optimize the strategy model in a timely manner, and ensure the quality and accuracy of approvals.

Wuxi Xishang Bank is also able to identify and prevent potential risks in a timely manner. For example, the bank can collect and monitor transaction data in real time, such as a large number of transactions within a short period or abnormal transaction amounts, to detect abnormal transaction behavior and fraud promptly. Through real-time data analysis, the bank can quickly identify potential risks and take appropriate preventive and responsive measures.

In addition, Wuxi Xishang Bank uses the Apache Doris real-time data warehouse to analyze customers' credit history and credit application information in real time. By quickly determining whether the amount a customer applies for matches their repayment ability, the bank can make timely risk assessments and decisions and effectively control credit risk.

04 Automatic updating of the seven-day transaction flow table

In practice, the transaction flow table holds a very large amount of data, including the transaction serial number, transaction date, transaction type, transaction amount, and other fields. To keep the data up to date, Wuxi Xishang Bank uses the dynamic partition feature of Apache Doris. This feature automatically creates partitions and automatically deletes transaction flow data older than seven days, so that the seven-day transaction flow table updates itself. The specific steps are:

Construct a pseudo column and make the business date part of the composite primary key;
When a record with a given ID has its tran_date updated across days, the code looks the record up in the table;
Find the Date value that the record was originally inserted with in the partitioned table, splice it into the Update JSON, and write the update to the database.

Figure: Automatic updating of the seven-day transaction flow table

With the dynamic partitioning of Apache Doris, the bank keeps the underlying primary key model and servers running stably, automatically rolls the table forward so that only seven days of transaction data are retained for analysts to query, and meets the 1.5-second query response requirement under one million QPS.
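The rolling window that dynamic partitioning maintains can be sketched in a few lines of Python. This is an illustration of the idea, not Doris internals: `partition_window` and `roll` are hypothetical helpers, the `p` + `YYYYMMDD` naming follows the convention Doris uses for daily dynamic partitions, and the offsets mirror the `dynamic_partition.start` and `dynamic_partition.end` table properties. The value -7 comes from the seven-day retention described above, while 3 days of pre-created partitions is an assumption.

```python
from datetime import date, timedelta
from typing import List, Set, Tuple


def partition_name(day: date, prefix: str = "p") -> str:
    """Name of the daily partition that holds `day`, e.g. p20240310."""
    return f"{prefix}{day:%Y%m%d}"


def partition_window(today: date, start: int = -7, end: int = 3) -> List[str]:
    """Daily partitions that should exist today: offsets start..end inclusive."""
    return [partition_name(today + timedelta(days=offset))
            for offset in range(start, end + 1)]


def roll(existing: Set[str], today: date,
         start: int = -7, end: int = 3) -> Tuple[List[str], List[str]]:
    """Return (partitions to create, partitions to drop) for today's window."""
    wanted = partition_window(today, start, end)
    to_create = [name for name in wanted if name not in existing]
    to_drop = sorted(name for name in existing if name not in wanted)
    return to_create, to_drop


# Partitions as they stood yesterday, then the roll-over for today.
yesterday = date(2024, 3, 9)
today = date(2024, 3, 10)
existing = set(partition_window(yesterday))
created, dropped = roll(existing, today)
print(created, dropped)
```

Each day exactly one new partition is added at the front of the window and the oldest one falls out, so expired transaction data is removed by dropping a whole partition instead of deleting rows.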

05 High concurrency point query

Early marketing and risk control applications relied mainly on two HBase clusters to serve point queries. In production, however, they ran into problems such as abnormal exits of the Master and RegionServer processes and regions stuck in transition (RIT). To avoid these problems, the bank takes advantage of the high-concurrency query capability of Apache Doris and enables the Merge-on-Write strategy when creating Unique Key tables, so that a primary key query follows a simplified SQL execution path and needs only one RPC to return a fast response.

Finally, stress tests were run on three nodes, each configured with 8 cores and 10 GB of memory, and yielded the following significant gains:

In a single-table query scenario with 50 million records, QPS reaches 25,000;
In a multi-table read and write scenario involving 50 million records, QPS reaches 20,000;
Complex SQL queries also hold steady at a high QPS of 25,000;
In the real-time multi-table read and write scenario, QPS also stabilizes at 25,000.
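As an illustration of the table setup behind such point queries, the following Python sketch renders a CREATE TABLE statement for a Merge-on-Write Unique Key table with row storage enabled. The helper `point_query_ddl`, the table `customer_tag`, and its columns are hypothetical. The property names `enable_unique_key_merge_on_write`, `light_schema_change`, and `store_row_column` follow the Apache Doris 2.0 documentation, but the bank's real schema and settings are not given in this article.

```python
from typing import Dict


def point_query_ddl(table: str, key: str, columns: Dict[str, str],
                    buckets: int = 8) -> str:
    """Render CREATE TABLE for a Merge-on-Write Unique Key table (illustrative)."""
    # The key column must come first in `columns`; dicts keep insertion order.
    column_lines = ",\n".join(
        f"    `{name}` {ctype}" for name, ctype in columns.items()
    )
    return (
        f"CREATE TABLE `{table}` (\n{column_lines}\n)\n"
        f"UNIQUE KEY(`{key}`)\n"
        f"DISTRIBUTED BY HASH(`{key}`) BUCKETS {buckets}\n"
        "PROPERTIES (\n"
        '    "enable_unique_key_merge_on_write" = "true",\n'
        '    "light_schema_change" = "true",\n'
        '    "store_row_column" = "true"\n'
        ");"
    )


# Hypothetical tag table keyed by customer id.
ddl = point_query_ddl(
    "customer_tag",
    "customer_id",
    {
        "customer_id": "BIGINT NOT NULL",
        "risk_level": "VARCHAR(16)",
        "updated_at": "DATETIME",
    },
)
print(ddl)
```

A lookup such as `SELECT * FROM customer_tag WHERE customer_id = ?` that filters on the full primary key is the kind of query that can take the simplified execution path described above.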

Conclusion

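At present, Apache Doris at Wuxi Xishang Bank serves hundreds of real-time tables and more than a hundred data service interfaces, with interface QPS reaching the millions. In addition, as a unified query gateway, Apache Doris has significantly improved the efficiency of historical data analysis: compared with the previous minute-level response times, queries are more than 10 times faster.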

In the future, Wuxi Xishang Bank will continue to explore the advantages of Apache Doris and promote its deeper application in real-time scenarios.

In terms of performance: further optimize high-concurrency queries, automatic partitioning and bucketing, the execution engine, and other capabilities to improve query response efficiency;
In terms of load balancing: build dual clusters to achieve load balancing at the architecture level, and improve the early warning and circuit breaker mechanisms to ensure uninterrupted business operations;
In terms of cluster stability: divide work among Apache Doris clusters so that each takes on its own tasks, such as computing and storage for the real-time data warehouse or accelerated queries for data services, to further improve the stability and reliability of the system.