Author｜Cheng Wei, MetaApp big data R&D engineer

GitHub ｜https://github.com/ByConity/ByConity

ByConity is ByteDance's open source cloud-native data warehouse. It meets data warehouse users' needs for elastic scaling, read-write separation, resource isolation, and strong data consistency, while also providing excellent query and write performance.

MetaApp is a leading game developer and operator in China, focusing on the efficient distribution of mobile information and committed to building a virtual world for all ages. As of 2023, MetaApp has more than 200 million registered users, has collaborated on 200,000 games, and has a cumulative distribution volume of over 1 billion.

MetaApp followed ByConity from its early open source days and was among the first users to test it and run it in production. To understand what open source data warehouse projects could offer, the MetaApp big data R&D team ran a preliminary test of ByConity. Its storage-compute separation architecture and strong performance, especially its support for complex queries over large-scale data in log analysis scenarios, led MetaApp to test ByConity in depth. ByConity eventually fully replaced ClickHouse in production, reducing resource costs by more than 50%.

This article introduces the functions of the MetaApp data analysis platform, the problems encountered in business scenarios and their solutions, and the benefits that introducing ByConity brought to the business.

MetaApp OLAP data analysis platform architecture and functions

As the business grew and refined operations were introduced, product teams placed higher demands on the data department: querying and analyzing real-time data to adjust operation strategies quickly; running A/B experiments on small groups of users to verify the effectiveness of new features; and reducing the time and difficulty of data queries so that non-specialists can analyze and explore data independently. To meet these needs, MetaApp built an OLAP data analysis platform that integrates event analysis, conversion analysis, custom retention, user grouping, behavior flow analysis, and other functions.

The platform follows a typical OLAP architecture with two parts: offline and real-time.

In the offline scenario, we use DataX to load Kafka data into the Hive data warehouse and then generate BI reports, which are displayed with Superset;

In the real-time scenario, one pipeline uses GoSink to load data into ClickHouse, and the other uses CnchKafka to load data into ByConity. The OLAP query platform then queries the data from both.

Function comparison between ByConity and ClickHouse

ByConity is an open source cloud-native data warehouse developed based on the ClickHouse core and adopts a storage-computation separation architecture. Both have the following characteristics:

Very fast writes, suitable for ingesting large amounts of data; write throughput can reach 50-200 MB/s
Very fast queries; on massive data sets, query throughput can reach 2-30 GB/s
High data compression and low storage cost; the compression ratio can reach 0.2-0.3
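
To make these figures concrete, here is a minimal arithmetic sketch. It assumes the compression ratio means compressed size divided by raw size, and that the write rate is sustained for the whole load; the 10 TB raw volume and the function names are hypothetical, not MetaApp figures.

```python
def compressed_size_tb(raw_tb: float, ratio: float) -> float:
    """Size on disk, assuming ratio = compressed size / raw size."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    return raw_tb * ratio


def write_hours(raw_tb: float, mb_per_s: float) -> float:
    """Hours needed to ingest raw_tb at a sustained rate (1 TB = 1,000,000 MB)."""
    if mb_per_s <= 0:
        raise ValueError("mb_per_s must be positive")
    return raw_tb * 1_000_000 / mb_per_s / 3600


RAW_TB = 10.0  # hypothetical raw volume, not a MetaApp figure

# Compression ratios and write rates below are the ranges quoted in the text.
for ratio in (0.2, 0.3):
    print(f"ratio {ratio}: {compressed_size_tb(RAW_TB, ratio):.1f} TB on disk")
for rate in (50, 200):
    print(f"{rate} MB/s: {write_hours(RAW_TB, rate):.1f} h to ingest")
```

Under these assumptions, a lower ratio means less storage, so 0.2 is the better end of the quoted range.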

ByConity keeps the advantages of ClickHouse, maintains good compatibility with it, and adds improvements in read-write separation, elastic scaling, and strong data consistency. Both are suited to the following OLAP scenarios:

Datasets can be large - billions or trillions of rows
The data table contains many columns
Query only specific columns
Results must be returned in milliseconds or seconds

In an earlier talk, the ByConity community compared the two from a usage perspective.

While building the OLAP platform, we focused mainly on resource isolation, scaling out and in, complex queries, and support for distributed transactions.

Problems encountered when using ClickHouse

Problem 1: Reads and writes share the same resources and easily compete for them, so stable reading and writing cannot be guaranteed.

During peak business periods, data writes occupy a large share of IO and CPU resources, which makes queries slower. Conversely, heavy queries affect data writes in the same way.

Problem 2: Scaling out and in is troublesome and takes a long time

Long scaling cycle: Our machines sit in an IDC as a private cloud, so adding nodes takes extremely long. It takes one to two weeks from requesting a node to having it actually in service, which affects the business;
No fast scaling: After scaling out, data must be redistributed; otherwise the existing nodes remain under heavy pressure.
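
The redistribution cost can be estimated with a deliberately simplified model. The sketch below assumes data is spread evenly over nodes with local storage and should stay even after scaling out; the row count and the function name are hypothetical, and real rebalancing in ClickHouse depends on the sharding setup.

```python
def rows_to_move(total_rows: int, old_nodes: int, added_nodes: int) -> int:
    """Rows that must migrate to the new nodes to restore an even spread.

    Simplified model: every node should end up with
    total_rows / (old_nodes + added_nodes) rows, and the new nodes start empty.
    """
    if old_nodes <= 0 or added_nodes < 0:
        raise ValueError("old_nodes must be positive and added_nodes non-negative")
    return total_rows * added_nodes // (old_nodes + added_nodes)


TOTAL_ROWS = 1_000_000  # hypothetical table size

# Growing a 4-node cluster: the more nodes added, the larger the share to move.
for added in (1, 4):
    moved = rows_to_move(TOTAL_ROWS, old_nodes=4, added_nodes=added)
    print(f"4 -> {4 + added} nodes: move {moved} rows ({moved / TOTAL_ROWS:.0%})")
```

In this model, the share of data to move equals the share of nodes that are new, which is why scaling a shared-nothing cluster is slow on large tables.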

Problem 3: Operation and maintenance are cumbersome, and the SLA cannot be guaranteed during peak business periods.

Node failures often cause slow queries and delayed data writes (from a few hours to a few days);
Resources are severely short during peak business periods and cannot be expanded in the short term. The only option is to delete the data of some services in order to keep high-priority services running;
During off-peak periods, many resources sit idle and costs are inflated. Although we are in an IDC, machine purchases are still subject to cost control, so nodes cannot be added without limit. In addition, normal operation carries its own cost;
The cluster cannot make use of cloud resources.

Improvements after introducing ByConity

First, ByConity separates the computing resources for reads and writes, which keeps both read and write tasks relatively stable. If read capacity is insufficient, the corresponding resources can be expanded to make up the shortfall, including with cloud resources.

Second, scaling out and in is relatively simple and can be done within minutes. Because data is stored on HDFS/S3 distributed storage and compute is separated from storage, no data redistribution is needed after scaling out; new nodes can be used immediately.

In addition, cloud-native deployment makes operation and maintenance relatively simple:

HDFS and S3 are relatively mature and stable components, with support for scaling and mature disaster recovery solutions, so problems can be resolved quickly;
During peak business periods, the SLA can be guaranteed by quickly expanding resources;
During off-peak periods, costs can be reduced by shrinking storage/compute resources.

The use and operation of ByConity

ByConity cluster usage

Our platform now runs ByConity stably in business scenarios. Through a series of migrations, ByConity has completely taken over the data of the ClickHouse cluster and provides services stably. We built the ByConity cluster on the cloud using S3 plus K8s. We also use a scheduled scaling scheme that scales out at 10 a.m. and scales in at 8 p.m. on weekdays, so the extra resources are needed for only a little over ten hours a day. By our calculations, this approach reduces resource usage by about 40%-50% compared with buying yearly or monthly subscriptions outright. In addition, we are working on combining private cloud and public cloud to reduce costs and improve service stability.
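
The scheduled scaling policy can be sketched as follows. Only the weekday window (10 a.m. to 8 p.m.) comes from the text; the worker counts and function names are hypothetical, and in practice a scheduler such as a Kubernetes CronJob would apply the result to the cluster.

```python
from datetime import datetime

PEAK_START_HOUR = 10  # scale out at 10 a.m. on weekdays (from the article)
PEAK_END_HOUR = 20    # scale in at 8 p.m. on weekdays (from the article)
HOURS_PER_WEEK = 7 * 24


def desired_workers(now: datetime, base: int = 2, peak: int = 8) -> int:
    """Worker count for a given time; base and peak are hypothetical sizes."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    in_peak = PEAK_START_HOUR <= now.hour < PEAK_END_HOUR
    return peak if (is_weekday and in_peak) else base


def weekly_worker_hours(base: int = 2, peak: int = 8) -> int:
    """Worker-hours per week when scaling follows the schedule above."""
    peak_hours = 5 * (PEAK_END_HOUR - PEAK_START_HOUR)
    return peak * peak_hours + base * (HOURS_PER_WEEK - peak_hours)


scheduled = weekly_worker_hours()
always_on = 8 * HOURS_PER_WEEK  # keeping the hypothetical peak size all week
print(f"scheduled: {scheduled} worker-hours, always on: {always_on}")
print(f"saving: {1 - scheduled / always_on:.0%}")
```

The saving printed here depends entirely on the assumed base and peak sizes and ignores pricing differences between on-demand and subscription resources, so it illustrates the calculation rather than reproducing the 40%-50% figure.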

The figure below shows our current setup: the OLAP server runs joint queries across the ClickHouse cluster in the on-premises IDC and ByConity. In the short term, the ClickHouse cluster will remain in use as a transition for the businesses that still partly rely on it.

In the future, data queries and merges will run on-premises, while Kafka consumption will use cloud resources. When more capacity is needed, the vw_default and vw_write resources can be expanded on the cloud, making sensible use of public cloud resources to cover shortages. During off-peak periods, capacity is reduced to cut public cloud spending.

Comparison of ByConity and ClickHouse queries in business data

Test data set and resource configuration

Data volume: partitioned by date, 4 billion rows per day, 40 billion rows in total over 10 days
Table width: 2,800 columns

As can be seen from the above table:

The ClickHouse cluster used 400 cores and 2560 GB of memory for queries

The ByConity 8-worker cluster used 120 cores and 880 GB of memory for queries

The ByConity 16-worker cluster used 240 cores and 1760 GB of memory for queries
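
The relative size of these configurations is easier to see as ratios. The short sketch below uses only the core and memory figures quoted above; dividing evenly by the worker count is an assumption, since the text does not say how resources are split across workers.

```python
CLICKHOUSE = {"cores": 400, "mem_gb": 2560}
BYCONITY = {
    8: {"cores": 120, "mem_gb": 880},
    16: {"cores": 240, "mem_gb": 1760},
}


def share(value: float, baseline: float) -> float:
    """Fraction of the ClickHouse baseline that a ByConity setup uses."""
    return value / baseline


def per_worker(total: float, workers: int) -> float:
    """Resources per worker, assuming an even split across workers."""
    return total / workers


for workers, res in BYCONITY.items():
    cores = share(res["cores"], CLICKHOUSE["cores"])
    mem = share(res["mem_gb"], CLICKHOUSE["mem_gb"])
    print(
        f"{workers} workers: {cores:.0%} of the cores, {mem:.1%} of the memory, "
        f"{per_worker(res['cores'], workers):.0f} cores and "
        f"{per_worker(res['mem_gb'], workers):.0f} GB per worker"
    )
```

Both ByConity configurations work out to the same per-worker size under this assumption, so the 16-worker cluster is simply twice the 8-worker one.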

Summary of business SQL query results

The summary below uses average values. As it shows:

Conventional OLAP queries (deduplication, retention, conversion, and enumeration): ByConity matches the query performance of the ClickHouse cluster (400C, 2560G) at a relatively small resource cost (120C, 880G), and doubling the resources (240C, 1760G) doubles the query speed. If higher query speed is required, more resources can be added;
NOT IN filtering: may require a moderate resource cost (240C, 1760G) to achieve performance similar to the ClickHouse cluster (400C, 2560G);
Bitmap queries: may require greater resource costs to achieve performance similar to the ClickHouse cluster.

General query/event analysis query