From 2018 to 2022: TiDB in the eyes of a big data engineer

Foreword

Time is a merciless blade. I have written my memories, thoughts, understanding, and definition of TiDB over the past few years into a true story. As a witness to the growth of a popular domestic database, I have watched it grow all the way, and I hope the road ahead of TiDB keeps getting smoother.

 

2018

I first met TiDB in August 2018. While researching domestic data-analysis products, I was hunting for rare treasures on the Internet when TiDB jumped into view. In the 2017 Analysys OLAP Algorithm Competition, TiDB had actually won the championship. I had followed that competition, but I never imagined TiDB would win it. Who was TiDB? What did it do? The runner-up was ClickHouse, which we were all familiar with, so TiDB had been faster than ClickHouse; naturally, an engineer gets curious. Looking back at the topic of the 2017 competition, it was a user-behavior funnel-analysis problem: in the process of purchasing goods, the events that can be triggered include "start", "login", "search product", "view product", and "generate order", and the system generates the corresponding event data in the background. The business requirements were as follows:

(1) For January 2017, compute the conversion of users who triggered "search product", "view product", and "generate order" in sequence, with a time window of 1 day.

(2) For January and February 2017, compute the conversion of users who triggered "login", "search product", "view product", "generate order", and "pay order" in sequence, with a time window of 7 days, where the content attribute of the "search product" event is "Apple" and the price attribute of the "view product" event is greater than 5000.

According to the official introduction, the solutions existing on the market at the time computed this inefficiently once the data volume grew large, and the challenge was to solve it better in order to improve the user experience. The official 60-point baseline solution was: (1) use HDFS for the underlying storage; (2) create a Hive table partitioned by application ID, date, and event name; (3) query with Presto plus a custom UDAF, or implement the same logic with Spark Core.
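To make the problem concrete, here is a minimal SQL sketch of requirement (1), assuming a hypothetical events table with columns user_id, event_name, and event_time; real funnel engines generalize far beyond this fixed three-step self-join:

```sql
-- Funnel sketch: users who did search -> view -> order in sequence
-- within a 1-day window in January 2017. Table and columns are hypothetical.
SELECT COUNT(DISTINCT s.user_id) AS step1_search,
       COUNT(DISTINCT v.user_id) AS step2_view,
       COUNT(DISTINCT o.user_id) AS step3_order
FROM events s
LEFT JOIN events v
       ON v.user_id = s.user_id
      AND v.event_name = 'view product'
      AND v.event_time BETWEEN s.event_time
                           AND s.event_time + INTERVAL 1 DAY
LEFT JOIN events o
       ON o.user_id = v.user_id
      AND o.event_name = 'generate order'
      AND o.event_time BETWEEN v.event_time
                           AND s.event_time + INTERVAL 1 DAY
WHERE s.event_name = 'search product'
  AND s.event_time >= '2017-01-01'
  AND s.event_time <  '2017-02-01';
```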

Even the memory-based Presto solution barely passes the bar. My description of TiDB was recorded in my "XXX Report" from 2018, as follows:

Another data-services company we introduced earlier, Analysys International, held this data competition, and PingCAP won the championship; PingCAP's product is called TiDB. Strictly speaking, TiDB is not a data query engine but a database, a NewSQL system built on the ideas of Google's Spanner/F1 papers. TiDB's design goal is 100% of OLTP scenarios and 80% of OLAP scenarios, a foot in both boats, providing a one-stop solution across OLTP and OLAP.

My first impression of TiDB was of a dual-purpose system: its main business is OLTP, its side business OLAP. At the time I knew it was already open-sourced on GitHub, but the documentation and community were not yet mature, so I did not install and test it to dig deeper. The figure below shows the databases from that original research; back then there was no GaussDB and no OceanBase.

[Figure: databases covered in the 2018 research]

2019

In 2019, TiDB's marketing was very successful; everyone around me had heard of TiDB, but I still did not know its technical details. I only knew it was a big MySQL: what MySQL can do, it can do, and what MySQL cannot do, it can also do. Finally, one day I made up my mind to deploy a 3-node cluster with Ansible on my virtual machines, each node with 4 cores and 4 GB of RAM, and ran a TPC-H-based benchmark, turning theoretical understanding into practice and gaining a macro-level picture of TiDB through hands-on verification.

I had always wondered how TiDB won that competition in 2017. It was an OLAP contest, and TiDB had no TiFlash engine then; it consisted mainly of three parts: TiDB, TiKV, and PD. The TiDB server itself is a stateless computing engine, and presumably special tuning kept hot data active in TiDB while cold data was looked up in TiKV. For writing data down to disk, the storage layer TiKV uses RocksDB, which is compatible with LevelDB's original API and adds a series of optimizations over LevelDB: optimizations for SSDs and for multi-CPU, multi-core environments, plus capabilities LevelDB lacks, such as data merging, multiple compression algorithms, and data-management features like range queries. RocksDB is the performance ceiling of this stack. In the 2017 Analysys competition there was no TiFlash, and TiDB won the championship under those conditions.

The innovation behind TiDB since then decouples OLTP storage from OLAP storage. The performance ceiling of the OLTP side remains RocksDB, and as OLTP data is written, a copy is replicated to TiFlash. OLTP storage and OLAP storage are physically separated but logically unified to the outside world: toward application systems, TiDB exposes SQL access, so a traditional DBA can use TiDB just like MySQL.
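In today's TiDB this replication is a single DDL statement; as a sketch (the table name is hypothetical):

```sql
-- Ask TiDB to maintain one columnar TiFlash replica of a row-store table;
-- TiKV keeps serving OLTP while TiFlash serves analytical scans.
ALTER TABLE orders SET TIFLASH REPLICA 1;

-- Check replication progress; AVAILABLE = 1 once the replica is ready.
SELECT TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_NAME = 'orders';
```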

2020

TiDB attaches great importance to user experience and product quality. In 2020 it did two things I still remember vividly. The first was TiUP, a project for life-cycle management and maintenance of TiDB clusters. The project was initiated because research at the time showed that even R&D engineers at big tech companies would spend at least three hours installing a low-spec TiDB cluster; with TiUP, a TiDB environment can be deployed quickly. The second was the bug-hunting competition. Every product has bugs; TiDB was open and honest about it, offering rewards for finding bugs with no boundaries drawn, which attracted engineers from abroad. Among them, Dr. Manuel Rigger of ETH Zurich found the most bugs by applying NoREC, a state-of-the-art testing technique. Open source knows no borders. PingCAP is committed to evangelizing open-source technology while also seeking business opportunities in the market: the launch of TiDB Cloud has likewise been well received, building on TiDB's core technology to deliver the elasticity and agility of cloud computing and letting developers use TiDB more conveniently.

2021

 

In 2021, TiDB released version 5.0, which introduced an MPP architecture on top of the existing HTAP engine TiFlash, providing a distributed computing engine matched to the storage and further improving parallel computing and analysis over massive data. Because TiFlash shares the SQL front end with tidb-server, the parser and optimizer are shared too. TiDB gives the business layer a single entry point, automatically choosing between single-node execution and MPP mode, and it isolates transactional and analytical workloads so the two sides do not interfere with each other under high-concurrency pressure. At this point I became very curious about the working details of the optimizer inside the TiDB module: in version 5.0 it must recognize whether a statement is a point read or a batch operation, and it must also weigh TiFlash's data distribution and ongoing data management. Whether data is fetched from the OLTP side or the OLAP side, the intelligent-scheduling workload TiDB carries is heavier than before, and it takes even more cooperation with TiKV and PD to do the job well.
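As a sketch of how this engine choice surfaces in SQL, TiDB 5.0 exposes real system variables for it (the query and table are hypothetical TPC-H-style examples):

```sql
-- Let the optimizer consider TiFlash's MPP mode (TiDB 5.0+).
SET @@session.tidb_allow_mpp = 1;
-- Allow reads from all engines so the optimizer can pick per statement.
SET @@session.tidb_isolation_read_engines = 'tidb,tikv,tiflash';

-- EXPLAIN reveals the chosen path: a TableReader on TiKV for point reads,
-- or ExchangeSender/ExchangeReceiver operators when MPP on TiFlash wins.
EXPLAIN
SELECT o_custkey, SUM(o_totalprice)
FROM orders
GROUP BY o_custkey;
```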


In the TiDB Hackathon 2021 competition, the He3 team chose tiered storage of hot and cold data. In their design, hot data is stored on TiKV, while cold data, which has a low probability of being queried or analyzed, is stored in cheap, general-purpose cloud storage, S3; at the same time, the S3 storage engine supports pushing down some TiDB operators, so TiDB can run analytical queries over the cold data in S3. This is in fact the application scenario of the intelligent data-integration system the industry has been pursuing. An integrated system can recognize a wide range of data sources, including RDBMSs, NoSQL stores, distributed systems, and file systems, and can identify and connect the nearest storage system, so that the second and third accesses go straight to the nearest storage. The key technical challenge is intelligent scheduling that can verify consistency between the nearest storage system and the data source; in the same way, TiDB here must verify consistency between TiKV and S3, which places higher demands on TiDB.

I would define the conceptual scope of the TiDB module like this: it is a stateless distributed computing engine [covering both single-node and batch computation] that implements SQL and manages client connection sessions. The open-source world has Presto, but Presto is entirely bounded by the processing power of memory, and without a storage engine Presto cannot become a database. TiDB combined with PD and TiKV becomes a database, and a standalone TiDB module can be embedded and integrated into other systems, such as the hot/cold tiered storage above. Across the whole data-integration pipeline, TiDB acts as the intelligent scheduling and processing component, taking on the management and transmission of upstream and downstream data.

 

Understanding TiDB from TPC-H and TPC-DS

Xiaobai was born a farmer, resolved to change his destiny, and eventually opened a hardware e-commerce retail website. There are many people like Xiaobai in the world; they all want to ride the Internet wave and profit from the trend, though not necessarily in hardware: it might be takeout, toys, or daily necessities. Whatever is sold, there are at least independent base entities such as suppliers, users, and products, and on top of those entities there are records of product purchases and product sales. If we looked into the details there would also be product reviews, favorites, and marketing to consider, but we ignore them here. Eight tables, supplier, nation, region, customer, product, product supply, retail orders, and order line items, constitute the most basic elements of the e-commerce framework and can be regarded as the most general database design. With this normal-form design, whose structure is shown below, the e-commerce website goes live.

[Figure: the normal-form schema of the eight base tables]
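As a sketch, here are two of the eight tables written in TPC-H style (real TPC-H names, columns abbreviated):

```sql
-- Abbreviated TPC-H-style tables; the real schema has more columns.
CREATE TABLE orders (
    o_orderkey   BIGINT PRIMARY KEY,
    o_custkey    BIGINT NOT NULL,    -- references customer
    o_totalprice DECIMAL(15,2),
    o_orderdate  DATE
);

CREATE TABLE lineitem (
    l_orderkey      BIGINT,          -- references orders
    l_linenumber    INT,
    l_partkey       BIGINT,          -- references part
    l_quantity      DECIMAL(15,2),
    l_extendedprice DECIMAL(15,2),
    l_discount      DECIMAL(15,2),
    l_shipdate      DATE,
    PRIMARY KEY (l_orderkey, l_linenumber)
);
```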

 

A reliable, stable e-commerce platform must not only bear surging traffic but also execute every action safely and without error. For example, when 10,000 products go on sale and 100,000 customers rush in to grab them, underneath, the same data objects are browsed and read by many people, put into shopping carts, and turned into orders; at checkout, money is deducted from the customer's wallet while the merchant's account correctly receives the customer's funds and the product count is correctly decremented. The numbers seen by customers, merchants, and inventory must stay consistent. Even through server crashes, network delays, natural disasters such as earthquakes and tsunamis, or deliberate malice, the three must remain consistent: it must never happen that a customer is charged while the merchant never receives the money, or the merchant is credited while the customer is not charged, or stock and transactions fail to match and products are oversold.
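As a minimal sketch, the whole checkout is one ACID transaction (all table and column names here are hypothetical):

```sql
BEGIN;

-- Deduct stock only if enough remains; the WHERE clause prevents
-- overselling (the application should verify one row was updated).
UPDATE product  SET stock   = stock - 1
WHERE product_id = 42 AND stock >= 1;

-- Move funds atomically with the stock change.
UPDATE wallet   SET balance = balance - 99.00 WHERE user_id = 7;
UPDATE merchant SET balance = balance + 99.00 WHERE merchant_id = 3;

INSERT INTO order_log (user_id, product_id, amount) VALUES (7, 42, 99.00);

-- Either every step above becomes visible, or none does.
COMMIT;
```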

Nor may the platform ever stop serving the outside world. The trading window is open 24 hours a day and never sleeps; on the Internet, every second is money. The infrastructure must be stable and reliable, the service platform stable and reliable, the system services stable and reliable. Every step of the business process is audited, and no matter what problem occurs, it must be traceable.

For an e-commerce platform whose business keeps growing, the traditional single-machine solution hits an I/O bottleneck and cannot cope with high traffic, while NoSQL solutions reject relational normal-form modeling entirely; modeling only as documents or key-value pairs is intrusive to the existing business application and adds work for traditional application developers. The sharding-middleware solution frees the application developers but hands the DBA more operations work behind the scenes: (1) pressure relief: when one node is under too much pressure, how do you shift load to other nodes while keeping the business running normally? (2) expansion: when new nodes are added, how do you make them take over exactly the right share of the data?

When this e-commerce website has accumulated a large amount of data, data analysis is needed to understand user preferences and market demand. On top of the 8-table structure, 22 query models are built; this is the TPC-H benchmark. Each model joins relatively few tables, the longest joining 4 or 5, and after the joins come aggregation, filtering, grouping, merging, sorting, and so on.

[Figure: the 22 TPC-H query models]
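TPC-H Q3, the shipping-priority query, is typical of the pattern: a three-table join followed by filtering, grouping, aggregation, and sorting:

```sql
-- TPC-H Q3: 3-way join + filter + group + aggregate + sort.
SELECT l_orderkey,
       SUM(l_extendedprice * (1 - l_discount)) AS revenue,
       o_orderdate,
       o_shippriority
FROM customer, orders, lineitem
WHERE c_mktsegment = 'BUILDING'
  AND c_custkey  = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate < '1995-03-15'
  AND l_shipdate  > '1995-03-15'
GROUP BY l_orderkey, o_orderdate, o_shippriority
ORDER BY revenue DESC, o_orderdate
LIMIT 10;
```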

To understand user preferences and market demand in depth, the data dimensions keep multiplying, and review, favorite, browse, and click dimensions must be added. Moreover, data growth is proportional to the computing power it demands, and adding servers can certainly raise computing power. But normal-form modeling, which eliminates redundancy, is not suited to complex business-analysis scenarios. Take zipper (history) tables: a little data is appended every day, and the business hopes that small change lands in a single table, yet under normal-form modeling the underlying analysis model changes far too much.

Alibaba's business likewise uses dimensional modeling. Dimensional modeling does not eliminate data redundancy the way relational modeling does; on the contrary, it takes on more redundancy in exchange for computing power. TPC-DS adopts dimensional modeling, and its essential difference from TPC-H is application modeling from a different angle: TPC-H can still serve OLTP online applications, whereas TPC-DS does not consider normal forms at all and cares more about the subjects of analysis. The TPC-DS model simulates the sales system of a large nationwide retail chain with three sales channels: store (physical store), web (online store), and catalog (phone ordering). Each channel uses two tables to model sales records and return records, and the related tables also carry product information and promotion information. TPC-DS adopts a snowflake data model: the sales, returns, and overall inventory tables of the three channels serve as fact tables, while product-related, user-related, time, and other information serve as dimension tables, named as the table below details.

Obviously, the business theme of TPC-DS is a sales-channel analysis model. With it, business staff can quickly compare channels, channel strength, and channel links, and analyze purchase orders; but it is not suited to financial analysis models, human-resources analysis models, shipping analysis models, or inventory-management analysis models. Its 7 fact tables and 17 dimension tables are shown below. Simple two-table joins in the star schema cover most business-analysis scenarios, and even more complex business can be handled through the snowflake schema.

[Figure: the 7 TPC-DS fact tables and 17 dimension tables]
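Here is a sketch of such a star join using real TPC-DS table names, one fact table against two dimension tables:

```sql
-- Star join: the store_sales fact table with two dimension tables.
SELECT d.d_year,
       i.i_category,
       SUM(ss.ss_net_paid) AS revenue
FROM store_sales ss
JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
JOIN item     i ON ss.ss_item_sk      = i.i_item_sk
WHERE d.d_year = 2001
GROUP BY d.d_year, i.i_category
ORDER BY revenue DESC;
```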

 

The characteristics of TPC-DS are as follows:

  • Tests large-scale data sets to address big-data problems

  • Answers real-world business questions

  • Executes demanding and diverse queries (ad hoc queries, reporting, iterative OLAP, data mining)

  • Is characterized by high CPU and I/O load

  • Periodically synchronizes with the OLTP source databases through data-maintenance functions

As mentioned earlier, TPC-DS adopts dimensional modeling while TPC-H models in normal form, so an ETL process is bound to occur. A production version of a TPC-DS-style application has to open up multiple data sources, connect several databases, clean, organize, and standardize the data, and land it in the data warehouse. Layering, classifying, and grading the data according to the business, and designing subject-oriented data marts, require a global and forward-looking perspective, and modeling must rest on an understanding of the overall business process.
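A toy sketch of one such ETL step, denormalizing normal-form rows into a wide dimensional fact table (the target table is hypothetical; the sources are the TPC-H-style tables above):

```sql
-- Denormalize normal-form order rows into a wide sales fact table.
INSERT INTO dw_sales_fact (order_date, cust_key, part_key, quantity, revenue)
SELECT o.o_orderdate,
       o.o_custkey,
       l.l_partkey,
       l.l_quantity,
       l.l_extendedprice * (1 - l.l_discount)
FROM orders o
JOIN lineitem l ON l.l_orderkey = o.o_orderkey
WHERE o.o_orderdate >= '2022-03-01';   -- incremental load window
```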

 


Conclusion

Finally, a summary of the problems TiDB solves. TiDB is an open-source, distributed, compute-storage-separated, elastically scalable, row-and-column, HTAP database. First, it can serve as the database behind a business application: however high its OLTP concurrency can reach, it will certainly be higher than MySQL's, and it supports transactions, ACID, concurrency control, locking, and serialization, so it can solve 100% of transactional business problems. Data lands in TiKV, and the same data is replicated into TiFlash, a columnar store that TiDB reaches through MPP. However, because the data remains in relational normal-form modeling, TiDB running TPC-H-like complex analytical SQL can at best solve around 30% of analysis problems. For building more complex data application systems, the data has to be moved from TiKV to Hadoop through TiDB's own ecosystem tools and managed in a separate, re-dimensionally-modeled data warehouse, which can solve the other 70%.

 


 

 

Author: angryart · Published: 2022/3/25
