What is the booming distributed database?

1 What is a distributed DB?

So far no authoritative organization has defined the term, and there is not even agreement on which organization would be authoritative. There is no consensus!

Distributed DBs, represented by TiDB, give relational DBs a degree of distributed capability. In these distributed DBs, data sharding and distributed transactions are built-in basic functions. Business developers only need to use the JDBC interface provided by the framework, just as they would with a traditional relational DB such as MySQL. ShardingSphere is distributed DB middleware that not only provides a standardized data sharding solution but also implements distributed transactions and DB governance functions.

de facto standard

When a technology product dominates the market, it naturally becomes the de facto standard for its class. For relational DBs, Oracle is the de facto standard: whenever a DB product releases a new version, its features are compared with Oracle's.

As an emerging kind of basic software, distributed DB does not yet have a "de facto standard". Since there is no reference, let's work it out ourselves and define the concept of distributed DB together.

Working from the outside in is the usual way people come to understand things, so let us observe from two perspectives: the outside and the inside.

2 External Perspective: External Features

What features does a distributed DB have, and what pain points can it solve?

Business application systems are classified by transaction type:

  • Online transaction processing (OLTP)

    Transaction-oriented processing: the amount of data in a single transaction is small, but results must be returned within a short time. Typical scenarios include shopping, payment and transfers.

  • Online analytical processing (OLAP)

    Usually operations over a large data set. Typical scenarios include generating personal annual bills and corporate financial statements.

It is difficult for a single product to fully satisfy both. Therefore, in the era of the single DB, two different technical systems evolved, that is, two different types of relational DB. After moving to a distributed architecture, the two also adopt completely different architectural strategies, which are hard to explain clearly within one framework.

We first focus on distributed DBs in the OLTP scenario. The "DB" mentioned in this tutorial means "relational DB" by default, and distributed DB likewise means a distributed DB that supports the relational model; NoSQL is not discussed. On the whole, relational DBs are more general-purpose because they support SQL and provide ACID transactions, and across a wide range of scenarios NoSQL cannot replace them. More than ten years of NoSQL development have borne this out.

The goal of distributed DB is to integrate the advantages of traditional relational DB and NoSQL DB, and it has achieved good results.

3 Definitions

3.1 OLTP relational DB

Using "OLTP scenario" alone as a qualifier is obviously not precise enough. Let's take a closer look at the specific technical characteristics of OLTP scenarios.

OLTP scenarios usually have three characteristics:

  • Write more and read less, in terms of the number of requests. Read operations are also simple and generally do not involve aggregate calculations over large data sets;
  • Low latency: users have a low tolerance for latency. Responses are usually expected within 500 milliseconds, a few seconds at most, and a latency of more than 5 seconds is usually unacceptable;
  • High concurrency: concurrency grows with business volume, and there is no theoretical upper limit.

Can we then conclude that a distributed DB is a DB that serves OLTP scenarios with more writes and fewer reads, low latency, and high concurrency?

3.2 Massive concurrency

You may say that there is a problem with this definition. For example, relational DBs such as MySQL and Oracle also serve OLTP scenarios, but they are not distributed DBs.

Compared with a traditional relational DB, the biggest difference is that a distributed DB has far higher concurrent processing capability.

Traditional relational DBs usually run in stand-alone mode, with the main load on one machine. The DB's concurrent processing capability is linearly related to the resources of that single machine, so its upper limit is bounded by the largest single-machine configuration available. Expanding performance by upgrading the resources of a single machine is called vertical expansion (scale up).

Can you improve a machine's performance simply by cramming in more CPUs and memory? Of course it's not that easy. The upper limit of a single physical machine's configuration rises relatively slowly, so within any given period there is always a performance ceiling for DBs that rely on vertical expansion. One reason many banks purchase minicomputers or mainframes is that these machines can hold more CPUs and memory than x86 servers, which pushes the ceiling higher.

Distributed DB is different. While keeping the characteristics of a relational DB, it adds machines through horizontal expansion (scale out) and provides much higher concurrency than a single DB. This concurrency is hardly limited by the performance of any single machine. I call this level of concurrency "massive concurrency".

How big is "massive concurrency"?

There are no authoritative figures. In theory one could find the best machine in the world and test it, but for commercial reasons that number would have no practical value. As an empirical value, though, the lower limit of "massive concurrency" is roughly 10,000 TPS.

Definition, version 2.0: a distributed DB is a relational DB that serves OLTP scenarios with more writes and fewer reads, low latency, and massive concurrency.

3.3 High reliability

Version 2.0 still has a problem. If there is no need for massive concurrency, is there no need for a distributed DB? Not so: you also have to consider the high reliability of the DB.

In general, reliability is related to the failure rate of hardware devices.

Unlike banks, many Internet companies and small and medium-sized enterprises usually use x86 servers. x86 servers have many advantages, but their failure rate is relatively high, with an annual failure rate of around 5%.

Some of the more reliable data comes from Google's paper "Failure Trends in a Large Disk Drive Population", which examines disk failures on commodity hardware in detail. It gives a statistical chart of annual disk failure rates, as shown below:

[Figure: annual disk failure rate by drive age, from the Google paper]

It can be seen that the disk failure rate exceeds 2% in the first three months and rises to about 8% in the second year.

You might say that this number is not very high.

But for key application systems in the financial industry, five nines of reliability (99.999%) are usually required. That is, the system's service interruption time in a year cannot exceed 5.26 minutes (365 × 24 × 60 × (1 - 99.999%) ≈ 5.26).
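The downtime figure above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name `downtime_budget_minutes` is made up for illustration, and a year is taken as 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability: float) -> float:
    """Maximum yearly downtime, in minutes, allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability)


for label, availability in [("3 nines", 0.999), ("4 nines", 0.9999), ("5 nines", 0.99999)]:
    print(f"{label}: {downtime_budget_minutes(availability):.2f} minutes per year")
```

Each additional nine shrinks the yearly budget by a factor of ten, which is why five nines leaves only a few minutes.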

Moreover, this is not limited to the financial industry: as people rely more on the Internet, more and more systems will have such high reliability requirements.

Based on these two figures, imagine that your company has four or five key business systems and a dozen DB servers; the number of disks will surely exceed 100. Using the conservative failure rate of 2%, about 2 disks will fail in a year. To achieve five nines of reliability you have only 5.26 minutes. Can you handle a disk failure in that time? It is almost impossible: you may have only just rushed to the machine room when the time runs out.
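The estimate above is simple expected-value arithmetic. The sketch below reproduces it and also shows how likely at least one failure is in a year. It assumes disk failures are independent and that every disk has the same annual failure rate, which is a simplification; the function names are illustrative.

```python
def expected_failures(disk_count: int, annual_failure_rate: float) -> float:
    """Expected number of failed disks per year."""
    return disk_count * annual_failure_rate


def prob_at_least_one_failure(disk_count: int, annual_failure_rate: float) -> float:
    """Chance that at least one disk fails in a year, assuming independent failures."""
    return 1 - (1 - annual_failure_rate) ** disk_count


disks, afr = 100, 0.02  # the figures used in the text
print(f"expected failures per year: {expected_failures(disks, afr):.1f}")
print(f"chance of at least one failure: {prob_at_least_one_failure(disks, afr):.0%}")
```

Under these assumptions a year without any disk failure is the exception, so manual recovery within the downtime budget cannot be the plan.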

You might suggest RAID (Redundant Array of Independent Disks) to improve disk reliability. That is indeed one way, but it also costs performance and storage space. The replica mechanism of a distributed DB balances reliability, performance and space utilization better than RAID. The replica mechanism stores the same data on multiple machines at the same time, forming multiple physical copies.

Returning to the DB, reliability is a little more complicated and involves two metrics: Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO is the time taken to recover from a fault, which can be equated with reliability; RPO is the amount of data lost once service is restored.

The DB stores important data, and in the financial industry the DB bears directly on the safety of customer assets, so no data loss can be tolerated. High reliability for a DB therefore means an RPO of 0 and an RTO of less than 5 minutes.

Traditionally, banks have achieved this goal through a combination of two approaches.

The first is to purchase minicomputers and mainframes, because their stability is better than that of x86 servers.

The second is to introduce professional storage solutions, such as EMC's Symmetrix remote mirroring software (Symmetrix Remote Data Facility, SRDF). The DB runs in active/standby mode and keeps its data files and logs on high-end shared storage, making the DB itself almost stateless. If the primary fails, the standby starts, loads the files from shared storage, and continues to provide service. In this way the RPO can be zero and the RTO relatively small.

However, this solution relies on dedicated software and hardware, which is not only expensive but also a closed technical system. In the context of "de-IOE" (moving away from IBM minicomputers, Oracle DB and EMC storage devices), we must find another way. Distributed DB is a good alternative. It relies on mutual backup and automatic switchover between nodes, which reduces the impact of a single x86 server failure on the overall system and provides a high reliability guarantee.

What's exciting is that this mechanism for handling single points of failure can even be extended to the data center level, through deployment across geographically distant data centers. Even if an entire data center fails, the system can still run normally and the DB stays online.

Definition, version 3.0: a distributed DB is a highly reliable relational DB that serves OLTP scenarios with more writes and fewer reads, low latency, and massive concurrency.

3.4 Mass storage

Although a single DB can expand its storage capacity by relying on external storage devices, this is not a capability of the DB itself. With a distributed scale-out architecture, large storage capacity can be obtained from the local disks of the physical machines, which makes massive storage a standard feature of distributed DBs.

We finally arrive at the ultimate version, 4.0: a distributed DB is a relational DB with massive storage capacity and high reliability that serves OLTP scenarios with more writes and fewer reads, low latency, and massive concurrency.

4 Internal Perspective: Internal Composition

Things with the same external characteristics and effects are not necessarily the same thing.

Copernicus's "heliocentric theory" used 34 circles to explain the trajectories of celestial bodies and refute the "geocentric theory"; more than 100 years later, Kepler needed only 7 ellipses to achieve the same effect, completely overturning the "geocentric theory". From Copernicus to Kepler the effect is similar, but the degree of simplicity is quite different, and that represents great scientific progress.

Therefore, after talking about the external characteristics, we must also observe from the internal perspective.

Many solutions for coping with massive storage and massive concurrency are similar in effect to the 4.0 definition, but they expose too much internal complexity to the user. Solutions with too many constraints on users, an overly complicated usage process, and insufficient cohesion cannot be called mature products, and the mainstream view in the industry does not consider them distributed DBs.

These solutions fall into three categories:

4.1 Client component + single DB

An independent logical layer establishes data sharding and routing rules to provide preliminary management of the single DBs. It lets the application connect to multiple single DBs, expanding both concurrency and storage capacity. As part of the application system, it is deeply intrusive to the business.
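A minimal sketch of what such a client-side routing rule might look like is shown below. It is not how Sharding-JDBC is implemented; the shard names, the hash-modulo rule and the functions `route` and `route_query` are all assumptions made for illustration.

```python
import zlib
from typing import List, Optional

# Hypothetical data source names; a real client would hold connection pools here.
SHARDS = ["orders_db_0", "orders_db_1", "orders_db_2", "orders_db_3"]


def route(sharding_key: str, shards: List[str] = SHARDS) -> str:
    """Pick the single DB that owns a row by hashing its sharding key."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    index = zlib.crc32(sharding_key.encode("utf-8")) % len(shards)
    return shards[index]


def route_query(sharding_key: Optional[str], shards: List[str] = SHARDS) -> List[str]:
    """With a key the query goes to one shard; without one it fans out to all."""
    if sharding_key is None:
        return list(shards)
    return [route(sharding_key, shards)]


print(route("user-42"))          # always the same shard for the same key
print(route_query(None))         # a query without the key must visit every shard
```

Because the rule lives inside the application process, the application still knows about every single DB behind it, which is the intrusion the text describes.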

A typical product of this client component is Sharding-JDBC.

4.2 Proxy middleware + single DB

It manages data rules and routing rules as independent middleware: it runs as a separate process, isolated from both the business application layer and the single DBs, which reduces the impact on applications. As proxy middleware has developed, it has also gained some distributed transaction processing capabilities.

A typical product of this kind of middleware is MyCat.

4.3 Unitized architecture + single DB

The unitized architecture is a complete reconstruction of the business application system. The application system is split into several instances, each configured with its own single DB, so that each instance manages a certain range of data. For example, in a bank loan system, an independent application instance can be built for each branch to manage that branch's users. When cross-branch business occurs, the ACID properties of the transaction are guaranteed by application-layer code through a distributed transaction component.

Depending on the distributed transaction model, the application system has to be modified accordingly, and complexity rises with it. For example, under the TCC (Try-Confirm-Cancel) model, applications must provide idempotent operations.
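To make the idempotency requirement concrete, here is a toy TCC participant. It is a sketch under simplifying assumptions (in-memory state, a single account, no persistence or timeouts), and the class and method names are hypothetical rather than taken from any framework.

```python
class AccountService:
    """Toy TCC participant: Try reserves funds, Confirm or Cancel settles them."""

    def __init__(self, balance: int):
        self.balance = balance
        self.reserved = {}     # transaction id -> amount frozen by Try
        self.finished = set()  # transaction ids already confirmed or cancelled

    def try_debit(self, tx_id: str, amount: int) -> bool:
        if tx_id in self.finished:
            return False       # a late Try after Confirm/Cancel is rejected
        if tx_id in self.reserved:
            return True        # duplicate Try: the funds are already frozen
        if self.balance < amount:
            return False
        self.balance -= amount
        self.reserved[tx_id] = amount
        return True

    def confirm(self, tx_id: str) -> None:
        # Idempotent: a retried Confirm finds nothing left to settle.
        self.reserved.pop(tx_id, None)
        self.finished.add(tx_id)

    def cancel(self, tx_id: str) -> None:
        # Idempotent: the refund happens only while the reservation still exists.
        amount = self.reserved.pop(tx_id, None)
        if amount is not None:
            self.balance += amount
        self.finished.add(tx_id)
```

The coordinator may retry Confirm or Cancel after a timeout, so calling either one twice must leave the balance unchanged. This bookkeeping is exactly the kind of work that falls on application code in this architecture.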

Before distributed DBs emerged, some leading Internet companies used this architectural style. Of the three solutions, it requires the most changes to the application system and is the hardest to implement.

What the three have in common is that the application system can still perceive the single DBs. In contrast, a distributed DB absorbs the technical details into the product and presents itself to business applications as a single whole.

What does the internal architecture of a distributed DB look like? How does it differ from these three schemes? Stay tuned for the next part of this tutorial series.

5 Amazon's Aurora

Aurora is clearly different from the distributed DB described here. Do you know Aurora or similar products? How do they differ from the distributed DB in this article, and why?

Why is Aurora not a distributed DB? Aurora's separation of storage and computing makes Amazon's cloud storage service more efficient and easy to use. It is relatively successful in NewSQL.

features

Aurora proposes a new monolithic architecture that reduces network I/O and synchronous blocking. Logically, it can be regarded as one huge monolithic DB that uses distributed techniques for fault tolerance and high throughput:

  • Aurora's approach to sharding is to divide the total capacity of the DB into fixed-size data segments and store data in each segment. Each segment is served by a group of six machines, so I think it can be said to support sharding.

  • Write more, read less, and low latency: this is what Aurora focuses on, achieved through the "log is the database" design and asynchronous processing.

  • Mass storage and reliability come from piling on machines (of course, these machines must be managed by a control center, which the paper does not describe).

  • High reliability relies on replicating each data segment to six machines across three availability zones.

  • It is also a relational DB.
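The segment layout described in the list above can be sketched roughly as follows. This only illustrates fixed-size segments with six copies over three availability zones; the segment size, zone names and function names are assumptions chosen for the example, not values taken from this article.

```python
from typing import List, Tuple

SEGMENT_SIZE = 10 * 1024**3          # illustrative fixed segment size (10 GiB), an assumption
AZS = ["az-a", "az-b", "az-c"]       # hypothetical availability zone names
COPIES_PER_AZ = 2                    # 3 zones x 2 copies = 6 copies per segment


def segment_of(byte_offset: int) -> int:
    """Map a position in the DB volume to the fixed-size segment that holds it."""
    return byte_offset // SEGMENT_SIZE


def replica_placement(segment_id: int) -> List[Tuple[str, str]]:
    """List the (zone, replica name) pairs that store one segment."""
    return [
        (az, f"segment-{segment_id}-{az}-copy{copy}")
        for az in AZS
        for copy in range(COPIES_PER_AZ)
    ]


seg = segment_of(25 * 1024**3)       # an offset 25 GiB into the volume
for az, name in replica_placement(seg):
    print(az, name)
```

Storage grows by adding segments, while every write for the whole volume still originates from one compute node, which is the point the following paragraphs turn on.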

However, Aurora is characterized by shared storage, vertical expansion of computing nodes, horizontal expansion of storage nodes, and write performance that is limited by the resources of a single machine.

voting mechanism

Aurora also uses a voting mechanism: there are 6 copies, and a write succeeds once more than half of them confirm it. But without sharding, writes cannot be spread across multiple nodes, so it is definitely not distributed.
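The majority rule mentioned above can be sketched in a few lines. This is a generic majority-quorum check, not Aurora's actual implementation; the function names are made up for illustration.

```python
from typing import List


def write_quorum(replica_count: int) -> int:
    """Smallest number of acknowledgements that is more than half of the copies."""
    return replica_count // 2 + 1


def quorum_write(acks: List[bool]) -> bool:
    """A write succeeds once a majority of the copies have acknowledged it."""
    return sum(acks) >= write_quorum(len(acks))


print(write_quorum(6))                        # majority of 6 copies
print(quorum_write([True] * 4 + [False] * 2))
print(quorum_write([True] * 3 + [False] * 3))
```

With 6 copies the majority is 4, so a write survives the loss of two copies. Voting protects the data, but it does not add write capacity, because all copies receive the same writes.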

Whether multiple nodes can accept writes matters (important!): the applicable scenarios differ greatly, so this is an important criterion. But because Aurora is based on shared storage, it is not unreasonable to call it distributed. Setting a standard is only meant to keep the learning path clear.

The difference between the application of Aurora and distributed storage in actual scenarios

Aurora is still a relational DB, whereas "distributed storage system" covers a wide range, including distributed key-value systems such as HBase. The two differ greatly in function.

Products such as AWS Aurora, Alibaba's PolarDB, Tencent's CynosDB and Huawei's Taurus all share a similar architecture: separation of computing and storage. All computing nodes access the same data on the storage nodes, so it can also be called a distributed architecture. Its limitation is that writes cannot scale horizontally. That is still enough for many small-scale applications, so it has not hurt commercial success.

Is Alibaba's PolarDB a distributed DB? Which scheme does it use?

PolarDB is similar to Aurora in architecture, with separation of storage and computing, vertical expansion of computing nodes, and horizontal expansion of storage nodes. This means its write capacity has an upper limit, but thanks to simplified log storage and other optimizations, its single-node capacity is much stronger than that of ordinary MySQL.

6 Summary

Step by step, we have outlined the six external characteristics of a distributed DB: write more and read less, low latency, massive concurrency, massive storage, high reliability, and the relational model.

There are also some solutions with capabilities similar to a distributed DB. Their disadvantage is that they all require changes to the application system and intrude more deeply into the application; their advantage is that they make the most of the stability and reliability of the single DB, qualities that have been tested countless times.

A further note on the name "distributed DB".

"Distributed DB" can literally be decomposed into two parts: "distributed" and "DB", which means that it is an interdisciplinary product, and its theoretical basis comes from two fields. This also echoes the two different paths of product development. Some products start from distributed storage systems and then increase the capabilities of relational DB; other products start from single DB and add distributed technology elements. With the development of distributed DB towards industrial applications, driven by external demand, these two development ideas show a trend of further integration.

7 FAQ

① Should "write more and read less" be part of the definition of distributed DB?

The article says distributed DBs serve applications that write more and read less. I think a distributed DB applies whether the workload is write-heavy or read-heavy. The key is that a single DB cannot bear so many requests (reads or writes), so "high concurrency" is enough. Should "write more and read less" really be part of the definition?

Writing more and reading less is emphasized because:

  • The load of write operations can only fall on the primary node of a single DB and cannot be moved elsewhere;
  • Read operations, if they do not require strong consistency, can be moved to standby nodes, and under certain conditions consistency can even be guaranteed. That is, a single DB can handle a heavy read load with one primary and multiple standbys, without introducing a distributed DB.
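The two points above can be illustrated with a toy read/write router. It is a sketch only: the node names are hypothetical, and the statement check (anything starting with SELECT counts as a read) is deliberately naive.

```python
import itertools
from typing import List


class ReadWriteRouter:
    """Send writes to the primary and spread tolerant reads across standbys."""

    def __init__(self, primary: str, standbys: List[str]):
        self.primary = primary
        # Round-robin over the standbys; None when there are no standbys.
        self._standbys = itertools.cycle(standbys) if standbys else None

    def pick(self, sql: str, strong_consistency: bool = False) -> str:
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and not strong_consistency and self._standbys is not None:
            return next(self._standbys)
        # Writes, and reads that need the latest data, stay on the primary.
        return self.primary


router = ReadWriteRouter("primary", ["standby-1", "standby-2"])
print(router.pick("SELECT * FROM orders WHERE id = 1"))
print(router.pick("UPDATE orders SET status = 'paid' WHERE id = 1"))
```

Adding standbys raises read capacity, but every write still lands on the single primary, which is why write-heavy workloads are the ones that push a system toward a distributed DB.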

Building on the distributed foundation, cloud DBMSs pay more attention to scaling computing and storage independently once they are separated, and even to dynamic scale-out and scale-in. This is self-driven and easier to sell, but it has also caused many problems. Aurora-type DBs propose the "log is the database" idea to reduce write pressure. Snowflake reduces network bottlenecks by establishing an intermediate distributed exchange layer.

② Distributed DB vs. database and table sharding (sub-database, sub-table)

Does a distributed relational DB feel like taking the client or middleware solution and building it directly into the DB server as a feature, making database and table sharding more automated?

The biggest difference is:

  • The experience of using a distributed DB is very close to that of a relational DB: no additional control in the application is needed, which greatly reduces the difficulty of developing business code;
  • The database and table sharding scheme, in contrast, has poor support for distributed transactions and cross-node queries.

③ What about MyCat?

As distributed DBs develop, the market for middleware such as MyCat will keep shrinking. Of course, its usage scenarios may also shift toward supporting heterogeneous DBs, as Presto does.

④ It is often said that Internet application requests are "read more, write less"

That is why there are scaling methods aimed at the "read" problem, such as read/write splitting with one master and multiple slaves, and full data caching. By the same measure, does this mean that distributed DB is not suitable for Internet applications?

Internet applications can indeed satisfy "more reads and fewer writes" with one master and multiple slaves, but only if the requirements for read consistency are low. In financial scenarios, many read operations still cannot run on the standby database because its consistency does not meet the requirements. So one cannot generalize about Internet applications; scenarios must be distinguished.

⑤ In transaction scenarios, the business code cooperates with the distributed DB to perform transaction compensation or data replay

If the business code has to cooperate by performing compensation and replay, that probably means it is not a distributed DB. Before distributed DBs matured, a lot of application code did cooperate with single DBs in this way. Such application code has also been extracted into independent frameworks, such as Ali SOFA.

⑥ How is NewSQL being adopted in production?

For example, Bank of Beijing and China Everbright Bank have both launched TiDB, and OceanBase has been deployed at Bank of Nanjing.

⑦ Is BigTable a special case (proxy middleware + single DB (distributed file system))?

After all, it relies on Chubby as an intermediate layer, but data is fetched by interacting directly with the file system.

BigTable is a distributed KV system, not a distributed DB, because the distributed DB discussed here is a relational DB implemented on a distributed architecture. BigTable does rely on a distributed file system underneath, so it appears to have two layers, but its functionality is very different from a DB's. It is worth looking at PGXC-style distributed DBs instead.

⑧ Distributed relational DB products based on OLAP usage scenarios

The most typical are MPP-architecture DBs: Greenplum and Huawei's GaussDB 200, both built on a PostgreSQL core, and also Vertica. OLAP no longer emphasizes transaction support. If the requirements for data updates are relaxed, many products from the big data ecosystem can also be included: ClickHouse, Hive on Spark, and even Kylin can be regarded as OLAP distributed DBs in the broad sense.

Origin blog.csdn.net/qq_33589510/article/details/132094655