Liu Qi: The database market is showing a trend of diversification, and 20% of traditional databases will be replaced in the next two years

A few days ago, our CEO Liu Qi gave an interview to "Love Analysis" (ifenxi), in which he analyzed the current development trends of the database market, described the characteristics and application scenarios of TiDB, and revealed the company's future plans. The Love Analysis report and the interview record follow. There is a lot of information, enjoy :)

Research | Li Zhe Wang Qi

Written | Li Zhe

Even if the scope is narrowed down from big data to the database segment, PingCAP is still a very special company, and its product TiDB is one of the few databases on the market for HTAP scenarios.

Traditionally, databases are divided into transactional databases (TP) and analytical databases (AP).

The NoSQL databases that have emerged in recent years, such as MongoDB and the Hadoop-based HBase, are mostly analytical databases that solve large-scale data query and analysis problems through a distributed architecture.

However, the transactional databases that carry production systems have always been dominated by traditional database vendors. Oracle, IBM, and others hold the traditional large-enterprise market, while most small and medium-sized enterprises and Internet companies use the open source MySQL. Few new technologies or new companies have been able to enter this market.

In 2012, Google's Spanner was born, which is a transactional database based on a distributed architecture. Inspired by Google, a series of emerging database manufacturers such as CockroachDB (cockroach database) have emerged abroad to solve the TP problem, but the domestic market is almost blank, and no startups can be found to develop such databases.

In 2015, PingCAP was established to fill the domestic gap.

A team with an Internet background builds its database on the open source model

Unlike other database vendors on the market, most of the founding teams of PingCAP are from large Internet companies, such as Wandoujia and JD.com, and almost none of them are from traditional IT or database vendors.

With the background of the Internet, each member of the founding team has experienced a period of exponential data growth, and has experience in processing massive data. When making database products, scalability is a priority.

At the same time, because most Internet companies adopt MySQL technology, TiDB is first compatible with the MySQL protocol, which makes it easier for PingCAP to acquire customers.

Another characteristic of Internet companies is that they put open source first. PingCAP committed to building its database as open source from day one. Unlike other teams, however, PingCAP founder Liu Qi and his colleagues were previously the authors of the distributed cache project Codis. They know how to run an open source community and how to develop a product with the community's help.

On the one hand, the open source community expands the reach of PingCAP's products and brings in potential customers; on the other hand, by running the community, PingCAP can focus on developing its core product TiDB, while some of the other functions can be implemented by users in the community.

In addition, through user feedback, PingCAP can understand the potential needs of users as a reference for TiDB research and development.

The product supports both TP and AP, strong consistency and scalability are the main features

Initially, TiDB only solved the TP problem, but in the actual application process, it is very difficult to directly let customers replace the original MySQL database with a new database, especially when the database manufacturer is a little-known start-up company.

Most enterprise customers keep their traditional MySQL database at the front end and use TiDB as a back-end data mart connected to the front-end database. The real-time performance of this data mart is much better than that of a Hadoop architecture, and it can serve an actual production system.

After running this way for a while, once the customer has gained confidence in PingCAP's product, the MySQL database is gradually replaced and TiDB becomes the front-end database.

When customers use the TiDB database as a data mart, because the front-end database needs to query data from this data mart, higher requirements are placed on the query function of the TiDB database. TiDB adjusted its own database executor to expand the AP function.

In this way, TiDB supports both TP and AP functions and becomes a distributed HTAP (Hybrid Transactional/Analytical Processing) database product.

In the TP scenario, TiDB has the characteristics of strong consistency and can carry industries that are highly sensitive to data consistency, such as finance. Compared with traditional databases, TiDB's scalability is the biggest advantage. TiDB can improve performance by continuously adding machines.

In the AP scenario, compared with HBase, TiDB offers better real-time performance and faster data processing.

At this stage, it mainly covers Internet fields such as Internet finance and games, and sales leads mainly come from open source communities

Compared with traditional companies, Internet companies are more likely to try new technologies, and teams with Internet backgrounds are better able to understand the business characteristics of Internet companies.

At the same time, the development speed of Internet companies is much faster than that of traditional enterprises, and the data volume is growing rapidly, and the demand for improving the underlying technical architecture and database performance is even stronger, especially in the game industry and Internet finance industry.

These factors prompted most of PingCAP's early customers to come from Internet companies, and Tongcheng Travel, 360 Finance, and Mobike have all become PingCAP's customers one after another.

As of the end of 2017, the overall team size of PingCAP has reached about 100 people, of which more than 80% are R&D, and there is only one full-time salesperson.

A single salesperson's ability to acquire customers is very limited. PingCAP mainly acquires customers through the open source community, and the sales staff are only responsible for following up with interested companies. In 2017, the number of users running TiDB in a production environment reached 200, which eventually yielded more than a dozen paying customers.

At this stage, PingCAP still focuses on product polishing and community operation, and has not yet entered the stage of large-scale product promotion. Therefore, in 2018, PingCAP will consider entering traditional industries such as finance, medical care, and logistics, but will not increase sales teams on a large scale. A more cautious market strategy will still be adopted.

Recently, Love Analysis interviewed Liu Qi, the founder of PingCAP. He elaborated on PingCAP's business model, its future strategy, and the development trends of the database industry. Part of the interview is shared below.

Based on the original intention of solving the problem of database scalability, the product can meet the business needs of TP and AP at the same time

Love Analysis: What was your original intention for establishing PingCAP?

Liu Qi: I already had this idea when I was working at JD.com. However, this method has disadvantages: first, its elastic scaling ability is relatively poor; second, its ease of use is relatively poor; third, the mental burden on programmers is relatively large; and fourth, its expressiveness is relatively weak.

At that time, I was working on a project that also needed a distributed database, but there was no satisfactory product on the market.

Therefore, the initial positioning was to solve our own problems. In the middle, we also developed a distributed cache. After that, we started to solve the problem of database scalability and started a business.

Love Analysis: As an underlying technology, a database is something customers choose very cautiously. How did you acquire customers in the first place?

Liu Qi: In 2016, after we got the A round of financing from Yunqi Capital, we began to think about how to acquire the first batch of users. Indeed, it is risky for users to apply a new database online. Who wants to take the risk of trying a brand new database with their online business?

Gaia Entertainment was our first user. At that time, they had a problem with their MySQL database: online queries were extremely slow, the entire system had become unusable, and they could hardly keep the business running without trying a new technology. Our product was still in beta when they started putting the database online.

Because taking a new database online is really risky, many users do it another way. They have a number of MySQL instances running online, build a large cluster behind them, and collect all the data there. It looks a bit like a data warehouse. Because we are compatible with the MySQL protocol, the data can be replicated over, and they can query it in real time.

In the game industry or risk control management with high real-time requirements, they urgently need this technology to solve problems.

We have disclosed a lot of financial cases, and quite a few of them are used in real-time risk control scenarios. The advantage is that it is not directly aimed at online business, the risk is smaller than that of online MySQL, and it just solves their pain points.

After this stage, if the customer feels that the technology is stable enough, they take MySQL offline and move our product to the front to support all of their business.

When customers regard our database as a data warehouse, the query complexity is actually very high. Our database can help customers to do some things that they were afraid to do before. An SQL query statement is even several pages long.

The problem was that our design was not originally aimed at AP workloads, while these queries are AP-oriented. So when we optimized the executor, we made the corresponding adjustments and expanded the AP functionality.

In this way, our product can support both online TP and AP services, and our product becomes HTAP.

When this product is done, we find that the product features are very obvious, there is no strong competitor in this field, and this product meets the needs of users. In many cases, the needs of users cannot be simply divided into TP or AP. In fact, there is no clear definition, and even customers do not care about these, but only hope to solve their own problems.

Love Analysis: From the perspective of writing and querying data, row storage and column storage differ. How does TiDB handle both within one table?

Liu Qi: Row and column are just forms of storage. From a technical point of view, data can still be converted between them.

For example, cold data is slowly converted into column storage in the background, while newly written data still uses row storage. The foreground is still standard row storage, and data is kept in row storage or converted to column storage according to its temperature.

In fact, a recent paper has put forward a new point of view: data storage need not be purely row or purely column, but can depend on access frequency. Frequently accessed data uses row storage, and there are still many queries that do not need to scan the entire table.
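The temperature-based conversion described in this answer can be illustrated with a toy sketch. This is only an illustration under assumptions, not TiDB's implementation: the class `HybridTable`, its methods, and the `cold_after` threshold are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class HybridTable:
    """Toy table: new rows land in row storage; cold rows migrate to columns."""
    columns: tuple
    hot_rows: list = field(default_factory=list)   # row storage: (write_time, row dict)
    cold_cols: dict = field(default_factory=dict)  # column storage: name -> list of values

    def insert(self, now, row):
        # Newly written data always goes to row storage.
        self.hot_rows.append((now, row))

    def compact(self, now, cold_after):
        """Background step: move rows at least cold_after old into column storage."""
        still_hot = []
        for written, row in self.hot_rows:
            if now - written >= cold_after:
                for name in self.columns:
                    self.cold_cols.setdefault(name, []).append(row[name])
            else:
                still_hot.append((written, row))
        self.hot_rows = still_hot

    def column_sum(self, name):
        # Readers see one table: aggregate over both layouts.
        cold = sum(self.cold_cols.get(name, []))
        hot = sum(row[name] for _, row in self.hot_rows)
        return cold + hot


table = HybridTable(columns=("id", "amount"))
table.insert(0, {"id": 1, "amount": 10})
table.insert(5, {"id": 2, "amount": 20})
table.insert(9, {"id": 3, "amount": 30})
table.compact(now=10, cold_after=5)
```

After `compact`, the two older rows live in column form and the newest row stays in row form, while `column_sum` still returns a result over the whole table.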

Love Analysis: When Google is doing Spanner, it emphasizes its scalability. Is the computing power requirement relatively low?

Liu Qi: This is a concept of Google in the past, but in this case, if you do some relatively complex operations, the response time of the database will be longer, which is determined by the storage format.

However, in Google's 2017 paper, the storage format has been changed to partial mixed storage. We have the same iterative route as Google, and our storage format changed earlier because we met the actual needs of users earlier.

Love Analysis: Is there a certain contradiction between algorithms and scalability, and will complex algorithms affect its scalability?

Liu Qi: Algorithms have nothing to do with scalability. Algorithms mainly affect the efficiency of execution.

For example, with column storage, execution is more efficient. Suppose a bank sums the amounts of all accounts. With column storage this is very simple, but with row storage the database has to scan through every row to reach the amount, which is very inefficient. At the underlying computation level, however, it does not make much difference.
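The bank example can be made concrete with a small sketch. The account data and the "fields touched" counter are assumptions made for illustration; the counter is a crude stand-in for the amount of data a scan has to read, not a measurement of any real database.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"account": "A", "owner": "Li", "amount": 120},
    {"account": "B", "owner": "Wang", "amount": 80},
    {"account": "C", "owner": "Zhao", "amount": 300},
]

# Column layout: one contiguous list per field.
columns = {
    "account": ["A", "B", "C"],
    "owner": ["Li", "Wang", "Zhao"],
    "amount": [120, 80, 300],
}


def sum_row_store(rows, field):
    """Sum one field; every whole record is read to reach it."""
    total = 0
    fields_touched = 0
    for row in rows:
        fields_touched += len(row)
        total += row[field]
    return total, fields_touched


def sum_column_store(columns, field):
    """Sum one field; only that column is read."""
    values = columns[field]
    return sum(values), len(values)


row_total, row_touched = sum_row_store(rows, "amount")
col_total, col_touched = sum_column_store(columns, "amount")
```

Both layouts produce the same total; the difference is how much unrelated data is read along the way, which is the point being made about execution efficiency.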

Love Analysis: What adjustments should the database need to make when it is pushed to the foreground?

Liu Qi: It is necessary to decide how much concurrency to use according to the load of the entire system, and some optimizations will be made.

Suppose there is a cluster of 100 machines, and the work is pushed evenly to every machine for calculation. Under high concurrency, every machine may already be very busy. At that point it is useless to add more tasks, and the machine will crash.

However, if a "smart" scheduler controls the instructions and, while maintaining high concurrency, schedules different machines to perform different operations, the machines will not be overloaded. The trade-off is that this introduces relatively long latency.

Of course, the same data need not be processed by a CPU; it may be processed by a GPU or an FPGA, which places higher demands on the scheduler. Judging by the trend, the scheduler's capability is becoming an important indicator of a database's performance.
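The trade-off in this answer, keeping machines from being overloaded at the cost of extra waiting, can be sketched as a load-aware scheduler. This is a minimal illustration, not TiDB's scheduler: the function `schedule` and the `max_per_machine` cap are hypothetical, and real schedulers also track task completion, which is left out here.

```python
import heapq
from collections import deque


def schedule(tasks, machine_count, max_per_machine):
    """Give each task to the least-loaded machine; queue it if all are at the cap."""
    heap = [(0, machine) for machine in range(machine_count)]  # (load, machine id)
    heapq.heapify(heap)
    assignment = {machine: [] for machine in range(machine_count)}
    waiting = deque()
    for task in tasks:
        load, machine = heap[0]  # least-loaded machine
        if load >= max_per_machine:
            # Every machine is at its cap: the task waits instead of
            # overloading a machine. This waiting is the added latency.
            waiting.append(task)
            continue
        heapq.heapreplace(heap, (load + 1, machine))
        assignment[machine].append(task)
    return assignment, list(waiting)


assignment, waiting = schedule(tasks=list(range(7)), machine_count=3, max_per_machine=2)
```

With seven tasks, three machines, and a cap of two, six tasks are spread evenly and the seventh waits in the queue rather than being forced onto a busy machine.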

Love Analysis: How does TiDB achieve real-time performance?

Liu Qi: Because the architecture itself is distributed, its performance can keep scaling, no matter how much data comes in at the front. If the calculation is not fast enough, you can speed it up by adding machines.

The speed is also related to the calculation, and some calculations cannot be pushed to all nodes. For example, if I want to get all the data back for sorting, there is no way for all nodes to do it.

In this case, the role of the optimizer is more important, it will identify which calculations need to be pushed down for parallel operations, and which ones just need to make a decision.
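The difference between a calculation that can be pushed down and one that cannot be fully distributed can be sketched as follows. The `shards` list is a hypothetical stand-in for data held by separate storage nodes, and the loops run sequentially here, whereas a real system would run the per-node work in parallel.

```python
import heapq

# Hypothetical shards: each inner list is the data held by one storage node.
shards = [[5, 1, 9], [7, 3], [8, 2, 6, 4]]


def pushed_down_sum(shards):
    """A sum pushes down fully: each node returns one partial result."""
    partials = [sum(shard) for shard in shards]  # per-node work
    return sum(partials)                         # coordinator merges one number per node


def global_sort(shards):
    """A global sort cannot finish on the nodes alone."""
    sorted_runs = [sorted(shard) for shard in shards]  # local sorting can be pushed down
    return list(heapq.merge(*sorted_runs))             # final ordering needs one merge point


total = pushed_down_sum(shards)
ordered = global_sort(shards)
```

For the sum, the coordinator receives only one number per node. For the sort, every value still has to pass through a single merge step, which is why the optimizer's choice of what to push down matters.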

Love Analysis: For a MySQL-based architecture, can data be migrated to TiDB seamlessly?

Liu Qi: We considered this problem from the very beginning of the design. For MySQL, we can achieve seamless migration. For other protocols such as Oracle or DB2, the code may have to be changed.

Love Analysis: For other protocols, how long is the migration cycle?

Liu Qi: This also depends on the complexity of the business. For example, if the original business has 100,000 SQL statements, they all have to be verified again, and the more complex the business, the longer it takes. On the MySQL protocol side, we can complete a POC quickly.

Love Analysis: Have you considered the next step to support the rapid migration of Oracle or DB2?

Liu Qi: We have no plans in this regard, because these technologies are no longer used in new businesses. The only purpose of supporting them would be to cut into old projects, and cutting into old projects raises compatibility problems. Users need to know how compatible the new technology is and whether they can replace the old one with confidence.

Compatibility is not only functional compatibility, but also bugs. It is difficult to achieve 100% compatibility. The original programmers of the enterprise may also leave. If the old business is replaced, the workload and risks will be great.

At this stage, Internet-oriented industries such as Internet finance and games are key industries, which are suitable for scenarios with large amounts of data and high business complexity.

Love Analysis: Which industries are the products mainly aimed at?

Liu Qi: In the process of commercialization, the most important thing is to make the product and then improve its functions according to the needs of customers.

Also, our product is open source. The advantage of open source is that when users use it, they will timely feedback their experience and problems encountered, and in the process, they will find out who our potential users are.

Our first user was a game company, which was actually beyond our expectations. We thought Internet companies would come first, because they are more aggressive in adopting new technologies.

The game industry also has its own characteristics. The most profitable game companies are the ones operating popular games, whose daily turnover may reach tens of millions. They want their infrastructure to be stable and strong enough. If they hit a bottleneck and have to shut down and rebuild, the losses are very large. Therefore, they also hope to solve the problem through new technologies.

Then there are Internet companies and traditional industries. Even Internet companies are still very conservative when adopting our new product. Because so many MySQL instances are already in use, they feel it is very risky to suddenly switch to a new technology.

However, companies such as those in Internet finance have high real-time requirements. They need real-time information for risk control, which previous solutions could not deliver, so they chose our product.

Love Analysis: What are the application scenarios of TiDB?

Liu Qi: Our database is relatively versatile, and is generally oriented to new business needs. We have not designed the database as a product for a certain industry.

Our advantages show when the customer's data volume reaches the hundred-million level or more; if the data volume is relatively small, there is no need for a distributed database. In addition, the higher the complexity of the business, the more obvious our advantages.

Love Analysis: Which industries will you focus on in the next step?

Liu Qi: From the perspective of revenue, finance should be an industry that we focus on, and data growth in other fields such as logistics and medical care is also relatively fast.

The team is mainly from Internet companies, and there are very few sales staff

Love Analysis: How did PingCAP's user adoption progress in 2017?

Liu Qi: We had 200 users running in the production environment in 2017. The unit price of the product is relatively high, and the number of paying users is less.

Love Analysis: TiDB is an open source technology. What enhancements will be made when providing enterprise-level products?

Liu Qi: Although we provide an open source technology, some of them are closed source, such as monitoring operation and maintenance components, backup tools, security tools, etc.

An enterprise application must have a polished user interface and a wide range of operations tools, and that is what our enterprise edition provides.

Another part, we call it Database & Service, we provide not only a database, but a database platform, and enterprise users can apply for TiDB data cluster. If there is no such thing, the administrator may need to deal with it manually, and the user experience is very different.

Love Analysis: How does TiDB charge?

Liu Qi: We now have two considerations: on the one hand, we can use cloud deployment, and we can see the database entry of Tencent Cloud. This business model is relatively simple. Like other products on the cloud, it is charged by lease.

On the other hand, you can buy our subscription or our license, which is calculated according to the number of nodes.

Love Analysis: What is the size of the company's team?

Liu Qi: There are about 100 people in the company, and R&D accounts for a relatively high proportion, 82%. There is only one salesperson. The sales team is small because users find us on their own, so we have not invested much in this area.

Our requirements for R&D are still very high, including the external support and response speed of R&D personnel. While it may not look as exaggerated as Oracle, there are many outside companies contributing to us.

For example, a lot of the scheduler code was contributed by Mobike, and optimizations for many scenarios were contributed by Toutiao. Others, including the Samsung research institute in South Korea, are helping us with testing, which reflects one of the benefits of open source technology.

Love Analysis: Will the R&D personnel undertake part of the pre-sales work?

Liu Qi: In 2017, there were still some R&D personnel doing pre-sales work, but in 2018 we will make some adjustments, which is also a very important task for us.

The construction of the personnel structure should form a complete system, with pre-sales, implementation, and R&D performing their respective duties, and assigning different people to solve problems at different stages.

Love Analysis: When there are fewer sales staff, are there higher requirements for the operation of the community?

Liu Qi: I think that there are more R&D personnel, and the communication with the community will be faster. The most important users in the community are developers, and the communication with developers must be smoother for R&D personnel, and sales personnel cannot replace this role. For example, if the user proposes that there are some problems with the code, the response speed of R&D will be very fast.

Large-scale users such as Toutiao, Mobike, and Tongcheng contact us actively because of pain points, and do not need sales to do additional work.

Of course, there are still many small-scale users in the community. Although small users do not have the ability to pay, they have a direct effect on the community.

They will test with their own scenarios and find many problems that we have never encountered before. The information they provide is also very important to us, so we will spend a lot of effort to run the community.

Love Analysis: PingCAP's team background is mostly the Internet?

Liu Qi: Yes, there are more Internet companies, all of them are relatively large-scale Internet companies, and they have all experienced the pain caused by the large amount of data.

In addition, some come from traditional industries. Our pre-sales colleague comes from the financial industry and knows its usage scenarios well.

Love Analysis: If you cut into the traditional industry, is there any change in the requirements for the personnel structure?

Liu Qi: At present, we don't think so. We hope that we can directly win customers through our products, which can reflect the advantages of our products. If it is a customer who uses the same database, we will not fight for it, and this is not our strength.

Love Analysis: How to balance energy between product development and community maintenance?

Liu Qi: We will definitely make a basic version first before promoting it in the community. When we encounter a bug, we must fix it, otherwise it will affect the use of many people. The two are promoted together without conflict.

In terms of internal research and development, we develop many new functions quickly. These are not applied to the stable version immediately; instead, we first release a beta version in the community. Users find bugs through testing, and we fix them. After continuous communication, we release the stable version.

In this process, we need to let users continue to test and give us feedback through the community. Because whether the product works or not is not our own decision, but the user's judgment.

The integration of TP and AP is the future trend, and the database market will be more diversified in the future

Love Analysis: There is a certain contradiction between consistency and availability in the CAP principle. How to optimize it?

Liu Qi: We will provide an option in the future, users can choose according to their own needs, high consistency or high availability. For example, bank data requires high consistency, and Internet applications focus more on high availability. We will provide them to users and let them choose.

Love Analysis: What is the difference between NewSQL technology and previous technology?

Liu Qi: Historically, SQL came first. Why did NoSQL appear later? Because SQL databases could not scale out. NoSQL can scale, but its expressiveness is relatively poor, it may not support transactions, and it lacks the traditional advantages of SQL.

NewSQL is equivalent to having two advantages at the same time, which can not only expand well, but also have the transaction processing ability and expressive power of SQL.

Love Analysis: Is there a trend of integration of TP and AP in the next step?

Liu Qi: We think so. Users don't care whether it is TP or AP; what matters is solving the problem. Whether the workload is online or offline, nobody wants to wait a day for something that could be done in real time.

The separation of TP and AP is caused by historical reasons, and was not distinguished when the database was first born. Now that the technology can do it, it is definitely still hoped that it will be integrated. In the case of complex data analysis, there may still be a separate AP, but our products are still in rapid iteration, and in the end, it depends on whose performance is better.

Love Analysis: Will there be another Oracle in the field of distributed database platforms in the future?

Liu Qi: For historical reasons, Oracle's position cannot be replaced in the short term, but new database architectures are emerging very quickly, and Oracle now faces unprecedented challenges. I think that in the next two years, 20% of traditional databases will be replaced by new ones.

Looking at our user growth rate now, this trend is quite obvious.

Love Analysis: What changes will happen to the future market pattern?

Liu Qi: I think the market will become more diversified.

First of all, the current requirements are very fragmented, and traditional databases cannot express them well. For example, the requirements for Streaming are getting higher and higher.

The advantage of relational databases is that they are more versatile and balanced. However, in some scenarios, it is difficult to adapt to the current database framework, and it will definitely not be easier to use than a specially designed database, such as a graph database.

From the perspective of development trends, when NoSQL came out, everyone considered what kinds of scenarios it could replace. Later, it turned out that NoSQL still has many constraints. The emergence of NewSQL will indeed change the market structure. Two or three relatively large companies should take most of the market in the future, but small companies will still exist.

Love Analysis: Will the development of open source technology affect the business of database companies?

Liu Qi: In fact, open source technology has existed for a long time. For example, MySQL has a history of more than 20 years, but enterprise-level applications are not so simple after all, and there are still many problems that need to be solved by the team.

There will not be a completely free database in the future, even if it is open source, it will be charged.

Love Analysis: Internet companies generally develop their own infrastructure, will it affect PingCAP?

Liu Qi: Here we need to distinguish between domestic and foreign companies. Domestic companies like to build private clouds, but foreign companies are quite different. Many foreign companies have dismantled their own private clouds. The reason is very simple: deploying their own private cloud is not as efficient as directly using a mature public cloud.

Nowadays, many Internet companies do not want to be locked in by companies like Oracle as in the past. They want not only to use a database but also to have a certain degree of control over it. Because Internet companies grow rapidly and their needs change quickly, they want to understand and control the database so that they can modify its code to meet their own customized needs.

Love Analysis: Will cloud vendors eventually become competitors of database companies?

Liu Qi: The relationship between a database and the cloud is a bit like the relationship between an app and an app store. Cloud vendors may also build databases, but the relationship should be more of a partnership.