A new home-grown database language has arrived, and it is easier to use than SQL

1. The goal of the database language

1.1 What does a database do

With the word "base" (库, a warehouse) in its name, database software gives the impression that it is mainly about storage. In fact, a database implements two important functions: computation and transactions — what we often call OLAP and OLTP. A database's storage serves these two things; pure storage is not the goal of a database.

We know that SQL is the mainstream database language today. Is it convenient, then, to do these two things with SQL?

The transaction function mainly addresses the need to keep data consistent while it is written and read. This is not difficult to achieve: the interface to the application is very simple, and the code for reading and writing the database is simple as well. If we assume that the logical storage model of today's relational databases is reasonable (whether it is reasonable to store data as tables of records is another complex question, not expanded here), then SQL has no big problem describing transaction-type functionality, because there are no complex actions to describe — the complexity is handled inside the database.

But computing functions are different.

The computation meant here is a broader concept, not just simple arithmetic; searching and association can both be regarded as kinds of computation.

1.2 What kind of computing system is good?

Still two things: easy to write, and fast to run.
Easy to write means letting programmers produce code quickly, so more work gets done per unit of time. Fast to run is even easier to understand: we certainly hope to obtain results in a shorter time.

In fact, the Q in SQL stands for Query. The original intent of its invention was mainly querying — that is, computation — and this is SQL's main goal. However, SQL is hardly adequate for describing computational tasks.

2. Why doesn't SQL work?

2.1 Complex SQL statements

Let's look at "easy to write" first.

SQL is written to read much like English, and some queries can be read or even written directly as English sentences (there are plenty of examples online, so none are repeated here) — this should count as easy to write.

Wait a minute! The SQL we see in textbooks is usually only two or three lines long. Those statements are indeed simple to write, but what if we try something slightly more complicated?

Here is a not-so-complicated example: find the longest streak of consecutive days on which a stock rose. In SQL it is written like this:

select max(consecutive_day)
from (select count(*) consecutive_day
      from (select sum(rise_mark) over(order by trade_date) days_no_gain
            from (select trade_date,
                         case when closing_price > lag(closing_price) over(order by trade_date)
                              then 0 else 1 end rise_mark
                  from stock_price))
      group by days_no_gain)

How this statement works will not be explained here; it is rather mind-bending in any case, and readers can try to puzzle it out for themselves.

This is a question from Runqian Company's recruitment test, and the pass rate is under 20%. Because it proved too hard, the test was changed: the SQL statement was given and candidates were asked to explain what it computes — and the pass rate was still not high.

What does this tell us? That as soon as the situation gets a little more complicated, SQL becomes hard to understand and hard to write!

Now consider "fast to run", again with a simple example that is often brought up: take the top 10 out of 100 million rows. This is not complicated to write in SQL:

SELECT TOP 10 x FROM T ORDER BY x DESC

However, the execution logic of this statement is to sort all the data first and then take the first 10 rows, discarding the rest. As everyone knows, sorting is a very slow action that traverses the data many times, and if the data is too large to fit in memory, external storage must be used for buffering, so performance drops sharply once more. Followed strictly, the logic embodied in this SQL can never run fast. Yet many programmers know that this operation needs no big sort and no external buffering: a single traversal with a small amount of memory is enough — that is, a higher-performance algorithm exists. Unfortunately, such an algorithm cannot be written in SQL. We can only hope that the database optimizer is smart enough to convert the SQL into the high-performance algorithm, but when the situation gets complicated, the optimizer may not be reliable.
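
That single-traversal algorithm is easy to express in an ordinary programming language. Here is a minimal Java sketch (illustrative only, not any database's actual implementation): keep a min-heap of 10 elements and scan the data once, so memory use stays constant no matter how many rows there are.

import java.util.PriorityQueue;
import java.util.Random;

public class Top10 {
    public static void main(String[] args) {
        Random rnd = new Random(42);                        // stand-in for the real data source
        PriorityQueue<Long> heap = new PriorityQueue<>(10); // min-heap holding the 10 largest so far

        // One traversal; the heap never grows beyond 10 elements.
        for (long i = 0; i < 100_000_000L; i++) {
            long x = rnd.nextLong();
            if (heap.size() < 10) {
                heap.offer(x);
            } else if (x > heap.peek()) { // larger than the smallest of the current top 10
                heap.poll();
                heap.offer(x);
            }
        }
        System.out.println(heap);         // the top 10 values
    }
}

Each row costs at most one operation on a heap of size 10, so the whole job is one pass plus negligible memory — exactly the algorithm the SQL above cannot express.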

It seems SQL does not do well on either count. These two problems are not complicated, yet the trouble already shows; in real work, SQL code running to thousands of lines is both hard to write and hard to make run fast.

Why doesn't SQL work?

2.2 What are we doing when we compute with program code?

To answer this question, we need to analyze what exactly we are doing when we implement a computation with program code.

Essentially, writing a program is the process of translating an idea for solving a problem into a precise formal language that a computer can execute. It is like a pupil solving a word problem: after analyzing the problem and finding the solution, the pupil still has to write out the arithmetic expressions. Computing with a program is the same: it is not enough to come up with a solution; the solution must also be translated into actions the computer can understand and perform.

The core of a formal language for describing computation lies in the algebraic system it uses. An algebraic system, simply put, is a set of data types together with the rules of operation on them; the arithmetic we learn in elementary school, for example, is integers plus addition, subtraction, multiplication and division. With such a system, we can write the operations we want in the symbols the algebra agrees on — that is, code — and the computer can then execute it.

If this algebraic system is poorly thought out, so that the data types and operations it provides are inconvenient, describing an algorithm becomes very difficult. A strange phenomenon then arises: the difficulty of translating the solution into code far exceeds that of solving the problem itself.

For example, we have all used Arabic numerals for everyday calculation since childhood. Addition, subtraction, multiplication and division are all convenient, and everyone naturally assumes that numerical operations simply work this way. Not necessarily! Many people know there is another system called Roman numerals. Do you know how to add, subtract, multiply and divide with Roman numerals? How did the ancient Romans manage to shop in the market?

The difficulty of writing code is, to a large extent, a question of the algebra.

2.3 Why it does not run fast

Software cannot change the performance of hardware: the CPU and the disk are as fast as they are. What we can do is design algorithms of lower complexity — algorithms that require fewer operations — so the computer does less work and naturally finishes faster. But inventing the algorithm is not enough: it must also be written in some formal language, or the computer cannot execute it; and if writing it is long and troublesome, nobody will do it. So for programs, running fast and being easy to write are actually the same problem, and behind both lies the algebra on which the formal language rests. If that algebra is poor, high-performance algorithms become difficult or even impossible to express, and there is no way to run fast. As noted above, we cannot write the single-traversal, small-memory algorithm we expect in SQL; whether it runs fast can only be left to the optimizer.

2.4 An analogy

Anyone who went to elementary school probably knows the little story of Gauss computing 1+2+3+...+100. Ordinary people simply add the hundred numbers one by one. The young Gauss was cleverer: he noticed that 1+100=101, 2+99=101, ..., 50+51=101, so the answer is 50 times 101, and the calculation was finished in no time.

Hearing this story, we all feel that Gauss was very clever to think of such an ingenious method — both simple and fast. That is true, but it is easy to overlook one point: in Gauss's time, the human arithmetic system (also an algebra) already had multiplication! Having learned the four operations as children, we take multiplication for granted — but we should not: multiplication was invented after addition. If multiplication had not yet existed in Gauss's day, then even a genius like Gauss could not have solved this problem quickly.

Today's mainstream databases are relational databases, so called because their mathematical foundation is relational algebra; SQL is a formal language developed from the theory of relational algebra.

Now we can answer why SQL falls short in both of the respects we expect: the problem lies in relational algebra. Relational algebra is like an arithmetic system that has only addition and has not yet invented multiplication; inevitably, many things cannot be done well.

Relational algebra was invented fifty years ago, when application requirements and hardware environments were very different from today's. Using a fifty-year-old theory to solve today's problems sounds hopelessly outdated — yet that is the reality: because of the enormous user base and the lack of mature new technology, SQL, built on relational algebra, is still the most important database language today. There have been some improvements over the decades, but the foundation has not changed. Facing contemporary complex requirements and hardware environments, it is only natural that SQL is not up to the job.

And, unfortunately, the problem is theoretical: no amount of engineering optimization will help — it brings limited improvement, not a cure. Most database developers, however, do not think at this level, or, in order to preserve compatibility for existing users, do not intend to; so the mainstream database industry keeps circling inside this loop.

3. Why does SPL work?

3.1 Discrete datasets

So how do we make computation easier to write and faster to run?

Invent a new algebra! An algebra with "multiplication" — and design a new language on top of it.

This is where SPL comes in. Its theoretical basis is no longer relational algebra but a new algebra called discrete datasets; the formal language designed on this algebra is named SPL (Structured Process Language).

SPL remedies the shortcomings of SQL (more precisely, discrete datasets remedy the shortcomings of relational algebra). SPL redefines and extends many operations on structured data: it adds discreteness, strengthens ordered computation, carries set-orientation through thoroughly, supports object references, and advocates stepwise computation.

Rewriting the earlier problems in SPL gives a direct feel for the difference.

The longest streak of consecutive rising days for a stock:

stock_price.sort(trade_date).group@i(closing_price<closing_price[-1]).max(~.len())

The idea of the computation is the same as in the SQL version, but thanks to the introduction of order it is far easier to express — no more contortions.

The top 10 out of 100 million rows:

T.groups(;top(-10,x))

SPL has richer set data types, so it is easy to describe efficient algorithms like this one: a simple aggregation completed in a single traversal, with no big-sort action involved.

Space does not allow a full picture of SPL (discrete datasets) here; instead, we list some of the ways SPL (discrete datasets) improves on SQL (relational algebra):

3.2 Free records

In discrete datasets, a record is a basic data type that can exist independently of any data table. A data table is a set of records, and the records that make up one table can also be used to form other tables. Filtering, for example, builds a new table out of those records of the original table that satisfy the condition, which brings advantages in both space usage and computing performance.

Relational algebra has no operable data type for representing a record. A single record is in effect a data table with only one row, and records cannot be shared between different tables. Filtering, for example, copies records to form a new table, increasing the cost in both space and time.

In particular, because records are free, discrete datasets allow a field of one record to hold another record as its value, which makes foreign-key joins natural and easy to implement.
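
A minimal Java sketch of the idea (the classes and names are illustrative, not SPL's internals): records are independent objects, a "table" is just a collection of references to them, and a foreign-key field simply holds the referenced record.

import java.util.ArrayList;
import java.util.List;

class City {                       // a "free" record
    String name;
    City(String name) { this.name = name; }
}

class Employee {                   // a record whose field references another record
    String name;
    City city;                     // the foreign-key field holds the record itself
    Employee(String name, City city) { this.name = name; this.city = city; }
}

public class FreeRecords {
    public static void main(String[] args) {
        City beijing = new City("Beijing");
        List<Employee> employees = new ArrayList<>();
        employees.add(new Employee("Ann", beijing));
        employees.add(new Employee("Bob", beijing));

        // Filtering builds a new "table" out of the SAME record objects:
        // only references are collected, no rows are copied.
        List<Employee> inBeijing = new ArrayList<>();
        for (Employee e : employees)
            if (e.city == beijing) inBeijing.add(e);

        // The foreign-key "join" is just a field access, no lookup needed.
        System.out.println(inBeijing.get(0).city.name);  // Beijing
    }
}

Filtering here collects references instead of copying rows, and the foreign-key join reduces to a field access — the two advantages the text describes.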

3.3 Orderliness

Relational algebra is defined on unordered sets: set members have no ordinal numbers, and no mechanism is provided for positioning or for referencing neighbors. SQL practice has made some partial engineering improvements, so modern SQL can perform certain ordered operations with relative ease.

Sets in discrete datasets are ordered: every member has an ordinal, members can be accessed by ordinal, and positioning operations are defined that return a member's ordinal within the set. Discrete datasets also provide notation for referencing neighbors in set computations and support computations anchored at an ordinal position in the set.

Ordered computations are common, yet they have always been a hard problem for SQL, even with window functions. SPL improves the situation greatly, as the rising-stock example above shows.
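
To make the contrast concrete, here is the same "longest run of rises" written as one ordered pass in Java (an illustrative sketch with made-up data): each price is compared with its neighbor, which is what the [-1] adjacent reference in the SPL expression denotes.

public class LongestRise {
    public static void main(String[] args) {
        // closing prices, already ordered by trade date
        double[] closingPrice = {10, 11, 12, 11, 12, 13, 14, 13};

        int longest = 0, current = 0;
        for (int i = 1; i < closingPrice.length; i++) {
            // the hand-written form of SPL's adjacent reference closing_price[-1]
            current = closingPrice[i] > closingPrice[i - 1] ? current + 1 : 0;
            longest = Math.max(longest, current);
        }
        // counts consecutive rises: 3 here, for the run 11 -> 12 -> 13 -> 14
        System.out.println(longest);
    }
}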

3.4 Discreteness and set-orientation

Relational algebra defines rich set operations — a set can participate in computation as a whole, as in aggregation and grouping. This is where SQL is more convenient than high-level languages such as Java.

But the discreteness of relational algebra is very poor: it has no free records. High-level languages such as Java have no problem in this respect.

Discrete datasets combine discreteness with set-orientation: there are set data types and operations on them, and set members can also act independently outside the set or form other sets. SPL can be said to concentrate the advantages of both SQL and Java.

Ordered computation is a typical combination of discreteness and set-orientation. The concept of order is meaningful only within a set — a single member has no order — which reflects set-orientation; yet an ordered computation must be performed on a member together with its neighbors, which requires discreteness.

Only with the support of discreteness can set-orientation be carried through thoroughly enough to solve problems such as ordered computation.

Discrete datasets form an algebra that has both discreteness and set-orientation; relational algebra has set-orientation only.

3.5 Understanding grouping

The original purpose of grouping is to split a large set into several subsets according to some rule. Relational algebra has no data type that can represent a set of sets, so it is forced to attach an aggregation to every grouping.

Discrete datasets allow sets of sets, which naturally represent the result of a grouping. Grouping and the aggregation that follows it are split into two independent operations, so much more complex computations can be performed on the grouped subsets.

Relational algebra has only one kind of grouping, equi-grouping: the set is partitioned by the values of the grouping key, and an equi-grouping is always a complete partition.

Discrete datasets regard any way of splitting a large set as a grouping operation. Besides conventional equi-grouping, they provide ordered grouping, which exploits order, and alignment grouping, whose result may be an incomplete partition.
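
A Java sketch of "grouping returns a set of sets" (the record type and field names are made up for illustration): the grouping step keeps the subsets around, and any further computation — not only a single aggregate value — can then be run on each subset.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class Grouping {
    record Trade(String stock, double price) {}

    public static void main(String[] args) {
        List<Trade> trades = List.of(
            new Trade("A", 10), new Trade("A", 12),
            new Trade("B", 20), new Trade("B", 18), new Trade("B", 25));

        // Step 1: grouping alone -- the result is a set of subsets, still fully usable.
        Map<String, List<Trade>> groups =
            trades.stream().collect(Collectors.groupingBy(Trade::stock));

        // Step 2: arbitrary work on each subset, not only a forced SUM/COUNT.
        groups.forEach((stock, subset) -> {
            double max = subset.stream().mapToDouble(Trade::price).max().orElse(0);
            System.out.println(stock + ": " + subset.size() + " trades, max price " + max);
        });
    }
}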

3.6 Understanding aggregation

Relational algebra has no explicit set data type, so the result of an aggregation must be a single value, and the same holds for aggregation after grouping: there are only SUM, COUNT, MAX, MIN and the like. In particular, relational algebra cannot treat the TOPN operation as an aggregation. TOPN over the whole set can only be done by sorting and then taking the first N rows of the output, while TOPN over each grouped subset is genuinely hard to write and usually requires a change of approach, such as constructing row numbers.

Discrete datasets advocate universal sets: the result of an aggregation is not necessarily a single value and may itself be a set. In discrete datasets, TOPN is an aggregation of the same standing as SUM and COUNT — it can be applied to the whole set or to each grouped subset.

Once SPL understands TOPN as an aggregation, the implementation can also avoid sorting the full data, achieving high performance. SQL's TOPN, by contrast, is always accompanied by ORDER BY, which in theory demands a big sort, so one can only hope the database optimizes it away in practice.
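
Here is a Java sketch of TOPN treated as a grouped aggregation (data and names invented for illustration): one pass over the data, one small min-heap per group, and no global sort anywhere.

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class GroupTopN {
    public static void main(String[] args) {
        String[] groups = {"A", "B", "A", "B", "A", "B", "A"};
        int[]    values = { 5 ,  9 ,  7 ,  1 ,  3 ,  8 ,  6 };
        int n = 2;  // top 2 per group

        Map<String, PriorityQueue<Integer>> topn = new HashMap<>();
        for (int i = 0; i < values.length; i++) {
            // one min-heap of size n per group, maintained in a single traversal
            PriorityQueue<Integer> heap =
                topn.computeIfAbsent(groups[i], k -> new PriorityQueue<>(n));
            if (heap.size() < n) heap.offer(values[i]);
            else if (values[i] > heap.peek()) { heap.poll(); heap.offer(values[i]); }
        }
        System.out.println(topn);  // e.g. {A=[6, 7], B=[8, 9]}
    }
}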

3.7 High performance supported by order

Discrete datasets place special emphasis on ordered sets, and many high-performance algorithms can be implemented by exploiting order. Relational algebra, based on unordered sets, cannot do this and can only hope for engineering-level optimization.

Here are some low-complexity operations that order makes possible:

1) A data table ordered by its primary key effectively carries a built-in index. Filtering on key fields can often locate the data quickly, reducing the amount of external-storage traversal; a random key value can be located by binary search, and index information can be reused when fetching several key values at once.

2) Grouping is usually implemented with a HASH algorithm. If the data is known to be ordered by the grouping key, only adjacent comparisons are needed: computing HASH values is avoided, HASH collisions cannot occur, and the work is easy to parallelize.

3) When two large tables ordered by their keys are joined, a higher-performance merge algorithm can be used: the data is traversed only once, no buffering is needed, and memory usage is tiny (see the sketch after this list). The traditional HASH-partitioning method is not only more complex — requiring more memory and external buffering — it may also need a second round of HASH re-partitioning if the HASH function behaves badly.

4) For a foreign-key join where the fact table is small, an ordered foreign-key (dimension) table allows the rows matching each key value to be fetched quickly, completing the join without any HASH-partitioning action. When the fact table is also very large, the foreign-key table can be split into logical segments by quantile points and the fact table segmented accordingly; compared with HASH partitioning, which may trigger a second round of partitioning, the computational complexity drops greatly.

Items 3 and 4 rely on the way discrete datasets transform the join operation. Under the relational-algebra definition (which may produce many-to-many results), such low-complexity algorithms are hard to implement.
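
A Java sketch of the merge algorithm in item 3 (keys invented for illustration, and both sides assumed ordered by a unique key): each side is read forward exactly once, with no hashing and almost no memory.

public class MergeJoin {
    public static void main(String[] args) {
        int[] left  = {1, 3, 5, 7, 9};   // join keys of table 1, sorted, unique
        int[] right = {2, 3, 5, 8, 9};   // join keys of table 2, sorted, unique

        int i = 0, j = 0;
        // Single forward pass over both sides: O(n + m), no buffering.
        while (i < left.length && j < right.length) {
            if      (left[i] < right[j]) i++;
            else if (left[i] > right[j]) j++;
            else {                        // keys match: emit the joined pair
                System.out.println("match on key " + left[i]);
                i++; j++;
            }
        }
    }
}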

Beyond the theoretical differences, SPL also has many engineering-level advantages: parallel code is easier to write, pre-association in large memory speeds up foreign-key joins, a unique columnar storage mechanism supports parallelism over arbitrary segments, and so on.

For more SPL code reflecting these ideas and its big-data algorithms, see:

SPL download address: http://c.raqsoft.com.cn/article/1595816810031
SPL open source address: https://github.com/SPLWare/esProc
