MySQL Condensed Notes (1) - Basics

Main sources quoted in these notes:

Axiu’s study notes (interviewguide.cn)

Kobayashi coding (xiaolincoding.com)

What do you know about the differences between relational and non-relational databases?

  • Advantages of relational databases
    • Easy to understand, because data is organized with the relational model.
    • Data consistency can be maintained.
    • The overhead of data updates is relatively small.
    • Supports complex queries (queries with a WHERE clause).
  • Advantages of non-relational databases
    • No parsing by a SQL layer is needed, so read and write efficiency is high.
    • Based on key-value pairs, the data is highly scalable.
    • Supports storing various types of data, such as images and documents.

What is a non-relational database?

Non-relational databases, also called NoSQL databases, store data in the form of key-value pairs.

They offer high read/write performance and are easy to scale. They can be divided into categories such as in-memory databases and document databases; examples include Redis, MongoDB, and HBase.

Scenarios suitable for using non-relational databases:

  • Logging systems
  • Geolocation storage
  • Massive data volumes
  • High availability

How does MySQL execute a SQL statement? What are the specific steps?

The steps for the Server layer to execute SQL in sequence are:

  1. Client request ->
  2. Connector (verifies user identity and grants permissions) ->
  3. Query cache (returns the result directly on a cache hit, otherwise continues; note that the query cache was removed in MySQL 8.0) ->
  4. Parser (performs lexical analysis and syntax analysis on the SQL, building a syntax tree) ->
  5. Preprocessor (checks whether the tables and fields in the SQL statement exist; expands the * in SELECT * to all columns of the table) ->
  6. Optimizer (selects what it considers the optimal execution plan for the SQL) ->
  7. Executor (first checks whether the user has execution permission, then calls the interfaces provided by the storage engine) ->
  8. The storage engine layer fetches and returns the data (if the query cache is enabled, the result is also written into the cache)
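To see step 6 in practice, EXPLAIN prints the plan the optimizer chose. A minimal sketch against a hypothetical student table:

```sql
-- Inspect the optimizer's chosen plan for a query (hypothetical table):
EXPLAIN SELECT name FROM student WHERE student_id = '20020612';
-- The type, key, and rows columns of the output show which index is used
-- and roughly how many rows the executor will ask the engine to read.
```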

Do you understand the internal structure of MySQL? What two parts can it generally be divided into?

It can be divided into two parts: server layer and storage engine layer, among which:

The server layer includes the connector, query cache, parser, preprocessor, optimizer, executor, etc., covering most of MySQL's core service functionality as well as all built-in functions (date, time, math, encryption functions, and so on). All cross-storage-engine functionality is implemented in this layer, such as stored procedures, triggers, and views.

The storage engine layer is responsible for data storage and retrieval. Its architecture is pluggable and supports multiple storage engines such as InnoDB, MyISAM, and Memory. The most commonly used engine today is InnoDB, which has been the default storage engine since MySQL 5.5.5.
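You can list the engines a server supports, and which one is the default:

```sql
SHOW ENGINES;                      -- available engines and their features
SELECT @@default_storage_engine;   -- InnoDB on MySQL 5.5.5 and later
```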

What do you know about MySQL optimization? In which areas can performance be improved?

  • Create indexes on the fields used in searches
  • Avoid SELECT *; list only the fields that need to be queried
  • Split tables vertically
  • Choose the right storage engine
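A minimal sketch of the first two points above, assuming a hypothetical orders table:

```sql
CREATE INDEX idx_orders_user_id ON orders (user_id);  -- index the search field

SELECT order_id, amount, created_at   -- list the needed fields instead of SELECT *
FROM orders
WHERE user_id = 42;
```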

Have you heard of views? What about cursors?

A view is a virtual table based on the result set of a SQL statement.

A cursor is a data buffer the system opens for the user to store the result set of a SQL statement. Each cursor area has a name; the user can fetch records from the cursor one at a time via SQL statements and assign them to host variables for further processing by the host language.
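A sketch of both, assuming a hypothetical orders table. A view is just a named query; in MySQL a cursor must live inside a stored program:

```sql
CREATE VIEW v_recent_orders AS
SELECT order_id, user_id, amount
FROM orders
WHERE created_at >= NOW() - INTERVAL 30 DAY;

DELIMITER //
CREATE PROCEDURE sum_recent_amounts(OUT total DECIMAL(12,2))
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE amt  DECIMAL(12,2);
  DECLARE cur  CURSOR FOR SELECT amount FROM v_recent_orders;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  SET total = 0;
  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO amt;                    -- fetch one record into a variable
    IF done THEN LEAVE read_loop; END IF;
    SET total = total + amt;
  END LOOP;
  CLOSE cur;
END //
DELIMITER ;
```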

What is the role of views? Can it be changed?

A view is a virtual table. Unlike a table that contains data, a view only contains queries that dynamically retrieve data when used; it does not contain any columns or data. Using views can simplify complex SQL operations, hide specific details, and protect data; once views are created, they can be utilized in the same way as tables.

Views cannot be indexed, nor can they have associated triggers or default values. If the view definition itself contains an ORDER BY, it is overridden by any ORDER BY applied when querying the view.

Create a view: CREATE VIEW view_name AS SELECT ...

Some views are updatable, namely those whose definitions do not use JOIN, subqueries, grouping, aggregate functions, DISTINCT, or UNION. Updating such a view updates its base table. However, views are mainly used to simplify retrieval and protect data rather than for updates, and most views cannot be updated.
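For example, a view defined without any of those constructs can be updated, and the change lands in the base table (sketch with a hypothetical student table):

```sql
CREATE VIEW v_adult_students AS
SELECT student_id, name, age
FROM student
WHERE age >= 18;

-- This UPDATE is applied to the underlying student table:
UPDATE v_adult_students SET age = 21 WHERE student_id = '20060616';
```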

What is the difference between InnoDB and MyISAM, the common storage engines of MySQL? What are the applicable scenarios?

1) Transactions: MyISAM does not support them; InnoDB does. 2) Lock granularity: MyISAM uses table-level locks; InnoDB supports row-level locks and foreign key constraints. 3) MyISAM stores the table's total row count; InnoDB does not. 4) MyISAM uses non-clustered indexes: the B+ tree leaves store pointers to the data file. InnoDB's primary key index is a clustered index: the B+ tree leaves store the row data itself. 5) Backup: InnoDB supports online hot backup; a consistent view can be obtained without stopping writes to all tables. MyISAM does not support this.

Applicable scenarios: MyISAM suits workloads where inserts are infrequent, queries are very frequent, a large number of SELECTs are executed, and no transactions are needed. InnoDB suits workloads with high reliability requirements or that need transactions, and tables that are updated and queried frequently with a large number of INSERTs or UPDATEs.
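The engine is chosen per table; a sketch with a hypothetical table:

```sql
-- MyISAM for a read-heavy, non-transactional table:
CREATE TABLE page_view_stats (
  page_id INT PRIMARY KEY,
  views   BIGINT
) ENGINE = MyISAM;

-- Convert to InnoDB when transactions or row-level locking are needed:
ALTER TABLE page_view_stats ENGINE = InnoDB;
```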

What methods do you know about database structure optimization?

  • Normalization: e.g., eliminating redundancy (saving space).
  • Denormalization: e.g., adding appropriate redundancy (reducing joins).
  • Limit the range of data: prohibit query statements that carry no condition limiting the data range. For example, when users query order history, restrict it to one month.
  • Read/write separation: the classic database splitting scheme; the master handles writes and the slaves handle reads.
  • Split tables: partitioning physically separates data, and data in different partitions can be stored in data files on different disks. A query then scans only the relevant partitions instead of the whole table, which significantly shortens query time. Partitions on different disks also spread the table's data transfer across disks, so a carefully configured partitioning scheme can evenly distribute disk I/O contention. This works well for time-based tables with large data volumes; partitions can be created automatically, for example one per month (see the sketch below).
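A sketch of the monthly partitioning idea, on a hypothetical log table. Note that in MySQL the partitioning column must be part of every unique key, including the primary key:

```sql
CREATE TABLE access_log (
  id        BIGINT NOT NULL,
  logged_at DATE   NOT NULL,
  detail    VARCHAR(255),
  PRIMARY KEY (id, logged_at)             -- partition column included in the key
)
PARTITION BY RANGE (TO_DAYS(logged_at)) (
  PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
  PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
  PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- Only the matching partition is scanned:
SELECT COUNT(*) FROM access_log
WHERE logged_at BETWEEN '2024-01-01' AND '2024-01-31';
```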

Why does a database need to be split into multiple databases and tables? Isn't it possible to put everything in one database or one table?

The purpose of splitting databases and tables is to reduce the burden on any single database or table, improving query performance and shortening query time.

By splitting tables, the burden a single table places on the database is reduced, and query performance improves because each table holds less data. Table-lock contention is also greatly alleviated. Table-splitting strategies can be summarized as vertical splitting and horizontal splitting:

  • Horizontal splitting: modulo-based splitting is a form of random splitting, while time-dimension splitting is a form of continuous splitting. For scenarios with a large number of users, modulo splitting spreads data fairly evenly, so hot spots and concurrent-access bottlenecks are unlikely (see the sketch below).
  • Vertical splitting: a suggested design is to move infrequently used fields into a separate extension table, move large text fields into their own extension table, keep rarely modified fields together in one table, and put frequently changing fields in another table.
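A sketch of modulo-based horizontal splitting, assuming four shards cloned from a hypothetical user_template table:

```sql
-- Four identical tables; the application routes each row by user_id % 4:
CREATE TABLE user_0 LIKE user_template;
CREATE TABLE user_1 LIKE user_template;
CREATE TABLE user_2 LIKE user_template;
CREATE TABLE user_3 LIKE user_template;

-- The application computes the target table name, e.g. for user_id = 10086:
-- 10086 % 4 = 2, so the row lives in user_2.
SELECT * FROM user_2 WHERE user_id = 10086;
```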

Splitting tables within one database only solves the problem of a single table holding too much data; it does not distribute that data across different physical machines, so it does not relieve pressure on the MySQL server. There is still contention for resources on the same physical machine, including CPU, memory, disk I/O, and network bandwidth.

Distributed dilemmas caused by splitting databases and tables, and countermeasures:

  • Data migration and expansion: the usual approach is to read the data out with a program first, then write it into the individual sub-tables according to the chosen sharding strategy.
  • Pagination and sorting: the data must be sorted and returned within each sub-table; the result sets returned by the different sub-tables are then merged, sorted again, and finally returned to the user.

One of the more common methods in database optimization is to split the data table. What do you know about splitting the data table?

Splitting is divided into vertical splitting and horizontal splitting.

Case: a simple shopping system involves the following tables:

1. Product table (data volume: 100,000, stable)

2. Order table (data volume: 2 million, with a growing trend)

3. User table (data volume: 1 million, with a growing trend)

Taking MySQL as an example, here is how horizontal and vertical splitting work. As a rule of thumb, a single MySQL table comfortably handles on the order of millions to tens of millions of rows of relatively static data.

Vertical splitting

Solves: I/O contention between tables.

Does not solve: the pressure caused by data growth in a single table.

Approach: put the product table and the user table on one server, and the order table on a separate server.

Horizontal splitting

Solves: the pressure caused by data growth in a single table.

Does not solve: I/O contention between tables.

Approach: split the user table by gender into a male-user table and a female-user table; split the order table by status into completed and uncompleted orders. Put the product table and the uncompleted-order table on one server and the completed-order table on another; put the male-user table on one server and the female-user table on another (women love shopping, haha). A sketch of the order-table split follows.
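Assuming a hypothetical orders table with a status column:

```sql
CREATE TABLE orders_completed   LIKE orders;
CREATE TABLE orders_uncompleted LIKE orders;

INSERT INTO orders_completed   SELECT * FROM orders WHERE status =  'completed';
INSERT INTO orders_uncompleted SELECT * FROM orders WHERE status <> 'completed';
```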

A scenario question: your company uses MySQL for data storage, with an increment of more than 50,000 rows per day and expected operation and maintenance for three years. What optimization methods would you use?

  • A well-designed database structure that allows some data redundancy and avoids join queries to improve efficiency.
  • Select appropriate field data types and storage engines, and add indexes where appropriate.
  • MySQL master-slave replication with read/write separation.
  • Split tables along regular patterns (e.g., by time) to reduce the amount of data in a single table and improve query speed.
  • Add caching mechanisms, such as Memcached or APC.
  • Generate static pages for pages that do not change frequently.
  • Write efficient SQL. For example, change SELECT * FROM TABLE to SELECT field_1, field_2, field_3 FROM TABLE.

What are super keys, candidate keys, primary keys, and foreign keys in a database?

  • Super key: a set of attributes that uniquely identifies a tuple in a relation is called a super key of the relational schema.

  • Candidate key: a super key containing no redundant attributes is called a candidate key. That is, removing any attribute from a candidate key means it is no longer a key.

  • Primary key: the candidate key chosen by the user as the tuple identifier is called the primary key.

  • Foreign key: if attribute K in relational schema R is the primary key of another schema, then K is called a foreign key of schema R.

Example:

| Student ID | Name | Gender | Age | Department | Major |
| --- | --- | --- | --- | --- | --- |
| 20020612 | Li Hui | Male | 20 | Computer | Software Development |
| 20060613 | Zhang Ming | Male | 18 | Computer | Software Development |
| 20060614 | Wang Xiaoyu | Female | 19 | Physics | Mechanics |
| 20060615 | Li Shuhua | Female | 17 | Biology | Zoology |
| 20060616 | Zhao Jing | Male | 21 | Chemistry | Food Chemistry |
| 20060617 | Zhao Jing | Female | 20 | Biology | Botany |
  1. Super key: from the example, the student ID uniquely identifies a student entity, so the student ID is a super key. We can also combine it with other attributes, for example (student ID, gender) or (student ID, age).
  2. Candidate key: the student ID by itself uniquely identifies a tuple, so the student ID is a candidate key. Candidate keys are a subset of super keys; (student ID, age) is a super key but not a candidate key, because it contains a redundant attribute.
  3. Primary key: simply put, the candidate key here is the student ID, and once we select it as the tuple's unique identifier, the student ID becomes the primary key.
  4. Foreign key: defined relative to a primary key. For example, in the student table the primary key is the student ID, and the transcript table also has a student ID field; therefore the student ID is a foreign key of the transcript table and the primary key of the student table.

Primary keys are chosen from among the candidate keys, candidate keys are a subset of the super keys, and foreign keys are determined relative to primary keys.
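The student/transcript relationship above, sketched as DDL:

```sql
CREATE TABLE student (
  student_id CHAR(8) PRIMARY KEY,   -- the candidate key chosen as primary key
  name       VARCHAR(30),
  gender     CHAR(6),
  age        TINYINT
);

CREATE TABLE transcript (
  student_id CHAR(8),
  course     VARCHAR(30),
  score      DECIMAL(5,2),
  PRIMARY KEY (student_id, course),
  FOREIGN KEY (student_id) REFERENCES student (student_id)  -- foreign key into student
);
```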

An in-depth introduction to the three normal forms of database design

First normal form

In any relational database, first normal form (1NF) is the basic requirement of the relational model; a database that does not satisfy 1NF is not a relational database. 1NF means that every column of a database table is an indivisible atomic data item: the same column cannot hold multiple values, i.e., an attribute of an entity cannot have multiple values or repeating attributes.

If repeating attributes appear, you may need to define a new entity composed of those repeating attributes, with a one-to-many relationship between the original entity and the new one. In first normal form, each row of the table contains information about only one instance.

In short, first normal form means columns without repetition.

Second normal form

Second normal form (2NF) is built on first normal form (1NF): to satisfy 2NF, 1NF must be satisfied first. 2NF requires that every instance, or row, in a database table be uniquely distinguishable.

To achieve this, it is usually necessary to add a column that stores the unique identity of each instance. This unique attribute column is called the primary key. 2NF further requires that the attributes of an entity depend fully on the primary key.

Full dependence means no attribute may depend on only part of the primary key. If such an attribute exists, it and that part of the primary key should be separated out to form a new entity, with a one-to-many relationship between the new entity and the original one.

In short, there is a primary key, and non-key fields depend on the whole primary key.

Third normal form

To satisfy the third normal form (3NF), you must first satisfy the second normal form (2NF). In short, third normal form (3NF) requires that a database table does not contain non-primary key information that is already contained in other tables.

For example, suppose there is a department table in which each department has a department number (dept_id), department name, department profile, and other information. Once the department number is listed in the employee table, the department name, profile, and other department-related information must not be added to the employee table as well. If the department table does not exist, it should be created according to 3NF; otherwise there will be a great deal of data redundancy.

In short, non-primary key fields cannot depend on each other.

1NF: atomicity; fields cannot be subdivided further, otherwise it is not a relational database.

2NF: uniqueness; a table describes only one kind of thing.

3NF: every column relates directly to the primary key, with no transitive dependencies.
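The department example from the 3NF discussion, sketched as two tables so that department details are stored only once:

```sql
CREATE TABLE dept (
  dept_id   INT PRIMARY KEY,
  dept_name VARCHAR(50),
  dept_desc VARCHAR(255)
);

CREATE TABLE employee (
  emp_id  INT PRIMARY KEY,
  name    VARCHAR(30),
  dept_id INT,                                     -- only the department number
  FOREIGN KEY (dept_id) REFERENCES dept (dept_id)  -- name/profile live in dept only
);
```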

How does the database ensure durability?

Durability mainly relies on InnoDB's redo log. As mentioned before, MySQL first loads data from disk into memory, modifies it in memory, and then writes it back to disk. If the machine crashes suddenly at that point, the data in memory is lost. How do we solve this problem? One simple idea is to write the data directly to disk before the transaction commits. What is wrong with doing that?

  • To modify even one byte in a page, the entire page must be flushed to disk, which wastes resources. A page is 16 KB; flushing 16 KB to disk for a tiny change is not reasonable.
  • The SQL in a transaction may modify multiple data pages, and those pages are not necessarily adjacent, so the writes are random I/O, which is clearly slower.

Therefore, the redo log is used to solve these problems. When data is modified, the change is made in memory and the operation is also recorded in the redo log. When the transaction commits, the redo log is flushed (part of the redo log resides in memory and part on disk). If the database crashes and restarts, the contents of the redo log are replayed into the database, and transactions are then rolled back or committed based on the undo log and binlog.

What are the benefits of using redo log?

In fact, the advantage is that flushing the redo log is more efficient than flushing the data pages. Specifically:

  • The redo log is small: it records only which page was modified and how, so it can be flushed quickly.
  • The redo log is appended to the end of the file, which is sequential I/O, clearly faster than random I/O.
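The real MySQL variable innodb_flush_log_at_trx_commit controls when the redo log is flushed at commit; the durability/speed trade-off looks like this:

```sql
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
-- 1 (default): write and fsync the redo log on every commit; full durability
-- 0: write and flush about once per second; a crash can lose ~1s of transactions
-- 2: write to the OS cache at commit, fsync about once per second;
--    survives a mysqld crash but not an OS or power failure
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
```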

High database concurrency is something we often encounter. Do you have any good solutions?

  • Add caching to the web service framework: add a cache layer between the server and the database to hold frequently accessed data, reducing the read load on the database.
  • Add database indexes to speed up queries. (However, too many indexes slow things down, because writes must also update the indexes.)
  • Master-slave read/write separation: the master server handles writes and the slave servers handle reads.
  • Split the database so tables stay as small as possible, improving query speed.
  • Use a distributed architecture to spread the computing load.

How do you ensure that IDs remain unique across tables after MySQL is split into multiple databases and tables?

You can use the snowflake algorithm to generate distributed IDs. It generates a 64-bit integer, which guarantees that primary keys from different processes do not collide and that primary keys from the same process are ordered. A snowflake ID typically consists of 1 unused sign bit, a 41-bit millisecond timestamp, a 10-bit machine ID, and a 12-bit sequence number.

(Figure: snowflake algorithm ID layout)

Within a single process, uniqueness is first guaranteed by the time bits; if two IDs share the same timestamp, the sequence bits distinguish them. Because the time bits increase monotonically, and servers are generally time-synchronized, the primary keys generated in a distributed environment can be considered roughly ordered, which keeps inserts into the indexed primary key field efficient.

However, the snowflake algorithm has a shortcoming: it depends strongly on the clock, and if a machine's time is set backward, duplicate IDs may be generated. You can instead use Leaf, the distributed ID solution provided by Meituan, which can operate without relying on timestamps.

 

Source: blog.csdn.net/shisniend/article/details/131869669