Summary of MySQL and ES data synchronization schemes

Foreword

In real-world projects, we often use MySQL as the business database and ES as the query database. This separates reads from writes, relieves query pressure on MySQL, and handles complex queries over massive data. A key question is how to keep the data in MySQL and ES synchronized. This article walks through the common solutions for data synchronization between MySQL and ES.

1. Characteristics of MySQL and ES

Why choose MySQL

MySQL did not start from a particularly advantageous position in the history of relational databases: Oracle, DB2, and PostgreSQL (Ingres) were developed 20 years earlier. However, riding the Internet boom around 2000, it spread quickly along with the LAMP architecture. In China especially, most emerging enterprises keep the master data of their IT systems in MySQL.

  • Core features: open source and free, high concurrency, stability, transaction support, SQL query support

  • High concurrency: the MySQL kernel is especially well suited to highly concurrent, simple SQL operations. Connections are lightweight (thread mode); the optimizer, executor, and transaction engine are relatively simple and crude, while the storage engine is more refined

  • Good stability: the main requirement for a primary database is that it is stable and does not lose data. The characteristics of the MySQL kernel help it achieve good stability, primary/standby replication was available very early to allow fast switchover after a crash, and the InnoDB storage engine keeps the data on disk safe

  • Convenient operation: a good, convenient user experience (compared to PostgreSQL) makes it easy for application developers to get started, with a low learning cost

  • Open source ecosystem: because MySQL is open source, it is relatively simple for upstream and downstream vendors to build tools around it. HAProxy and database/table sharding middleware greatly enhance its practicality, and being open source has also earned it a large user base

Why choose ES

Several notable features of ES effectively make up for MySQL's shortcomings in enterprise data operation scenarios, which is an important reason for choosing it as the downstream data store.

  • Core features: tokenized (word segmentation) search, good multi-dimensional filtering performance, support for querying massive data
  • Text search capability: ES is a search system based on inverted indexes. With a variety of tokenizers, it performs well at fuzzy text matching and fits a wide range of business scenarios.
  • Good multi-dimensional filtering performance: billion-scale data is pre-built into wide tables (eliminating joins) and combined with indexes on every field, which gives ES an overwhelming advantage in multi-dimensional filtering. This capability is the core requirement of enterprise operation systems such as CRM, BOSS, and MIS, and together with text search it makes ES unique.
  • Open source and commercial in parallel: the ES open source ecosystem is very active with a large user base, and it is also backed by an independent commercial company, which gives users more diverse, gradual choices according to their own needs.

2. Data synchronization schemes

1. Synchronous double write

This is the simplest approach: write the data to ES at the same time as writing it to MySQL.

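Pseudocode:

    /**
     * Add a new product: write to MySQL first, then to ES.
     * Assumes a Spring bean with an injected goodsMapper (MyBatis-style mapper)
     * and highLevelClient (RestHighLevelClient, 7.x API).
     */
    @Transactional(rollbackFor = Exception.class)
    public void addGoods(GoodsDto goodsDto) throws IOException {
        // 1. Save to MySQL
        Goods goods = new Goods();
        BeanUtils.copyProperties(goodsDto, goods);
        goodsMapper.insert(goods);

        // 2. Save to ES; an exception here rolls back the MySQL insert
        IndexRequest indexRequest = new IndexRequest("goods_index");
        indexRequest.source(JSON.toJSONString(goods), XContentType.JSON);
        // Make the document searchable immediately (costly under heavy write load)
        indexRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
        highLevelClient.index(indexRequest, RequestOptions.DEFAULT);
    }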

Advantages:

1. Simple business logic
2. High real-time performance

Disadvantages:

1. Hard-coded: everywhere the code writes to MySQL, code to write to ES must be added;
2. Strong coupling with the business code;
3. Risk of data loss when one of the two writes fails;
4. Poor performance: MySQL's own write performance is not especially high, and adding an ES write to each operation inevitably lowers system performance further.

Note:
The double-write failure risks mentioned above include the following:
1) The ES system is unavailable;
2) A network failure between the program and ES;
3) The program restarts before the write to ES completes, and so on.
If strong data consistency is required in these situations, the double write must be wrapped in a transaction, and once a transaction is used the performance degradation becomes even more noticeable.

2. Asynchronous double write (MQ mode)

For scenarios that write to multiple data sources, MQ can be used to make the writes asynchronous. The write logic for each data source then does not interfere with the others, and an abnormal or slow write to one data source does not affect writes to the rest. Overall write throughput increases, but because MQ consumption is asynchronous, this scheme is not suitable for business scenarios that require real-time visibility.
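The idea can be sketched in plain Java. This is only an illustration under stated assumptions: the class names (GoodsMessage, Sink, SinkConsumer) are hypothetical, and an in-memory queue stands in for the MQ; a real system would use a broker such as Kafka or RocketMQ and call the ES client inside the sink.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** A change message published by the business code (hypothetical format). */
final class GoodsMessage {
    final long id;
    final String json;

    GoodsMessage(long id, String json) {
        this.id = id;
        this.json = json;
    }
}

/** A downstream store; an ES-backed implementation would call the ES client here. */
interface Sink {
    void write(GoodsMessage message) throws Exception;
}

/** In-memory stand-in for one MQ queue plus the consumer attached to it. */
final class SinkConsumer {
    private final BlockingQueue<GoodsMessage> queue = new LinkedBlockingQueue<>();
    private final Sink sink;

    SinkConsumer(Sink sink) {
        this.sink = sink;
    }

    /** Called by the producer side; returns immediately without touching the sink. */
    void publish(GoodsMessage message) {
        queue.add(message);
    }

    /** Processes the messages pending right now; failures are re-queued for a later pass. */
    int drainOnce() {
        int written = 0;
        int pending = queue.size();
        for (int i = 0; i < pending; i++) {
            GoodsMessage message = queue.poll();
            if (message == null) {
                break;
            }
            try {
                sink.write(message);
                written++;
            } catch (Exception e) {
                queue.add(message); // keep the message so it can be consumed again
            }
        }
        return written;
    }
}
```

Giving each data source its own SinkConsumer keeps a slow or failing sink from blocking the others. The producer only pays the cost of publish(), which is why the written data may not be visible in ES immediately.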

Advantages:

1. High performance
2. Data loss is unlikely, mainly thanks to the MQ's message consumption guarantees: if ES is down or a write fails, the message can be consumed again.
3. Writes to different data sources are isolated from each other, making it easy to add more data sources

Disadvantages:

1. Still hard-coded: connecting a new data source requires implementing new consumer code
2. Increased system complexity: message middleware is introduced
3. Possible delay: MQ is an asynchronous consumption model, so data written by users may not be visible immediately.

3. Scheduled scan synchronization based on MySQL tables

Both of the above solutions are hard-coded: every insert, update, or delete against MySQL must be accompanied either by ES code or by MQ code, which is too intrusive.
If real-time requirements are not high, a timer can handle the synchronization. The specific steps are as follows:
1. Add a timestamp field to the relevant tables in the database; any write to a row updates this field;
2. Leave the CRUD operations in the original program unchanged;
3. Add a timer program that scans the specified tables at a fixed interval and extracts the rows that changed within that interval;
4. Write the extracted rows into ES one by one.


A typical implementation of this solution uses Logstash for data synchronization. Its underlying principle is to run a configured SQL query on a schedule and write the newly changed rows into ES, achieving incremental synchronization.
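The polling principle can be sketched in plain Java. This is not how Logstash is implemented; it is a minimal illustration in which the names GoodsRow and TimestampScanner are hypothetical, a List stands in for the MySQL table, and a Map stands in for the ES index.

```java
import java.util.List;
import java.util.Map;

/** One row of the source table; updatedAt plays the role of the timestamp column. */
final class GoodsRow {
    final long id;
    final String name;
    final long updatedAt;

    GoodsRow(long id, String name, long updatedAt) {
        this.id = id;
        this.name = name;
        this.updatedAt = updatedAt;
    }
}

/** Incremental scanner: each pass picks up only rows changed since the previous pass. */
final class TimestampScanner {
    private long lastSyncedAt = 0L; // watermark: newest timestamp already synchronized

    /** Similar in spirit to: SELECT * FROM goods WHERE updated_at > :lastSyncedAt */
    int scanOnce(List<GoodsRow> table, Map<Long, String> esIndex) {
        int synced = 0;
        long maxSeen = lastSyncedAt;
        for (GoodsRow row : table) {
            if (row.updatedAt > lastSyncedAt) {
                esIndex.put(row.id, row.name); // upsert keyed by primary key
                maxSeen = Math.max(maxSeen, row.updatedAt);
                synced++;
            }
        }
        lastSyncedAt = maxSeen; // advance the watermark for the next pass
        return synced;
    }
}
```

In a real worker, scanOnce would run on a schedule, for example through ScheduledExecutorService.scheduleWithFixedDelay. Note that a strict "greater than" comparison can miss a row written later with the same timestamp as the watermark, so the timestamp resolution matters.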

For specific implementation, please refer to: Timing incremental synchronization of mysql data to ES through Logstash


Advantages:

1. No changes to the original code: no intrusion, no hard-coding;
2. No strong business coupling and no impact on the performance of the original program;
3. The worker code is simple to write, since it does not need to handle inserts, deletes, updates, and queries separately;

Disadvantages:

1. Poor timeliness: because a timer polls the table at a fixed frequency, there is still some delay even if the synchronization interval is set to the order of seconds.
2. Polling puts some load on the database. One improvement is to poll a lightly loaded slave database instead.

4. Real-time synchronization based on Binlog

Each of the above three solutions suffers from code intrusion, hard-coding, or delay. Is there a solution that keeps synchronization real-time without intruding on the code?
Yes: MySQL's binlog can be used for synchronization.

The specific steps are as follows:
1) Read MySQL's binlog to obtain the change records for the specified tables;
2) Convert the records into MQ messages;
3) Write an MQ consumer program;
4) Consume the MQ continuously, writing each consumed message to ES.
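Steps 3 and 4 can be sketched as a consumer that applies row-level change messages to an index. The message shape (RowChange with a type, table, primary key, and row values) is an assumption made for illustration, and a Map stands in for the ES index.

```java
import java.util.HashMap;
import java.util.Map;

enum ChangeType { INSERT, UPDATE, DELETE }

/** One row-level change parsed from the binlog (hypothetical message shape). */
final class RowChange {
    final ChangeType type;
    final String table;
    final long id;
    final Map<String, Object> after; // row values after the change; unused for DELETE

    RowChange(ChangeType type, String table, long id, Map<String, Object> after) {
        this.type = type;
        this.table = table;
        this.id = id;
        this.after = after;
    }
}

/** MQ consumer that mirrors changes of one table into an index keyed by primary key. */
final class BinlogEventConsumer {
    private final String watchedTable;
    private final Map<Long, Map<String, Object>> esIndex = new HashMap<>();

    BinlogEventConsumer(String watchedTable) {
        this.watchedTable = watchedTable;
    }

    void onMessage(RowChange change) {
        if (!watchedTable.equals(change.table)) {
            return; // only the specified table is synchronized
        }
        switch (change.type) {
            case INSERT:
            case UPDATE:
                esIndex.put(change.id, change.after); // upsert the whole document
                break;
            case DELETE:
                esIndex.remove(change.id);
                break;
        }
    }

    Map<String, Object> get(long id) {
        return esIndex.get(id);
    }
}
```

Because inserts and updates are both applied as an upsert keyed by primary key, replaying the same message twice leaves the index in the same state, which helps when the MQ redelivers a message.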

Advantages:

1. No code intrusion, no hard-coding;
2. The original system needs no changes and is unaware of the synchronization;
3. High performance;
4. Business decoupling: no need to care about the business logic of the original system.

Disadvantages:

1. Building the binlog system is complex;
2. If MQ is used to consume the parsed binlog information, there is the same risk of MQ delay as in the second solution.

The solution currently popular in the industry is to use canal to monitor the binlog and synchronize data to ES.

canal, whose name means waterway/pipe/ditch, parses the incremental logs of a MySQL database and provides incremental data subscription and consumption.
Put simply, it synchronizes data incrementally based on MySQL's binlog. To understand how canal works, first understand MySQL master-slave replication:
1. All create, update, and delete operations go to the MySQL master node
2. The master node generates binlog files, and every such operation on the database is recorded in the binlog
3. The slave node subscribes to the master node's binlog and applies the changes incrementally to its own data

canal works by disguising itself as a MySQL slave node and subscribing to the master node's binlog. The main process is:
1. The canal server sends a dump request to the MySQL master node using the dump protocol
2. After receiving the dump request, the MySQL master node pushes the binlog to the canal server, which parses the binlog objects (originally a byte stream) and converts them into JSON format
3. The canal client receives the data from the canal server over TCP or through MQ and synchronizes it to ES
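The "pretend to be a slave" idea can be illustrated with a toy model. This is not canal's API or the MySQL replication protocol: BinlogMaster and FakeReplica are hypothetical names, events are plain strings, and the replica pulls batches here whereas a real master pushes the binlog stream after the dump request.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Stand-in for the master's binlog: an append-only list addressed by position. */
final class BinlogMaster {
    private final List<String> events = new ArrayList<>();

    void append(String event) {
        events.add(event);
    }

    /** Answers a dump request: every event recorded at or after the given position. */
    List<String> dump(int fromPosition) {
        int start = Math.min(fromPosition, events.size());
        return new ArrayList<>(events.subList(start, events.size()));
    }
}

/** Behaves like a replica: remembers its position and asks only for newer events. */
final class FakeReplica {
    private int position = 0;
    private final List<String> received = new ArrayList<>();

    /** Requests everything after the current position and advances it. */
    int pull(BinlogMaster master) {
        List<String> batch = master.dump(position);
        received.addAll(batch);
        position += batch.size();
        return batch.size();
    }

    List<String> received() {
        return Collections.unmodifiableList(received);
    }
}
```

The point of the sketch is the position bookkeeping: since the replica only asks for events after what it has already seen, the master needs no knowledge of what the subscriber does with the data, which is why the original system requires no changes.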

3. Choosing a data migration and synchronization tool

There are many data migration and synchronization tools to choose from. The following table compares only some of the tools that the author has used and researched for synchronizing MySQL to ES. Users can choose the product that suits their actual needs.

Feature | Canal | DTS | CloudCanal
--- | --- | --- | ---
Supports self-built ES | yes | no | yes
Richness of target ES version support | Medium: ES6 and ES7 | High: ES5, ES6 and ES7 | Medium: ES6 and ES7
Nested type support | join/nested/object | object | nested/object
Join support method | Join parent-child documents & reverse lookup | none | Wide-table pre-build & reverse lookup
Supports structure migration | no | yes | yes
Supports full migration | yes | yes | yes
Supports incremental migration | yes | yes | yes
Data filtering capability | WHERE conditions can only be added in the full stage | High: WHERE conditions in the full and incremental stages | High: WHERE conditions in the full and incremental stages
Supports time zone conversion | no | yes | yes
Synchronization rate limiting | no | yes | yes
Task editing | no | yes | no
Richness of data source support | medium | high | medium
Architecture | Subscription/consumption mode (data must first be written to a message queue) | direct mode | direct mode
Richness of monitoring metrics | Medium: performance metrics | Medium: performance metrics | High: performance and resource metrics
Alerting | none | Phone alerts for delays and exceptions | DingTalk, SMS, and email alerts for delays and exceptions
Visual task creation, configuration & management | no | yes | yes
Open source | yes | no | no
Free | yes | no | yes (community and SaaS editions are free)
Supports standalone delivery | yes | no (relies on the overall cloud platform) | yes
Available as SaaS | no | yes | yes

Summary

This article summarizes the common solutions for data synchronization between MySQL and ES.

  • Synchronous double write is the simplest synchronization method and guarantees the real-time performance of data synchronization to the greatest extent. Its biggest problem is that it is too intrusive to the code.
  • Asynchronous double write introduces message middleware. Because MQ is an asynchronous consumption model, synchronization delay may occur. The advantages are higher throughput and better performance when synchronizing messages at scale, easy connection of additional data sources, and isolation between the consumption and writing of each data source.
  • Scheduled scan synchronization based on MySQL tables uses a timer to periodically scan incremental data in the tables, so it causes no code intrusion. However, because the scan is periodic, synchronization delay also occurs. A typical implementation uses Logstash for incremental synchronization.
  • Real-time synchronization based on binlog synchronizes data incrementally by listening to MySQL's binlog. It causes no code intrusion and keeps synchronization real-time. The disadvantage is that the binlog system is relatively complex to build. A typical implementation uses canal.

Origin blog.csdn.net/weixin_45178729/article/details/127162924