Canal: a tool for synchronizing MySQL incremental data, with a detailed explanation of the core knowledge points

Lao Liu is a second-year graduate student who is about to look for a job. He writes this blog partly to summarize the knowledge points of big data development, and partly in the hope of helping readers teach themselves without having to ask others for help. Since Lao Liu taught himself big data development, there are bound to be some shortcomings in the blog; he hopes everyone will point them out so that we can all make progress together!

Background

Data sources in the big data field include business database data, mobile client tracking (embedded point) data, and server-side log data. When collecting data, we can use different collection tools according to the requirements of the downstream consumers. Today, Lao Liu introduces Canal, a tool for synchronizing MySQL incremental data. The outline of this article is as follows:

  1. Canal concept
  2. Principles of master-slave replication in mysql
  3. How Canal synchronizes data from MySQL
  4. Canal's HA mechanism design
  5. A brief summary of various data synchronization solutions

Lao Liu hopes that this article is enough to get you started with Canal directly, without having to spend extra time learning it elsewhere.

Principles of MySQL master-slave replication

Since Canal is used to synchronize incremental data in MySQL, Lao Liu will first explain the principle of MySQL master-slave replication, and then move on to Canal's core knowledge points.

Based on this diagram, Lao Liu breaks the principle of MySQL master-slave replication down into the following steps:

  1. The master server must first enable the binary log (binlog), which records every event that modifies the database's data.
  2. The master server writes these data changes into the binary log (a small sketch for inspecting the current binlog position follows this list).
  3. The slave server copies the master server's binary log into its local relay log. In more detail: the slave server first starts a worker I/O thread, which opens an ordinary client connection to the master; the master then starts a special binlog dump thread, which reads events from the master's binary log and sends them to the I/O thread, which saves them into the slave's relay log.
  4. The slave server starts a SQL thread, which reads events from the relay log and replays the data modifications locally, so that the slave's data is kept up to date with the master's.
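As a quick aside, here is a minimal sketch (not from the original article) of inspecting the master's current binlog position over JDBC, assuming a local MySQL reachable as root/root; SHOW MASTER STATUS reports the binlog file and offset that new events are appended to, which is exactly where a slave's I/O thread (or, as we will see, Canal) starts dumping from:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ShowMasterStatus {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://127.0.0.1:3306/?useSSL=false"; // assumed connection info
        try (Connection conn = DriverManager.getConnection(url, "root", "root");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW MASTER STATUS")) {
            if (rs.next()) {
                // File + Position identify where new binlog events are being written
                System.out.println("binlog file : " + rs.getString("File"));
                System.out.println("position    : " + rs.getLong("Position"));
            }
        }
    }
}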

That is the whole principle behind MySQL master-slave replication. After reading through this process, can you guess how Canal works?

Canal core knowledge

How Canal Works

Canal works by simulating the interaction protocol of a MySQL slave: it disguises itself as a MySQL slave and sends the dump protocol to the MySQL master. After the master receives the dump request, it starts pushing the binlog to Canal, and Canal then parses the binlog objects.

Canal concept

Canal (American pronunciation [kəˈnæl]) means waterway/pipe/channel. Its main purpose is to synchronize incremental data in MySQL (which can be understood as real-time data). It is an open-source project from Alibaba, developed in pure Java.

Canal architecture

A server represents one running Canal instance process and corresponds to one JVM. An instance corresponds to one data queue, and one Canal server can host 1..n instances. Each instance contains the following submodules:

  1. EventParser: data source access; simulates the slave protocol to interact with the master and parses the protocol
  2. EventSink: the linker between Parser and Store; handles data filtering, processing, and distribution
  3. EventStore: data storage
  4. MetaManager: incremental subscription & consumption information manager

Now that the basic concepts of Canal have been covered, let's talk about how Canal synchronizes MySQL incremental data.

How Canal synchronizes MySQL incremental data

Enable the MySQL binlog

The prerequisite for synchronizing MySQL incremental data with Canal is that MySQL's binlog is enabled. The binlog is enabled by default on Alibaba Cloud's MySQL databases, but if we install MySQL ourselves, we need to enable the binlog manually.

First find the MySQL configuration file:

/etc/my.cnf

Then enable the binlog with the following settings:

server-id=1
log-bin=mysql-bin
binlog-format=ROW
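To confirm that the settings took effect after restarting MySQL, here is a minimal sketch (not from the original article, again assuming a local MySQL reachable as root/root) that checks the log_bin and binlog_format variables over JDBC:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CheckBinlog {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://127.0.0.1:3306/?useSSL=false"; // assumed connection info
        try (Connection conn = DriverManager.getConnection(url, "root", "root");
             Statement stmt = conn.createStatement();
             // log_bin should be ON and binlog_format should be ROW for Canal
             ResultSet rs = stmt.executeQuery(
                     "SHOW VARIABLES WHERE Variable_name IN ('log_bin', 'binlog_format')")) {
            while (rs.next()) {
                System.out.println(rs.getString("Variable_name") + " = " + rs.getString("Value"));
            }
        }
    }
}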

There is a knowledge point about the binlog format that Lao Liu wants to tell you about.

The binlog has three formats: STATEMENT, ROW, and MIXED.

  1. ROW mode (the one usually used)

    The log records how each row of data was modified, without recording the context in which the SQL statement was executed. It only records which data was modified and what it was changed to; there are only values, with no SQL text and no multi-table join context.

    Advantages: It only needs to record which row was modified and what it was changed to, so the log clearly records the details of every row modification, which is very easy to understand.

    Disadvantages: In ROW mode, especially for bulk operations, every executed statement is recorded as per-row modifications, which can generate a very large amount of log content.

  2. STATEMENT mode

    Every SQL statement that modifies data is recorded.

    Disadvantages: Since it records the executed statements, in order for those statements to run correctly on the slave, it must also record some information about the context in which each statement was executed, to ensure that every statement produces the same result on the slave as it did on the master.

    Even so, certain functions currently cannot be replicated correctly in some versions; for example, using the last_insert_id() function in a stored procedure may cause the slave and master to end up with different IDs, i.e. inconsistent data. ROW mode does not have this problem.

  3. MIXED mode

    A combination of the above two modes.

Canal real-time synchronization

  1. First we need to configure the environment, in conf/example/instance.properties:
 ## mysql serverId
 canal.instance.mysql.slaveId = 1234
 #position info, change these to your own database information
 canal.instance.master.address = 127.0.0.1:3306
 canal.instance.master.journal.name =
 canal.instance.master.position =
 canal.instance.master.timestamp =
 #canal.instance.standby.address =
 #canal.instance.standby.journal.name =
 #canal.instance.standby.position =
 #canal.instance.standby.timestamp =
 #username/password, change these to your own database credentials
 canal.instance.dbUsername = canal
 canal.instance.dbPassword = canal
 canal.instance.defaultDatabaseName =
 canal.instance.connectionCharset = UTF-8
 #table regex
 canal.instance.filter.regex = .*\\..*

Among them, canal.instance.connectionCharset indicates the encoding of the database, expressed as the corresponding Java charset name, such as UTF-8, GBK, or ISO-8859-1.
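On a similar note, canal.instance.filter.regex controls which schema.table combinations are captured, using comma-separated Perl-style regular expressions. The small sketch below (not from the original article; the database name test is hypothetical) lists a few common patterns and shows that a filter can also be passed to the client's subscribe() call:

import java.net.InetSocketAddress;

import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;

public class FilterDemo {
    public static void main(String[] args) {
        // Common filter patterns (schema.table, comma separated, Perl regex style):
        //   .*\\..*    -> all tables in all databases
        //   test\\..*  -> all tables in the hypothetical test database
        //   test.user  -> only the user table in the hypothetical test database
        CanalConnector connector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("hadoop03", 11111), "example", "", "");
        connector.connect();
        // a non-empty client-side filter is generally used in place of the
        // server-side canal.instance.filter.regex setting
        connector.subscribe("test\\..*");
        connector.disconnect();
    }
}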

  2. After the configuration is done, start Canal:
 sh bin/startup.sh
 To stop it, use bin/stop.sh
  3. Observe the logs

    Generally use cat to view canal/canal.log and example/example.log.

  4. Start the client

    Write the client code in IDEA; when there is incremental data in MySQL, the client pulls it and prints it to the IDEA console.

    Add the following dependency to the pom.xml file:

 <dependency>
   <groupId>com.alibaba.otter</groupId>
   <artifactId>canal.client</artifactId>
   <version>1.0.12</version>
 </dependency>

Add client code:

import java.net.InetSocketAddress;
import java.util.List;

import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.CanalEntry;
import com.alibaba.otter.canal.protocol.Message;
import com.google.protobuf.InvalidProtocolBufferException;

public class Demo {

    public static void main(String[] args) {
        // create a connection to the canal server
        CanalConnector connector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("hadoop03", 11111), "example", "", "");
        connector.connect();
        // subscribe and roll back to the last unacknowledged position
        connector.subscribe();
        connector.rollback();
        int batchSize = 1000;
        int emptyCount = 0;
        int totalEmptyCount = 100;
        while (totalEmptyCount > emptyCount) {
            // fetch a batch without automatically acknowledging it
            Message msg = connector.getWithoutAck(batchSize);
            long id = msg.getId();
            List<CanalEntry.Entry> entries = msg.getEntries();
            if (id == -1 || entries.size() == 0) {
                emptyCount++;
                System.out.println("emptyCount : " + emptyCount);
                try {
                    Thread.sleep(3000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            } else {
                emptyCount = 0;
                printEntry(entries);
            }
            // acknowledge the batch so the server can release it
            connector.ack(id);
        }
    }

    // batch -> entries -> rowchange -> rowdata -> columns
    private static void printEntry(List<CanalEntry.Entry> entries) {
        for (CanalEntry.Entry entry : entries) {
            // skip transaction begin/end events; only row changes are interesting here
            if (entry.getEntryType() == CanalEntry.EntryType.TRANSACTIONBEGIN ||
                    entry.getEntryType() == CanalEntry.EntryType.TRANSACTIONEND) {
                continue;
            }
            CanalEntry.RowChange rowChange = null;
            try {
                rowChange = CanalEntry.RowChange.parseFrom(entry.getStoreValue());
            } catch (InvalidProtocolBufferException e) {
                e.printStackTrace();
            }
            CanalEntry.EventType eventType = rowChange.getEventType();
            System.out.println(entry.getHeader().getLogfileName() + " __ " +
                    entry.getHeader().getSchemaName() + " __ " + eventType);
            List<CanalEntry.RowData> rowDatasList = rowChange.getRowDatasList();
            for (CanalEntry.RowData rowData : rowDatasList) {
                for (CanalEntry.Column column : rowData.getAfterColumnsList()) {
                    System.out.println(column.getName() + " - " +
                            column.getValue() + " - " +
                            column.getUpdated());
                }
            }
        }
    }
}
  5. Write data into MySQL, and the client will print the incremental data to the console.
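You can write test data in any way you like (the mysql command line, a GUI client, and so on). Purely as an illustration, here is a minimal JDBC sketch, assuming a hypothetical test database with a user(id, name) table; none of these names come from the original article:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class InsertTestRow {
    public static void main(String[] args) throws Exception {
        // assumed connection info and a hypothetical test.user(id, name) table
        String url = "jdbc:mysql://127.0.0.1:3306/test?useSSL=false";
        try (Connection conn = DriverManager.getConnection(url, "canal", "canal");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO user (id, name) VALUES (?, ?)")) {
            ps.setInt(1, 1);
            ps.setString(2, "lao_liu");
            ps.executeUpdate(); // this row change should show up in the Demo client's console
        }
    }
}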

Canal's HA mechanism design

In the big data field, many frameworks have an HA mechanism. Canal's HA is divided into two parts; the Canal server and the Canal client each have their own HA implementation:

  1. Canal server: In order to reduce requests for MySQL dumps, only one instance across the different servers can be running at any one time; the others stay in standby state.
  2. Canal client: In order to guarantee ordering, only one Canal client can perform get/ack/rollback operations on a given instance at any one time; otherwise the order in which the client receives data cannot be guaranteed.

The control of the entire HA mechanism relies mainly on several features of ZooKeeper, which will not be discussed in detail here.
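For intuition only, here is a minimal sketch of the preempt-an-EPHEMERAL-node pattern that the steps below describe. This is not Canal's actual code; the ensemble addresses and the node path /canal-demo-running are assumptions for illustration:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("hadoop02:2181,hadoop03:2181,hadoop04:2181", 30000, null);
        String path = "/canal-demo-running";
        try {
            // whoever creates the EPHEMERAL node first wins and starts the instance;
            // the node disappears automatically if this server's session dies
            zk.create(path, "server-1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("I won the node, starting the canal instance...");
        } catch (KeeperException.NodeExistsException e) {
            // someone else is already running; watch the node and stay in standby,
            // the watcher fires when the node is deleted and we can try again
            zk.exists(path, event -> System.out.println("running node changed: " + event.getType()));
            System.out.println("Standby: another server is already running this instance");
        }
    }
}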

Canal Server:

  1. When a canal server wants to start a canal instance, it first makes an attempt with ZooKeeper by creating an EPHEMERAL node; whichever server creates the node successfully gets to start.
  2. After the ZooKeeper node is created successfully, the corresponding canal server starts its canal instance, while the servers that failed to create the node remain in standby state.
  3. Once ZooKeeper detects that the node created by the running canal server has disappeared, it immediately notifies the other canal servers to repeat step 1, and a new canal server is selected to start the instance.
  4. Each time the canal client connects, it first asks ZooKeeper which server has started the canal instance, and then establishes a connection with it; if the connection becomes unavailable, it tries to reconnect.
  5. The canal client works in a similar way to the canal server: it also uses ZooKeeper's preempt-an-EPHEMERAL-node approach for control (a minimal client-side sketch follows this list).
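On the client side this means that, instead of pointing newSingleConnector at a fixed host as in the earlier Demo, the canal client can locate the currently running server through ZooKeeper. A minimal sketch, assuming the same example destination and the ZooKeeper ensemble used in the HA configuration below:

import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;

public class ClusterClientSketch {
    public static void main(String[] args) {
        CanalConnector connector = CanalConnectors.newClusterConnector(
                "hadoop02:2181,hadoop03:2181,hadoop04:2181", "example", "", "");
        connector.connect();
        connector.subscribe();
        // ... the same getWithoutAck/ack loop as in the Demo client above ...
        connector.disconnect();
    }
}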

Configuring Canal HA and synchronizing data to Kafka in real time:

  1. Modify the conf/canal.properties file:
 canal.zkServers = hadoop02:2181,hadoop03:2181,hadoop04:2181
 canal.serverMode = kafka
 canal.mq.servers = hadoop02:9092,hadoop03:9092,hadoop04:9092
  2. Configure conf/example/instance.properties:
  # the slaveId of each of the two canal servers must be unique
  canal.instance.mysql.slaveId = 790
  # the Kafka topic to which the data is sent
  canal.mq.topic = canal_log
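Once canal.serverMode is set to kafka, the changes no longer go to a canal client but to the configured topic, so any ordinary Kafka consumer can read them. A minimal sketch, assuming the kafka-clients dependency is on the classpath and that Canal is sending flat-message JSON strings (otherwise the value would be a binary protobuf message):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CanalKafkaConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "hadoop02:9092,hadoop03:9092,hadoop04:9092");
        props.put("group.id", "canal-demo"); // assumed consumer group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("canal_log"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // each value is one change event pushed by canal
                    System.out.println(record.value());
                }
            }
        }
    }
}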

Summary of data synchronization solutions

After talking about the Canal tool, Lao Liu will now give you a brief summary of today's common data collection tools. This will not go into architectural details; it is a simple overview meant to leave an impression.

Common data collection tools include DataX, Flume, Canal, Sqoop, Logstash, and so on.

DataX (processing offline data)

DataX is an offline synchronization tool for heterogeneous data sources, open-sourced by Alibaba. Offline synchronization of heterogeneous data sources means synchronizing data from a source end to a destination end. However, there are many kinds of source and destination data sources, and before DataX the links between them formed a complex mesh structure, which was fragmented and made it impossible to abstract the core synchronization logic.

In order to solve the synchronization problem of heterogeneous data sources, DataX has transformed the complex meshed synchronization link into a star data link. As an intermediate transmission carrier, DataX is responsible for connecting various data sources.

Therefore, when you need to access a new data source, you only need to connect this data source to DataX, and you can seamlessly synchronize data with the existing data source.

DataX itself, as an offline data synchronization framework, is built with a Framework + plugin architecture. Reading from and writing to data sources are abstracted into Reader/Writer plug-ins that are incorporated into the overall synchronization framework.

  1. Reader: It is the data collection module, responsible for collecting data from the data source and sending the data to the Framework.
  2. Writer: It is a data writing module, which is responsible for continuously fetching data from the Framework and writing data to the destination.
  3. Framework: It connects the Reader and the Writer, acts as the data transmission channel between the two, and handles buffering, concurrency, data conversion and other issues (a conceptual sketch follows this list).
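Purely as a conceptual analogy, and not DataX's actual API, the Reader -> Framework(Channel) -> Writer relationship described above can be pictured as a producer/consumer pair connected by a bounded buffer:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ReaderWriterChannelSketch {
    public static void main(String[] args) throws Exception {
        // the "Framework": a bounded channel that buffers records between Reader and Writer
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(1024);

        // the "Reader": pulls records from a (hypothetical) source and pushes them into the channel
        Thread reader = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    channel.put("record-" + i);
                }
                channel.put("EOF"); // signal the end of the data
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // the "Writer": drains the channel and writes records to a (hypothetical) destination
        Thread writer = new Thread(() -> {
            try {
                String record;
                while (!"EOF".equals(record = channel.take())) {
                    System.out.println("write -> " + record);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        writer.start();
        reader.join();
        writer.join();
    }
}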

The core architecture of DataX is shown below:

Core module introduction:

  1. A single data synchronization task in DataX is called a Job. After DataX receives a Job, it starts a process to carry out the whole synchronization process.
  2. After the DataX Job starts, it is split into multiple small Tasks (subtasks) according to the source-side splitting strategy, so that they can be executed concurrently.
  3. After splitting into multiple Tasks, the DataX Job calls the Scheduler module, which regroups the split Tasks according to the configured concurrency into TaskGroups (task groups). Each TaskGroup is responsible for running its allocated Tasks with a certain concurrency; the default concurrency for a single TaskGroup is 5.
  4. Each Task is started by its TaskGroup. Once started, a Task launches the Reader -> Channel -> Writer threads to carry out the synchronization work.
  5. After the DataX Job starts, it monitors and waits for all TaskGroup tasks to complete; the Job exits successfully once every TaskGroup has finished, otherwise it exits abnormally.

Flume (processing real-time data)

Flume's main application scenario is synchronizing log data; it mainly consists of three components: Source, Channel, and Sink.

The biggest advantage of Flume is that the official website provides a rich set of Sources, Channels, and Sinks; for different business needs, we can find the relevant configurations on the official website. In addition, Flume also provides interfaces for customizing these components.

Logstash (processing offline data)

Logstash is a pipeline with real-time data transmission capability, responsible for transmitting data from the input end of the pipeline to the output end; the pipeline also lets you add filters in the middle according to your needs, and Logstash provides many powerful filters to cover a wide variety of application scenarios.

Logstash is written in JRuby, uses a simple message-based architecture, and runs on the JVM. The data flowing through the pipeline is called an event, and it passes through three stages: inputs, filters, and outputs.

Sqoop (processing offline data)

Sqoop is a tool for transferring data between Hadoop and relational databases. It is used to import data from a relational database such as MySQL into Hadoop HDFS, and to export data from the Hadoop file system back into a relational database. Under the hood Sqoop still runs MapReduce, so you must watch out for data skew when using it.

Summary

This article mainly covered the core knowledge points of the Canal tool and a comparison of common data collection tools. For the data collection tools, only the concepts and typical applications were discussed in general terms; the goal is simply to leave everyone with an impression. Lao Liu is confident that after reading this article you will be basically up and running; the rest is practice.

Alright, that is all for Canal, the tool for synchronizing MySQL incremental data. Although Lao Liu's current level may not match that of the big names, he will keep working to become better, so that every friend can learn on their own and never have to beg others for help!

If you have any related questions, please contact the official account: Lao Liu who works hard. If you have read this far, please like, follow, and support!
