How to keep Elasticsearch synchronized with a relational database using Logstash and JDBC

To take advantage of the powerful search capabilities that Elasticsearch offers, many companies deploy Elasticsearch alongside an existing relational database. In such setups it is often necessary to keep the data in Elasticsearch synchronized with the data in the associated relational database. In this post, I will therefore demonstrate how to use Logstash to efficiently replicate and synchronize updated records from a relational database into Elasticsearch. The code and approach presented here have been tested with MySQL, but should in theory work with any relational database management system (RDBMS).

System Configuration

The approach described in this article has been tested with MySQL together with Logstash (using the JDBC input plugin) and Elasticsearch; the MySQL Connector/J JDBC driver, version 8.0.16, is referenced in the Logstash configuration later in this post.

An overview of the synchronization steps

In this post, we use Logstash and its JDBC input plugin to keep Elasticsearch synchronized with MySQL. Conceptually, the JDBC input plugin runs a loop that periodically polls MySQL for records that have been inserted or modified since the last iteration of the loop. For this to work correctly, the following conditions must be met:

  1. When documents from MySQL are written to Elasticsearch, the Elasticsearch "_id" field must be set to the "id" field from MySQL. This establishes a direct mapping between each MySQL record and the corresponding Elasticsearch document. If a record is updated in MySQL, the entire associated document will simply be overwritten in Elasticsearch. Note that overwriting a document in Elasticsearch is just as efficient as an update operation, because internally an update consists of deleting the old document and then indexing the new one.
  2. When data is inserted or updated in MySQL, the record must contain a field holding its update or insertion time. This field allows Logstash to request only the documents that have been edited or inserted since the last iteration of its polling loop. Each time Logstash polls MySQL, it stores the update or insertion time of the last record it read from MySQL. On its next iteration, Logstash then knows it only needs to request records whose update or insertion time is later than that of the last record received in the previous polling iteration.

If the above conditions are met, we can configure Logstash to periodically request all new or modified records from MySQL and write them to Elasticsearch. The Logstash code that accomplishes this is listed later in this post.

MySQL settings

You can use the following code to configure the MySQL database and table:

CREATE DATABASE es_db;
USE es_db;
DROP TABLE IF EXISTS es_table;
CREATE TABLE es_table (
  id BIGINT(20) UNSIGNED NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY unique_id (id),
  client_name VARCHAR(32) NOT NULL,
  modification_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  insertion_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

There are a few parameters in the above MySQL configuration that deserve special attention:

  • es_table: The name of the MySQL table from which records are read and synchronized to Elasticsearch.
  • id: The unique identifier of each record. Note that "id" is defined as both the PRIMARY KEY and a UNIQUE KEY, which guarantees that each "id" appears only once in the table. It will be translated to "_id" when updating or inserting the corresponding document into Elasticsearch.
  • client_name: The field holding the user-defined data stored in each record. For simplicity, this post uses a single field of user-defined data, but more fields could easily be added. This is the field we will modify in order to show that not only are newly inserted MySQL records copied to Elasticsearch, but updates to existing records are also correctly propagated.
  • modification_time: Whenever a record is inserted or changed in MySQL, the value of this field is set to the time of the edit. This modification time lets us pull any record that has been edited since the last time Logstash requested records from MySQL.
  • insertion_time: This field is mostly here for demonstration purposes and is not strictly required for correct synchronization. We use it to track the time at which a record was originally inserted into MySQL. A short illustration of how these two timestamp columns behave follows this list.
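As a minimal sketch (the id value 42 and the client names are hypothetical, not taken from the original post), the following statements show that MySQL refreshes modification_time automatically on every update thanks to ON UPDATE CURRENT_TIMESTAMP, while insertion_time keeps its original value:

-- Insert a row: both timestamp columns are set to the current time.
INSERT INTO es_table (id, client_name) VALUES (42, 'Initial name');

-- Update the row: modification_time is refreshed automatically,
-- insertion_time is left unchanged.
UPDATE es_table SET client_name = 'Renamed' WHERE id = 42;

-- Compare the two timestamp columns.
SELECT id, client_name, insertion_time, modification_time FROM es_table WHERE id = 42;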

MySQL operations

After the above configuration, records can be written to MySQL with the following statement:

INSERT INTO es_table (id, client_name) VALUES (<id>, <client name>);

You can update records in MySQL with the following command:

UPDATE es_table SET client_name = <new client name> WHERE id=<id>;

An update/insert (upsert) in MySQL can be accomplished with the following statement:

INSERT INTO es_table (id, client_name) VALUES (<id>, <client name when created>) ON DUPLICATE KEY UPDATE client_name=<client name when updated>;

Synchronization code

The following Logstash pipeline implements the synchronization code described in the previous section:

input {
  jdbc {
    jdbc_driver_library => "<path>/mysql-connector-java-8.0.16.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://<MySQL host>:3306/es_db"
    jdbc_user => <my username>
    jdbc_password => <my password>
    jdbc_paging_enabled => true
    tracking_column => "unix_ts_in_secs"
    use_column_value => true
    tracking_column_type => "numeric"
    schedule => "*/5 * * * * *"
    statement => "SELECT *, UNIX_TIMESTAMP(modification_time) AS unix_ts_in_secs FROM es_table WHERE (UNIX_TIMESTAMP(modification_time) > :sql_last_value AND modification_time < NOW()) ORDER BY modification_time ASC"
  }
}
filter {
  mutate {
    copy => { "id" => "[@metadata][_id]"}
    remove_field => ["id", "@version", "unix_ts_in_secs"]
  }
}
output {
  # stdout { codec =>  "rubydebug"}
  elasticsearch {
      index => "rdbms_sync_idx"
      document_id => "%{[@metadata][_id]}"
  }
}
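
As a usage sketch (the file name mysql-sync.conf is an assumption, not part of the original post), the pipeline above can be saved to a file and started from the Logstash installation directory:

# Start Logstash with the pipeline; it will poll MySQL every 5 seconds as configured above.
bin/logstash -f mysql-sync.conf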

There are several areas of this pipeline that deserve attention:

  • tracking_column: This field specifies the "unix_ts_in_secs" column (described below), which tracks the last document that Logstash read from MySQL and is persisted on disk in the file .logstash_jdbc_last_run. That value determines the starting point for the documents Logstash requests in the next iteration of its polling loop. The value stored in .logstash_jdbc_last_run can be accessed inside the SELECT statement as ":sql_last_value".
  • unix_ts_in_secs: A field generated by the SELECT statement, containing "modification_time" as a standard Unix timestamp (seconds since the Epoch). It is referenced by the "tracking column" we just discussed. A Unix timestamp, rather than a plain timestamp, is used to track progress because a plain timestamp could cause errors given how complex it is to correctly convert back and forth between UTC and the local time zone.
  • sql_last_value: A built-in parameter containing the starting point of the current iteration of Logstash's polling loop; it is referenced in the SELECT statement of the JDBC input configuration above. It is set to the most recent value of "unix_ts_in_secs", as read from .logstash_jdbc_last_run, and serves as the starting point for the documents returned by the MySQL query that Logstash executes in its polling loop. Including this variable in the query guarantees that inserts or updates previously propagated to Elasticsearch are not sent to Elasticsearch again. (A sketch of the fully substituted query appears after this list.)
  • schedule: Uses cron syntax to specify how often Logstash should poll MySQL for changes. The value "*/5 * * * * *" shown here tells Logstash to contact MySQL every 5 seconds.
  • modification_time < NOW(): This portion of the SELECT statement is one of the trickier concepts to explain, so it is covered in detail in the next section.
  • filter: In this section we simply copy the "id" value from the MySQL record into a metadata field called "_id", which we later reference in the output to ensure that each document is written to Elasticsearch with the correct "_id" value. Using a metadata field ensures that this temporary value does not cause a new field to be created. We also remove the "id", "@version", and "unix_ts_in_secs" fields from the document, since we do not want them written to Elasticsearch.
  • output: In this section we specify that each document should be written to Elasticsearch and should be assigned an "_id" extracted from the metadata field we created in the filter section. There is also a commented-out rubydebug output that can be enabled to help with debugging.
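
To make the interaction between these settings concrete, here is a sketch of the query that Logstash would effectively send to MySQL on one polling iteration, assuming a hypothetical tracking value of 1560861568 stored in .logstash_jdbc_last_run (when jdbc_paging_enabled is true, Logstash may additionally wrap the query for paging):

SELECT *, UNIX_TIMESTAMP(modification_time) AS unix_ts_in_secs
FROM es_table
WHERE (UNIX_TIMESTAMP(modification_time) > 1560861568
       AND modification_time < NOW())
ORDER BY modification_time ASC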

Analysis of SELECT statement correctness

In this section, we explain in more detail why including modification_time < NOW() in the SELECT statement is crucial. To help explain this concept, we first give a couple of counterexamples that show why the two most intuitive approaches do not work. We then explain how adding modification_time < NOW() overcomes the problems caused by those intuitive approaches.

First intuitive approach

In this section we demonstrate what happens if the WHERE clause does not include modification_time < NOW() and instead only specifies UNIX_TIMESTAMP(modification_time) > :sql_last_value. In this case, the SELECT statement is as follows:

statement => "SELECT *, UNIX_TIMESTAMP(modification_time) AS unix_ts_in_secs FROM es_table WHERE (UNIX_TIMESTAMP(modification_time) > :sql_last_value) ORDER BY modification_time ASC"

At first glance, the above approach looks like it should work correctly, but there are edge cases in which it can miss documents. As an example, assume that MySQL is inserting two documents per second and that Logstash executes its SELECT statement every 5 seconds. This is illustrated in the diagram below, where each second is labeled T0 through T10 and the records in MySQL are labeled R1 through R22. We assume that the first iteration of the Logstash polling loop takes place at T5 and reads documents R1 through R11, shown in the cyan boxes. The value stored in sql_last_value is now T5, since that is the timestamp of the last record read (R11). We also assume that, just after Logstash has finished reading documents from MySQL, another document R12 with a timestamp of T5 is inserted into MySQL.

In the next iteration of the above SELECT statement, we only pull documents whose timestamp is later than T5 (as instructed by WHERE (UNIX_TIMESTAMP(modification_time) > :sql_last_value)), which means record R12 will be skipped. This is shown in the diagram below, where the cyan boxes represent records read by Logstash in the current iteration and the gray boxes represent records previously read by Logstash.

Note that with the SELECT statement of this scenario, record R12 will never be written to Elasticsearch.

Second intuitive approach

To solve the above problem, you might decide to change the WHERE clause to greater than or equal to, as follows:

statement => "SELECT *, UNIX_TIMESTAMP(modification_time) AS unix_ts_in_secs FROM es_table WHERE (UNIX_TIMESTAMP(modification_time) >= :sql_last_value) ORDER BY modification_time ASC"

However, this implementation strategy is not ideal either. The problem in this case is that the most recent documents read from MySQL in the most recent time interval will be sent to Elasticsearch again. While this does not affect the correctness of the results, it does create unnecessary work. As in the previous scenario, the diagram below shows which documents have been read from MySQL after the initial Logstash polling iteration.

When the subsequent Logstash polling iteration executes, we pull all documents whose timestamp is later than or equal to T5, as shown in the diagram below. Note that record R11 (shown in purple) is sent to Elasticsearch again.

Neither of these two scenarios is ideal. In the first, data is lost; in the second, redundant data is read from MySQL and sent to Elasticsearch.

How the problems with the intuitive approaches are fixed

Since neither of the first two scenarios is ideal, a different approach is needed. By specifying (UNIX_TIMESTAMP(modification_time) > :sql_last_value AND modification_time < NOW()), we send every document to Elasticsearch, and each document is sent exactly once.

See the diagram below, in which the current Logstash poll executes at T5. Note that because modification_time < NOW() must be satisfied, only documents up to, but not including, period T5 are read from MySQL. Since we have extracted all of the documents from T4 and none from T5, we know that sql_last_value will be set to T4 for the next Logstash polling iteration.

The diagram below shows what happens on the next iteration of the Logstash poll. Because UNIX_TIMESTAMP(modification_time) > :sql_last_value and sql_last_value is set to T4, we know that only documents from T5 onward will be extracted. Furthermore, because only documents satisfying modification_time < NOW() are extracted, only documents up to and including T9 will be pulled. Once again, this means that all of the documents in T9 have been extracted and that sql_last_value will be set to T9 for the next iteration. This approach therefore eliminates the risk of retrieving only a subset of the MySQL documents for any given time interval.

Testing the system

We can run a few simple tests to show that our implementation works as expected. We can write records to MySQL with the following commands:

INSERT INTO es_table (id, client_name) VALUES (1, 'Jim Carrey');
INSERT INTO es_table (id, client_name) VALUES (2, 'Mike Myers');
INSERT INTO es_table (id, client_name) VALUES (3, 'Bryan Adams');

Once the JDBC input schedule has triggered reading records from MySQL and writing them to Elasticsearch, we can run the following Elasticsearch query to see the documents in Elasticsearch:

GET rdbms_sync_idx/_search

which returns something similar to the following response:

"hits" : {
    "total" : {
      "value" :3,
      "relation" : "eq"
    },
    "max_score" :1.0,
    "hits" : [
      {
        "_index" : "rdbms_sync_idx",
        "_type" : "_doc",
        "_id" :"1",
        "_score" :1.0,
        "_source" : {
          "insertion_time" :"2019-06-18T12:58:56.000Z",
          "@timestamp" :"2019-06-18T13:04:27.436Z",
          "modification_time" :"2019-06-18T12:58:56.000Z",
          "client_name" :"Jim Carrey"
        }
      },
Etc …

We can then update the document in MySQL corresponding to _id=1 with the following command:

UPDATE es_table SET client_name = 'Jimbo Kerry' WHERE id=1;

which correctly updates the document identified by an _id of 1. We can look at the document in Elasticsearch directly by running the following:

GET rdbms_sync_idx/_doc/1

which returns a document that looks like this:

{
  "_index" : "rdbms_sync_idx",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,
  "_seq_no" : 3,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "insertion_time" : "2019-06-18T12:58:56.000Z",
    "@timestamp" : "2019-06-18T13:09:30.300Z",
    "modification_time" : "2019-06-18T13:09:28.000Z",
    "client_name" : "Jimbo Kerry"
  }
}

Notice that _version is now set to 2, that modification_time is now different from insertion_time, and that the client_name field has been correctly updated to the new value. The @timestamp field is not particularly relevant in this example; it is added by Logstash by default.

An update/insert (upsert) into MySQL can be performed with the following command, and you can verify that the correct information is reflected in Elasticsearch:

INSERT INTO es_table (id, client_name) VALUES (4, 'Bob is new') ON DUPLICATE KEY UPDATE client_name='Bob exists already';
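
As a quick check (a sketch, assuming the next scheduled Logstash poll has already run), the document created or updated by the upsert above can be fetched directly from Elasticsearch by its id:

GET rdbms_sync_idx/_doc/4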

What about deleting documents?

The astute reader may have noticed that if a document is deleted from MySQL, that deletion is not propagated to Elasticsearch. The following approaches can be considered to address this issue:

  1. MySQL records could include an "is_deleted" field that indicates whether the record is still valid. This approach is known as a "soft delete". As with any other update to a record in MySQL, the "is_deleted" field is propagated to Elasticsearch through Logstash. If this approach is implemented, then Elasticsearch and MySQL queries must be written so that records/documents where "is_deleted" is true are excluded (a sketch of such an Elasticsearch query appears after this list). Eventually, a background job can physically remove such documents from both MySQL and Elasticsearch.
  2. Another alternative is to ensure that any system responsible for deleting records from MySQL also subsequently executes a command to delete the corresponding documents directly from Elasticsearch.
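
A minimal sketch of such an exclusion query, assuming the soft-delete flag is propagated to Elasticsearch as a boolean field named "is_deleted", could look like this:

GET rdbms_sync_idx/_search
{
  "query": {
    "bool": {
      "must_not": [
        { "term": { "is_deleted": true } }
      ]
    }
  }
}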
