1. Hive incremental query of a Hudi table
Synchronized Hive table names
When we write data, we can configure the Hive sync parameters so that corresponding Hive tables are generated for querying the Hudi table. Specifically, two Hive tables named after the table name are created during the write. For example, if table name = hudi_tbl, we get:
hudi_tbl: a read-optimized view of the dataset, backed by HoodieParquetInputFormat and providing purely columnar data
hudi_tbl_rt: a real-time view of the dataset, backed by HoodieParquetRealtimeInputFormat and providing a merged view of base and log data
The two descriptions above come from the official website. Some clarification: the real-time _rt table only exists when a MOR table syncs its metadata to Hive. When the table type is MOR and skipROSuffix=true is configured, hudi_tbl is the read-optimized view; when it is false (the default), the read-optimized view is named hudi_tbl_ro. And when the table type is COW, hudi_tbl is the real-time view. So pay attention when reading the official website's explanation of this part.
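For context, the tables above are produced by the Hive sync options supplied at write time. A minimal sketch of such options (the option names come from Hudi's Hive sync configuration; the values are illustrative assumptions for a MOR table named hudi_tbl):

```properties
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.database=default
hoodie.datasource.hive_sync.table=hudi_tbl
hoodie.datasource.hive_sync.partition_fields=dt
hoodie.datasource.write.table.type=MERGE_ON_READ
# when true, the read-optimized table is registered as hudi_tbl instead of hudi_tbl_ro
hoodie.datasource.hive_sync.skip_ro_suffix=true
```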
Incremental query
Modify the configuration in hive-site.xml
Add hoodie.* to the Hive SQL configuration whitelist (the other entries below are existing settings); you can also whitelist more as needed, for example tez.*|parquet.*|planner.*
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>hoodie.*|mapred.*|hive.*|mapreduce.*|spark.*</value>
</property>
Setting parameters
Take the table name hudi_tbl as an example
Connect to Hive (via a JDBC connection or the Hive shell)
Set the table as an incremental table
set hoodie.hudi_tbl.consume.mode=INCREMENTAL;
Set the start timestamp of the increment (exclusive); this filters at the file level and reduces the number of map tasks
set hoodie.hudi_tbl.consume.start.timestamp=20211015182330;
Set the number of commits to consume incrementally; the default is -1, which means consuming all increments up to the latest data
set hoodie.hudi_tbl.consume.max.commits=-1;
Modify the number of commits as needed
Query statement
select * from hudi_tbl where `_hoodie_commit_time` > "20211015182330";
Because of the small-file merging mechanism, a file with a new commit timestamp may also contain old data, so a where clause is needed for secondary filtering
Note: the parameters set here only take effect for the current connection/session
Hudi 0.9.0 only supports the table-name parameter and cannot restrict by database. Once hudi_tbl is set as an incremental table, querying a table with this name in any database becomes an incremental query, using the most recently set values for mode, start time, and so on. Later versions add a database qualifier, for example the hudi database
2. Spark SQL incremental query of a Hudi table
Programming method (DF+SQL)
First, look at the Spark SQL incremental query approach in the official documentation
Address 1: https://hudi.apache.org/cn/docs/quick-start-guide#incremental-query
Address 2: https://hudi.apache.org/cn/docs/querying_data#incremental-query
It first reads the Hudi table as a DataFrame by adding incremental parameters to spark.read, then registers the DataFrame as a temporary view, and finally queries the temporary view through Spark SQL to realize the incremental query
Parameters
hoodie.datasource.query.type: the query type; incremental means incremental query. The default is snapshot. Required for incremental queries
hoodie.datasource.read.begin.instanttime: incremental query start time. Required. Example: 20221126170009762
hoodie.datasource.read.end.instanttime: incremental query end time. Optional. Example: 20221126170023240
hoodie.datasource.read.incr.path.glob: restricts the incremental query to specific partition paths. Optional. Example: /dt=2022-11/
The query range is (BEGIN_INSTANTTIME, END_INSTANTTIME]: strictly greater than the start time (exclusive) and less than or equal to the end time (inclusive). If no end time is specified, all data newer than BEGIN_INSTANTTIME is queried. If INCR_PATH_GLOB is specified, only data under the matching partition paths is queried
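The range semantics above can be sketched in plain Python (a toy filter over commit-time strings, not Hudi's actual file-level pruning; the function name is mine):

```python
from typing import Optional

def in_incremental_range(commit_time: str, begin: str, end: Optional[str] = None) -> bool:
    """Toy version of Hudi's incremental range check: a commit qualifies
    if it is strictly after `begin` and, when `end` is given, at or before
    `end`. Commit times are yyyyMMddHHmmssSSS strings, so lexicographic
    comparison matches chronological order."""
    if commit_time <= begin:
        return False
    return end is None or commit_time <= end

commits = ["20221126165954300", "20221126170009762",
           "20221126170017119", "20221126170023240", "20221126170030470"]
begin, end = "20221126170009762", "20221126170023240"

selected = [c for c in commits if in_incremental_range(c, begin, end)]
print(selected)  # → ['20221126170017119', '20221126170023240']
```

With `end` omitted, everything after `begin` qualifies, matching the "latest data so far" behavior described above.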
Code example
import org.apache.hudi.DataSourceReadOptions.{BEGIN_INSTANTTIME, END_INSTANTTIME, INCR_PATH_GLOB, QUERY_TYPE, QUERY_TYPE_INCREMENTAL_OPT_VAL}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier
val tableName = "test_hudi_incremental"
spark.sql(
s"""
|create table $tableName (
| id int,
| name string,
| price double,
| ts long,
| dt string
|) using hudi
| partitioned by (dt)
| options (
| primaryKey = 'id',
| preCombineField = 'ts',
| type = 'cow'
| )
|""".stripMargin)
spark.sql(s"insert into $tableName values (1,'hudi',10,100,'2022-11-25')")
spark.sql(s"insert into $tableName values (2,'hudi',10,100,'2022-11-25')")
spark.sql(s"insert into $tableName values (3,'hudi',10,100,'2022-11-26')")
spark.sql(s"insert into $tableName values (4,'hudi',10,100,'2022-12-26')")
spark.sql(s"insert into $tableName values (5,'hudi',10,100,'2022-12-27')")
val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName))
val basePath = table.storage.properties("path")
// snapshot query for reference
spark.sql(s"select * from $tableName").show()
// list the commit times so we can pick an incremental range
spark.sql(s"select distinct(_hoodie_commit_time) as commit_time from $tableName order by commit_time desc").show(false)
val beginTime = "20221126170009762" // start of the range (exclusive)
val endTime = "20221126170023240"   // end of the range (inclusive)
// incrementally query data
val incrementalDF = spark.read.format("hudi").
  option(QUERY_TYPE.key, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME.key, beginTime).
  option(END_INSTANTTIME.key, endTime).
  option(INCR_PATH_GLOB.key, "/dt=2022-11*/*").
  load(basePath)
// alternatively, use .table(tableName) instead of load(basePath)
incrementalDF.createOrReplaceTempView(s"temp_$tableName")
spark.sql(s"select * from temp_$tableName").show()
spark.stop()
Result (first the full table snapshot, then the distinct commit times):
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| id|name|price| ts| dt|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
| 20221126165954300|20221126165954300...| id:1| dt=2022-11-25|de99b299-b9de-423...| 1|hudi| 10.0|100|2022-11-25|
| 20221126170009762|20221126170009762...| id:2| dt=2022-11-25|de99b299-b9de-423...| 2|hudi| 10.0|100|2022-11-25|
| 20221126170030470|20221126170030470...| id:5| dt=2022-12-27|75f8a760-9dc3-452...| 5|hudi| 10.0|100|2022-12-27|
| 20221126170023240|20221126170023240...| id:4| dt=2022-12-26|4751225d-4848-4dd...| 4|hudi| 10.0|100|2022-12-26|
| 20221126170017119|20221126170017119...| id:3| dt=2022-11-26|2272e513-5516-43f...| 3|hudi| 10.0|100|2022-11-26|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
+-----------------+
| commit_time|
+-----------------+
|20221126170030470|
|20221126170023240|
|20221126170017119|
|20221126170009762|
|20221126165954300|
+-----------------+
Taking beginTime = 20221126170009762 (exclusive) and endTime = 20221126170023240 (inclusive), the incremental result is:
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| id|name|price| ts| dt|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
| 20221126170017119|20221126170017119...| id:3| dt=2022-11-26|2272e513-5516-43f...| 3|hudi| 10.0|100|2022-11-26|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
After commenting out INCR_PATH_GLOB, the result is:
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| id|name|price| ts| dt|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
| 20221127155346067|20221127155346067...| id:4| dt=2022-12-26|33e7a2ed-ea28-428...| 4|hudi| 10.0|100|2022-12-26|
| 20221127155339981|20221127155339981...| id:3| dt=2022-11-26|a5652ae0-942a-425...| 3|hudi| 10.0|100|2022-11-26|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
Then comment out END_INSTANTTIME as well; the result:
20221127161253433
20221127161311831
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| id|name|price| ts| dt|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
| 20221127161320347|20221127161320347...| id:5| dt=2022-12-27|7b389e57-ca44-4aa...| 5|hudi| 10.0|100|2022-12-27|
| 20221127161311831|20221127161311831...| id:4| dt=2022-12-26|2707ce02-548a-422...| 4|hudi| 10.0|100|2022-12-26|
| 20221127161304742|20221127161304742...| id:3| dt=2022-11-26|264bc4a9-930d-4ec...| 3|hudi| 10.0|100|2022-11-26|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
You can see that the start time is exclusive while the end time is inclusive
Pure SQL
In real projects, the pure SQL approach is generally used for incremental queries, which is more convenient. Its parameters are the same as those described above. Next, let's see how to implement it with pure SQL.
Create a table and insert test data
create table hudi.test_hudi_incremental (
id int,
name string,
price double,
ts long,
dt string
) using hudi
partitioned by (dt)
options (
primaryKey = 'id',
preCombineField = 'ts',
type = 'cow'
);
insert into hudi.test_hudi_incremental values (1,'a1', 10, 1000, '2022-11-25');
insert into hudi.test_hudi_incremental values (2,'a2', 20, 2000, '2022-11-25');
insert into hudi.test_hudi_incremental values (3,'a3', 30, 3000, '2022-11-26');
insert into hudi.test_hudi_incremental values (4,'a4', 40, 4000, '2022-12-26');
insert into hudi.test_hudi_incremental values (5,'a5', 50, 5000, '2022-12-27');
Check which commit times exist:
select distinct(_hoodie_commit_time) from test_hudi_incremental order by _hoodie_commit_time;
+----------------------+
| _hoodie_commit_time |
+----------------------+
| 20221130163618650 |
| 20221130163703640 |
| 20221130163720795 |
| 20221130163726780 |
| 20221130163823274 |
+----------------------+
Pure SQL method (1)
Use the Call Procedures copy_to_temp_view and copy_to_table. Both commands have already been merged into master, contributed by scxwhite (Su Chengxiang), and they take similar parameters. copy_to_temp_view is recommended, because copy_to_table first persists the data to disk whereas copy_to_temp_view only creates a temporary view, which is more efficient. Moreover, persisting the data to disk is pointless here, and the on-disk table has to be deleted afterwards.
Supported parameters
table
query_type
view_name
begin_instance_time
end_instance_time
as_of_instant
replace
global
Test SQL
call copy_to_temp_view(table => 'test_hudi_incremental', query_type => 'incremental',
view_name => 'temp_incremental', begin_instance_time=> '20221130163703640', end_instance_time => '20221130163726780');
select _hoodie_commit_time, id, name, price, ts, dt from temp_incremental;
result
+----------------------+-----+-------+--------+-------+-------------+
| _hoodie_commit_time | id | name | price | ts | dt |
+----------------------+-----+-------+--------+-------+-------------+
| 20221130163726780 | 4 | a4 | 40.0 | 4000 | 2022-12-26 |
| 20221130163720795 | 3 | a3 | 30.0 | 3000 | 2022-11-26 |
+----------------------+-----+-------+--------+-------+-------------+
As you can see, this method realizes the incremental query. Note, however, that if you need to change the start time of the incremental query, you must call copy_to_temp_view again; since the temporary view temp_incremental already exists, you must either use a new view name or drop it first and then create a new one. I suggest dropping it first, with the following command
drop view if exists temp_incremental;
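If you do want to materialize the increment into a real table, copy_to_table can be called similarly. The sketch below assumes its parameters mirror copy_to_temp_view's, with a target table name instead of a view name; the parameter names here are my assumption, so verify them against the procedure documentation for your Hudi version:

```sql
call copy_to_table(table => 'test_hudi_incremental', query_type => 'incremental',
  new_table => 'test_hudi_incremental_copy', begin_instance_time => '20221130163703640',
  end_instance_time => '20221130163726780');
select _hoodie_commit_time, id, name, price, ts, dt from test_hudi_incremental_copy;
```

Remember that this copy is persisted, so drop the target table when you no longer need it.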
Pure SQL method (2)
PR address: https://github.com/apache/hudi/pull/7182
This PR was also contributed by scxwhite. It currently only supports Spark 3.2 and above, and the community has not merged it yet.
Incremental query SQL
select id, name, price, ts, dt from tableName
[
'hoodie.datasource.query.type'=>'incremental',
'hoodie.datasource.read.begin.instanttime'=>'$instant1',
'hoodie.datasource.read.end.instanttime'=>'$instant2'
]
This method supports a new syntax that appends the parameters in [] after the query. If you are interested, you can pull the code and try it yourself.
Pure SQL method (3)
The final effect is as follows
select
/*+
hoodie_prop(
'default.h1',
map('hoodie.datasource.read.begin.instanttime', '20221127083503537', 'hoodie.datasource.read.end.instanttime', '20221127083506081')
),
hoodie_prop(
'default.h2',
map('hoodie.datasource.read.begin.instanttime', '20221127083508715', 'hoodie.datasource.read.end.instanttime', '20221127083511803')
)
*/
id, name, price, ts
from (
select id, name, price, ts
from default.h1
union all
select id, name, price, ts
from default.h2
)
That is, the parameters related to the incremental query are added in a hint: first the table name is specified, then the parameters. However, the article does not seem to give a complete code address; you can try it yourself if you have time.
Pure SQL method (4)
This method comes from source code I modified following the way Hive incrementally queries Hudi; the incremental query is configured via set statements
PR address: https://github.com/apache/hudi/pull/7339
We already know that in Hudi's DefaultSource.createRelation, the options are table.storage.properties ++ pathOption, that is, the configuration in the table's own properties plus the path. No other parameters are accepted after that, so the query cannot be configured through set statements. The change makes the optParams that readDataSourceTable passes into createRelation include the parameters set in the session.
As with the Hive incremental query, specify the incremental query parameters for the specific table name:
set hoodie.test_hudi_incremental.datasource.query.type=incremental;
set hoodie.test_hudi_incremental.datasource.read.begin.instanttime=20221130163703640;
select _hoodie_commit_time, id, name, price, ts, dt from test_hudi_incremental;
+----------------------+-----+-------+--------+-------+-------------+
| _hoodie_commit_time | id | name | price | ts | dt |
+----------------------+-----+-------+--------+-------+-------------+
| 20221130163823274 | 5 | a5 | 50.0 | 5000 | 2022-12-27 |
| 20221130163726780 | 4 | a4 | 40.0 | 4000 | 2022-12-26 |
| 20221130163720795 | 3 | a3 | 30.0 | 3000 | 2022-11-26 |
+----------------------+-----+-------+--------+-------+-------------+
If tables with the same name exist in different databases, you can use the database.table form:
-- The configuration that qualifies table names with the database name must be enabled first; once enabled, the earlier settings without the database name no longer take effect
set hoodie.query.use.database = true;
set hoodie.hudi.test_hudi_incremental.datasource.query.type=incremental;
set hoodie.hudi.test_hudi_incremental.datasource.read.begin.instanttime=20221130163703640;
set hoodie.hudi.test_hudi_incremental.datasource.read.end.instanttime=20221130163726780;
set hoodie.hudi.test_hudi_incremental.datasource.read.incr.path.glob=/dt=2022-11*/*;
refresh table test_hudi_incremental;
select _hoodie_commit_time, id, name, price, ts, dt from test_hudi_incremental;
+----------------------+-----+-------+--------+-------+-------------+
| _hoodie_commit_time | id | name | price | ts | dt |
+----------------------+-----+-------+--------+-------+-------------+
| 20221130163720795 | 3 | a3 | 30.0 | 3000 | 2022-11-26 |
+----------------------+-----+-------+--------+-------+-------------+
You can also try the case of joins across tables in different databases yourself.
One thing to note: after updating the parameters, you need to run refresh table before querying; otherwise the modified parameters will not take effect, because the cached parameters are used.
This method only makes a small source-code change so that the parameters provided via set take effect for the query.
To save readers the trouble of building the package, here is the download address for hudi-spark3.1-bundle_2.12-0.13.0-SNAPSHOT.jar: https://download.csdn.net/download/dkl12/87221476
3. Flink SQL incremental query of a Hudi table
Official website document
Address: https://hudi.apache.org/cn/docs/querying_data#incremental-query
Parameters
read.start-commit: incremental query start time. For streaming reads, if this value is not specified, it defaults to the latest instantTime, i.e. streaming reads start from the latest instantTime (inclusive) by default. For batch reads, if this parameter is not specified and only read.end-commit is specified, time travel can be achieved to query historical records
read.end-commit: incremental query end time. If not specified, the latest records are read by default. This parameter generally only applies to batch reads, since streaming reads usually need all incremental data
read.streaming.enabled: whether to read as a stream; the default is false
read.streaming.check-interval: the check interval of streaming reads, in seconds; the default is 60, i.e. one minute
The query range is [BEGIN_INSTANTTIME, END_INSTANTTIME], including both the start time and the end time. For the default values, see the parameter descriptions above
Versions
For creating the table and test data:
Hudi 0.9.0
Spark 2.4.5
I use Hudi Spark SQL 0.9.0 to create the table here, in order to simulate Hudi tables created with the Java Client and Spark SQL in real projects, and to verify whether Flink SQL incremental queries are compatible with tables from the old Hudi version. (If you don't have this requirement, you can generate the data any way you like.)
For querying:
Hudi 0.13.0-SNAPSHOT
Flink 1.14.3 (incremental query)
Spark 3.1.2 (mainly for viewing commit information with the Call Procedures command)
Create a table and insert test data
-- Spark SQL Hudi 0.9.0
create table hudi.test_flink_incremental (
id int,
name string,
price double,
ts long,
dt string
) using hudi
partitioned by (dt)
options (
primaryKey = 'id',
preCombineField = 'ts',
type = 'cow'
);
insert into hudi.test_flink_incremental values (1,'a1', 10, 1000, '2022-11-25');
insert into hudi.test_flink_incremental values (2,'a2', 20, 2000, '2022-11-25');
update hudi.test_flink_incremental set name='hudi2_update' where id = 2;
insert into hudi.test_flink_incremental values (3,'a3', 30, 3000, '2022-11-26');
insert into hudi.test_flink_incremental values (4,'a4', 40, 4000, '2022-12-26');
Use show_commits to see which commits exist (the query here runs on Hudi master, because show_commits is supported since version 0.11.0; you can also view the .commit files under the .hoodie folder with hadoop commands):
call show_commits(table => 'hudi.test_flink_incremental');
20221205152736
20221205152723
20221205152712
20221205152702
20221205152650
Flink SQL creates Hudi memory table
CREATE TABLE test_flink_incremental (
id int PRIMARY KEY NOT ENFORCED,
name VARCHAR(10),
price double,
ts bigint,
dt VARCHAR(10)
)
PARTITIONED BY (dt)
WITH (
'connector' = 'hudi',
'path' = 'hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_flink_incremental'
);
The parameters related to the incremental query are not specified when creating the table; instead we specify them dynamically at query time, which is more flexible. To specify the parameters dynamically, append the following hint to the query statement:
/*+
options(
'read.start-commit' = '20221205152723',
'read.end-commit'='20221205152736'
)
*/
Batch reading
Flink SQL has two modes for reading Hudi tables: batch reading and streaming reading, with batch as the default. First, let's look at incremental queries with batch reads.
Verify that the start time is included and that the end time defaults to the latest:
select * from test_flink_incremental
/*+
options(
'read.start-commit' = '20221205152723' -- the start time corresponds to the record with id=3
)
*/
The result includes the start time; when no end time is specified, the latest data is read by default:
id name price ts dt
4 a4 40.0 4000 dt=2022-12-26
3 a3 30.0 3000 dt=2022-11-26
Verify that the end time is included
select * from test_flink_incremental
/*+
options(
'read.start-commit' = '20221205152712', -- the start time corresponds to the record with id=2
'read.end-commit'='20221205152723' -- the end time corresponds to the record with id=3
)
*/
The result includes the end time:
id name price ts dt
3 a3 30.0 3000 dt=2022-11-26
2 hudi2_update 20.0 2000 dt=2022-11-25
Verify the default start time
Here the end time is specified but the start time is not. (If neither is specified, all records of the latest version of the table are read.)
select * from test_flink_incremental
/*+
options(
'read.end-commit'='20221205152712' -- the end time corresponds to the update record of id=2
)
*/
Result: only the records corresponding to end-commit are returned:
id name price ts dt
2 hudi2_update 20.0 2000 dt=2022-11-25
Time travel (query history)
Verify that historical records can be queried. We updated the name of id 2: before the update it was a2, and afterwards hudi2_update. Let's verify that Flink SQL can query Hudi's historical records. The expected result is id=2, name=a2.
select * from test_flink_incremental
/*+
options(
'read.end-commit'='20221205152702' -- the end time corresponds to the historical record of id=2
)
*/
Result: the history is queried correctly:
id name price ts dt
2 a2 20.0 2000 dt=2022-11-25
Streaming read
Parameter to enable streaming read:
read.streaming.enabled = true
Streaming reads do not need an end time, because the typical requirement is to read all incremental data, so we only need to verify the start time.
Verify default start time
select * from test_flink_incremental
/*+
options(
'read.streaming.enabled'='true',
'read.streaming.check-interval' = '4'
)
*/
Result: the incremental read starts from the latest instantTime, i.e. read.start-commit defaults to the latest instantTime:
id name price ts dt
4 a4 40.0 4000 dt=2022-12-26
Verify specified start time
select * from test_flink_incremental
/*+
options(
'read.streaming.enabled'='true',
'read.streaming.check-interval' = '4',
'read.start-commit' = '20221205152712'
)
*/
result
id name price ts dt
2 hudi2_update 20.0 2000 dt=2022-11-25
3 a3 30.0 3000 dt=2022-11-26
4 a4 40.0 4000 dt=2022-12-26
If you want to query all historical data on the first run, you can set start-commit early enough, for example last year: 'read.start-commit' = '20211205152712'
select * from test_flink_incremental
/*+
options(
'read.streaming.enabled'='true',
'read.streaming.check-interval' = '4',
'read.start-commit' = '20211205152712'
)
*/
id name price ts dt
1 a1 10.0 1000 dt=2022-11-25
2 hudi2_update 20.0 2000 dt=2022-11-25
3 a3 30.0 3000 dt=2022-11-26
4 a4 40.0 4000 dt=2022-12-26
Verify the continuity of streaming reads
Verify that when new incremental data arrives, the stream continues to consume the Hudi incremental data, and check the accuracy and consistency of the data. To make verification easy, we can use a Flink SQL incremental streaming read of the Hudi table and sink it into a MySQL table, then verify the data's accuracy by reading the MySQL table.
Flink SQL needs an extra jar to read and write MySQL: just put flink-connector-jdbc_2.12-1.14.3.jar under lib. Download address: https://repo1.maven.org/maven2/org/apache/flink/flink-connector-jdbc_2.12/1.14.3/flink-connector-jdbc_2.12-1.14.3.jar
First create a Sink table in MySQL
-- MySQL
CREATE TABLE `test_sink` (
`id` int(11),
`name` text DEFAULT NULL,
`price` int(11),
`ts` int(11),
`dt` text DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Create the corresponding sink table in Flink
create table test_sink (
id int,
name string,
price double,
ts bigint,
dt string
) with (
'connector' = 'jdbc',
'url' = 'jdbc:mysql://192.468.44.128:3306/hudi?useSSL=false&useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8',
'username' = 'root',
'password' = 'root-123',
'table-name' = 'test_sink',
'sink.buffer-flush.max-rows' = '1'
);
Then incrementally stream-read the Hudi table and sink it into MySQL:
insert into test_sink
select * from test_flink_incremental
/*+
options(
'read.streaming.enabled'='true',
'read.streaming.check-interval' = '4',
'read.start-commit' = '20221205152712'
)
*/
This starts a long-running task that stays in the running state; we can confirm this in the yarn-session UI.
First, verify the accuracy of the historical data in MySQL.
Then use Spark SQL to insert two pieces of data into the source table
-- Spark SQL
insert into hudi.test_flink_incremental values (5,'a5', 50, 5000, '2022-12-07');
insert into hudi.test_flink_incremental values (6,'a6', 60, 6000, '2022-12-07');
We set the incremental read check interval to 4 s. After inserting the data and waiting 4 s, verify the data in the MySQL table.
The newly added data has been successfully sunk into MySQL, with no duplicates.
Finally, verify updated incremental data by updating the Hudi source table with Spark SQL:
-- Spark SQL
update hudi.test_flink_incremental set name='hudi5_update' where id = 5;
Checking the results again: the updated incremental data is also inserted into the MySQL sink table, but the original row is not updated.
So how do we achieve the update effect? We need to add a primary key field to both the MySQL and the Flink sink table; both are required, as follows
-- MySQL
CREATE TABLE `test_sink` (
`id` int(11),
`name` text DEFAULT NULL,
`price` int(11),
`ts` int(11),
`dt` text DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
-- Flink SQL
create table test_sink (
id int PRIMARY KEY NOT ENFORCED,
name string,
price double,
ts bigint,
dt string
) with (
'connector' = 'jdbc',
'url' = 'jdbc:mysql://192.468.44.128:3306/hudi?useSSL=false&useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8',
'username' = 'root',
'password' = 'root-123',
'table-name' = 'test_sink',
'sink.buffer-flush.max-rows' = '1'
);
Stop the long-running task started earlier, re-execute the insert statement above, let the historical data run through first, and finally verify the incremental effect
-- Spark SQL
update hudi.test_flink_incremental set name='hudi6_update' where id = 6;
insert into hudi.test_flink_incremental values (7,'a7', 70, 7000, '2022-12-07');
As you can see, the expected effect is achieved: the update is applied for id=6 and the insert for id=7