Building a Unified Data Lake on Amazon EMR with Apache Flink


To build a data-driven enterprise, it is important to democratize enterprise data assets in a data catalog. With a unified data catalog, you can quickly search datasets and determine data schema, data format, and location. Amazon Glue Data Catalog provides a unified repository that enables disparate systems to store and find metadata to track data across data silos.

Apache Flink is a widely used data processing engine for scalable streaming ETL, analytics, and event-driven applications. It provides precise time and state management with fault tolerance. Flink can process both bounded streams (batch processing) and unbounded streams (stream processing) using a unified API or application. After data is processed with Apache Flink, downstream applications can access the curated data with a unified data catalog. With unified metadata, both data processing and data consuming applications can use the same metadata to access the tables.
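
As a brief illustration of the unified API, the following Flink SQL sketch shows that the same query can run as a continuous streaming job or as a one-shot batch job simply by switching the runtime mode; the table name here is a placeholder, not part of this post's walkthrough.

SET execution.runtime-mode = streaming;   -- process an unbounded stream continuously
-- SET execution.runtime-mode = batch;    -- or process the same bounded input once
SELECT COUNT(*) FROM orders;              -- 'orders' is a placeholder table name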

This post shows you how to integrate Apache Flink in Amazon EMR with Amazon Glue Data Catalog so that you can ingest streaming data in real time and access the data in near real time for business analysis.

Apache Flink connectors and catalog architecture

Apache Flink uses connectors and catalogs to interact with data and metadata. The diagram below shows the Apache Flink connectors for reading/writing data and the catalogs for reading/writing metadata.


For reading/writing data, Flink provides the DynamicTableSourceFactory interface for read operations and the DynamicTableSinkFactory interface for write operations. A Flink connector implements these two interfaces to access data in a specific store. For example, the Flink FileSystem connector provides FileSystemTableFactory to read/write data in Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3), the Flink HBase connector provides HBase2DynamicTableFactory to read/write data in HBase, and the Flink Kafka connector provides KafkaDynamicTableFactory to read/write data in Kafka. You can refer to Table & SQL Connectors (https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/table/overview/) for more information.
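
As a minimal sketch of how this works, the connector option in a Flink SQL table definition selects the factory implementation that reads and writes the data; the table name, bucket, and path below are placeholders, not part of the original walkthrough.

CREATE TABLE example_s3_table (
  id BIGINT,
  name STRING
) WITH (
  'connector' = 'filesystem',              -- resolved to FileSystemTableFactory
  'path' = 's3://<your-bucket>/example/',  -- placeholder S3 location
  'format' = 'parquet'
);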

For reading/writing metadata, Flink provides the catalog interface. Flink has three built-in implementations of the catalog. GenericInMemoryCatalog stores the catalog data in memory. JdbcCatalog stores the catalog data in a JDBC-backed relational database; as of this writing, the JDBC catalog supports MySQL and PostgreSQL databases. HiveCatalog stores the catalog data in Hive Metastore and uses HiveShim to provide compatibility with different Hive versions. We can configure different metastore clients to use Hive Metastore or Amazon Glue Data Catalog. In this post, we configure the Amazon EMR (http://aws.amazon.com/emr) property hive.metastore.client.factory.class to com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory (see Using Amazon Glue Data Catalog as Hive's Metastore, https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html), so we can use Amazon Glue Data Catalog to store Flink catalog data. See Catalogs (https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/table/catalogs/) for more information.
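
As a minimal sketch (the catalog names are placeholders), the type option in a CREATE CATALOG statement selects which implementation Flink uses; with the Amazon Glue client factory configured on Amazon EMR, the hive type persists metadata in the Data Catalog, whereas generic_in_memory keeps it only for the current session.

CREATE CATALOG session_only_catalog WITH (
  'type' = 'generic_in_memory'               -- metadata lives only in this session
);

CREATE CATALOG persistent_catalog WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/etc/hive/conf.dist'    -- default Hive conf location on Amazon EMR
);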

Most of Flink's built-in connectors (such as Kafka, Amazon Kinesis, Amazon DynamoDB, Elasticsearch, or FileSystem) can use Flink HiveCatalog to store metadata in Amazon Glue Data Catalog. However, some connector implementations (such as Apache Iceberg) have their own catalog management mechanism. FlinkCatalog in Iceberg implements the catalog interface in Flink and wraps Iceberg's own catalog implementation. The diagram below shows the relationship between Apache Flink, the Iceberg connector, and the catalogs. See Creating Catalogs and Using Catalogs (https://iceberg.apache.org/docs/latest/flink/#creating-catalogs-and-using-catalogs) and Catalogs for more information.


Apache Hudi also has its own catalog management capabilities. Both HoodieCatalog and HoodieHiveCatalog implement the catalog interface in Flink. HoodieCatalog stores metadata in a file system such as HDFS. HoodieHiveCatalog stores metadata in the Hive Metastore or the Amazon Glue Data Catalog, depending on whether you configured hive.metastore.client.factory.class to use com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory. The diagram below shows the relationship between Apache Flink, Hudi connectors, and catalogs. See Create Catalog (https://hudi.apache.org/docs/table_management#create-catalog) for more information.


Since Iceberg and Hudi have different catalog management mechanisms, we show three scenarios for integrating Flink with Amazon Glue Data Catalog in this article:

  • Read/write Iceberg tables in Flink using metadata from Glue Data Catalog

  • Read/write Hudi tables in Flink using metadata from Glue Data Catalog

  • Read/write other storage formats in Flink using metadata from the Glue Data Catalog

Solution overview

The diagram below shows the overall architecture of the solution described in this article.


In this solution, we enable the Amazon RDS for MySQL binlog to capture transactional changes in real time. The Flink CDC connector on Amazon EMR reads the binlog data and processes it. Transformed data can be stored in Amazon S3. We use Amazon Glue Data Catalog to store metadata such as table schemas and table locations. Downstream data consumer applications such as Amazon Athena or Amazon EMR Trino access the data for business analysis.

The high-level steps to set up this solution are as follows:

  1. Enable binlog for Amazon RDS for MySQL and initialize the database.

  2. Create an EMR cluster using Amazon Glue Data Catalog.

  3. Ingest change data capture (CDC) data using Apache Flink CDC in Amazon EMR.

  4. Store processed data in Amazon S3 and metadata in Amazon Glue Data Catalog.

  5. Confirm that all table metadata is stored in the Amazon Glue Data Catalog.

  6. Use data for business analytics with Athena or Amazon EMR Trino.

  7. Update and delete source records in Amazon RDS for MySQL and verify corresponding changes in data lake tables.

Prerequisites

This post uses an Amazon Identity and Access Management (IAM) role with permissions for the following services:

  • Amazon RDS for MySQL (5.7.40)

  • Amazon EMR (6.9.0)

  • Amazon Athena

  • Amazon Glue Data Catalog

  • Amazon S3

Enable binlog for Amazon RDS for MySQL and initialize the database

To enable CDC in Amazon RDS for MySQL, we need to configure binary logging for Amazon RDS for MySQL. For more information, see Configuring MySQL Binary Logging (https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_LogAccess.MySQL.BinaryFormat.html). We also create the database salesdb in MySQL and create the customer, order, and other tables to set up the data source.

1. On the Amazon RDS console, choose Parameter groups in the navigation pane.

2. Create a new parameter group for MySQL.

3. Edit the parameter group you just created, setting binlog_format=ROW.


4. Edit the parameter group you just created, setting binlog_row_image=full.


5. Create an RDS for MySQL DB instance using the parameter group.

6. Note the values for hostname, username, and password; we will use these later.

7. Run the following command to download the MySQL database initialization script from Amazon S3:

aws s3 cp s3://emr-workshops-us-west-2/glue_immersion_day/scripts/salesdb.sql ./salesdb.sql


8. Connect to the RDS for MySQL database and run the salesdb.sql script to initialize the database, providing the hostname and username according to your RDS for MySQL database configuration:

mysql -h <hostname> -u <username> -p
mysql> source salesdb.sql
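
Optionally, once connected you can confirm that the binlog parameters from the parameter group are in effect; this is a quick check, not part of the original setup:

mysql> SHOW VARIABLES LIKE 'binlog_format';     -- expected value: ROW
mysql> SHOW VARIABLES LIKE 'binlog_row_image';  -- expected value: FULL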


Create an EMR cluster using Amazon Glue Data Catalog

Starting with Amazon EMR 6.9.0, the Flink Table API/SQL can integrate with Amazon Glue Data Catalog. To use the Flink integration with Amazon Glue, you must create a cluster with Amazon EMR release 6.9.0 or later.

1. Create the file iceberg.properties for the Amazon EMR Trino integration with Data Catalog. For the Iceberg table format, your file should contain the following:

iceberg.catalog.type=glue
connector.name=iceberg

2. Upload iceberg.properties to an S3 bucket, for example DOC-EXAMPLE-BUCKET.

For more information on how to integrate Amazon EMR Trino with Iceberg, see Using an Iceberg cluster with Trino (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg-use-trino-cluster.html).

3. Create the trino-glue-catalog-setup.sh file to configure the Trino integration with Data Catalog. Use trino-glue-catalog-setup.sh as a bootstrap script. Your file should contain the following (replace DOC-EXAMPLE-BUCKET with your S3 bucket name):

set -ex 
sudo aws s3 cp s3://DOC-EXAMPLE-BUCKET/iceberg.properties /etc/trino/conf/catalog/iceberg.properties


4. Upload trino-glue-catalog-setup.sh to the S3 bucket (DOC-EXAMPLE-BUCKET).

To run a bootstrap script, see Creating a Bootstrap Action to Install Additional Software (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html).

5. Create the flink-glue-catalog-setup.sh file to configure Flink integration with Data Catalog.

6. Using the script runner, run the flink-glue-catalog-setup.sh script as an Amazon EMR step.

Your file should contain the following (the JAR file names here are for Amazon EMR 6.9.0; JAR names may differ in later releases, so be sure to update them for your Amazon EMR version).

Note that we use an Amazon EMR step, rather than a bootstrap action, to run this script. An Amazon EMR step script runs after Amazon EMR Flink is provisioned.

set -ex

# Copy the Glue Data Catalog client and its Hive dependencies into the Flink library directory
sudo cp /usr/lib/hive/auxlib/aws-glue-datacatalog-hive3-client.jar /usr/lib/flink/lib
sudo cp /usr/lib/hive/lib/antlr-runtime-3.5.2.jar /usr/lib/flink/lib
sudo cp /usr/lib/hive/lib/hive-exec.jar /usr/lib/flink/lib
sudo cp /usr/lib/hive/lib/libfb303-0.9.3.jar /usr/lib/flink/lib
sudo cp /usr/lib/flink/opt/flink-connector-hive_2.12-1.15.2.jar /usr/lib/flink/lib
sudo chmod 755 /usr/lib/flink/lib/aws-glue-datacatalog-hive3-client.jar
sudo chmod 755 /usr/lib/flink/lib/antlr-runtime-3.5.2.jar
sudo chmod 755 /usr/lib/flink/lib/hive-exec.jar
sudo chmod 755 /usr/lib/flink/lib/libfb303-0.9.3.jar
sudo chmod 755 /usr/lib/flink/lib/flink-connector-hive_2.12-1.15.2.jar

# Download the Flink CDC connector for MySQL into the Flink library directory
sudo wget https://repo1.maven.org/maven2/com/ververica/flink-sql-connector-mysql-cdc/2.2.1/flink-sql-connector-mysql-cdc-2.2.1.jar -O /usr/lib/flink/lib/flink-sql-connector-mysql-cdc-2.2.1.jar
sudo chmod 755 /usr/lib/flink/lib/flink-sql-connector-mysql-cdc-2.2.1.jar

# Link the Iceberg and Hudi Flink runtime bundles shipped with Amazon EMR
sudo ln -s /usr/share/aws/iceberg/lib/iceberg-flink-runtime.jar /usr/lib/flink/lib/
sudo ln -s /usr/lib/hudi/hudi-flink-bundle.jar /usr/lib/flink/lib/

# Swap in the full table planner, which the Hive dialect and Hive catalog require
sudo mv /usr/lib/flink/opt/flink-table-planner_2.12-1.15.2.jar /usr/lib/flink/lib/
sudo mv /usr/lib/flink/lib/flink-table-planner-loader-1.15.2.jar /usr/lib/flink/opt/


7. Upload flink-glue-catalog-setup.sh to the S3 bucket (DOC-EXAMPLE-BUCKET).

For more information on how to configure Flink and the Hive metastore, see Configuring Flink as a Hive Metastore in Amazon EMR (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/flink-configure.html).

For more details on running Amazon EMR step scripts, see Running Commands and Scripts on Amazon EMR Clusters (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-commandrunner.html).

8. Create an EMR 6.9.0 cluster with Hive, Flink, and Trino applications.

You can create an EMR cluster using the Amazon Command Line Interface (Amazon CLI) or the Amazon Management Console. For instructions, see the appropriate subsection.

Create an EMR cluster using the Amazon CLI

To use the Amazon CLI, complete the following steps:

1. Create the emr-flink-trino-glue.json file to configure Amazon EMR to use Data Catalog. Your file should contain the following:

[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  },
  {
    "Classification": "trino-connector-hive",
    "Properties": {
      "hive.metastore": "glue"
    }
  }
]


2. Run the following command to create an EMR cluster. Provide your local emr-flink-trino-glue.json parent folder path, S3 bucket, EMR cluster region, EC2 key name, and S3 bucket where EMR logs are stored.

aws emr create-cluster --release-label emr-6.9.0 \
--applications Name=Hive Name=Flink Name=Spark Name=Trino \
--region us-west-2 \
--name flink-trino-glue-emr69 \
--configurations "file:///<your configuration path>/emr-flink-trino-glue.json" \
--bootstrap-actions '[{"Path":"s3://DOC-EXAMPLE-BUCKET/trino-glue-catalog-setup.sh","Name":"Add iceberg.properties for Trino"}]' \
--steps '[{"Args":["s3://DOC-EXAMPLE-BUCKET/flink-glue-catalog-setup.sh"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"Flink-glue-integration"}]' \
--instance-groups \
InstanceGroupType=MASTER,InstanceType=m6g.2xlarge,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m6g.2xlarge,InstanceCount=2 \
--use-default-roles \
--ebs-root-volume-size 30 \
--ec2-attributes KeyName=<keyname> \
--log-uri s3://<s3-bucket-for-emr>/elasticmapreduce/


Create an EMR cluster on the console

To use the console, complete the following steps:

1. On the Amazon EMR console, create an EMR cluster, and under Amazon Glue Data Catalog settings, select Use for Hive table metadata.

2. Add configuration settings using the following code:

[
  {
    "Classification": "trino-connector-hive",
    "Properties": {
      "hive.metastore": "glue"
    }
  }
]



3. In the Steps section, add a step called Custom JAR.

4. Set the JAR location to s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar, where <region> is the region where the EMR cluster resides.

5. Set Arguments to the S3 path you uploaded earlier.


6. In the Bootstrap Actions section, choose Custom Action.

7. Set the Script location to the S3 path you uploaded.


8. Proceed to the next steps to complete the creation of the EMR cluster.

Ingest CDC data using Apache Flink CDC in Amazon EMR

The Flink CDC connector supports reading database snapshots and continues to capture updates in the configured tables. We deployed the Flink CDC connector for MySQL by downloading flink-sql-connector-mysql-cdc-2.2.1.jar (https://repo1.maven.org/maven2/com/ververica/flink-sql-connector-mysql-cdc/2.2.1/flink-sql-connector-mysql-cdc-2.2.1.jar) and putting it into the Flink library when we created the EMR cluster. The Flink CDC connector can use the Flink Hive catalog to store Flink CDC table schemas in Hive Metastore or Amazon Glue Data Catalog. In this post, we use the Data Catalog to store our Flink CDC tables.

Complete the following steps to use Flink CDC to extract RDS for MySQL databases and tables, and store metadata in Data Catalog:

1. Connect to the EMR master node via SSH.

2. Start a Flink YARN session by running the following command, providing your S3 bucket name:

flink-yarn-session -d -jm 2048 -tm 4096 -s 2 \
-D state.backend=rocksdb \
-D state.backend.incremental=true \
-D state.checkpoint-storage=filesystem \
-D state.checkpoints.dir=s3://<flink-glue-integration-bucket>/flink-checkpoints/ \
-D state.checkpoints.num-retained=10 \
-D execution.checkpointing.interval=10s \
-D execution.checkpointing.mode=EXACTLY_ONCE \
-D execution.checkpointing.externalized-checkpoint-retention=RETAIN_ON_CANCELLATION \
-D execution.checkpointing.max-concurrent-checkpoints=1


3. Run the following command to start the Flink SQL client CLI:

/usr/lib/flink/bin/sql-client.sh embedded


4. Create a Flink Hive catalog by specifying the catalog type as hive and providing your S3 bucket name:

CREATE CATALOG glue_catalog WITH (
'type' = 'hive',
'default-database' = 'default',
'hive-conf-dir' = '/etc/hive/conf.dist'
);
USE CATALOG glue_catalog;
CREATE DATABASE IF NOT EXISTS flink_cdc_db WITH ('hive.database.location-uri'= 's3://<flink-glue-integration-bucket>/flink-glue-for-hive/warehouse/');
use flink_cdc_db;


Because we configured the Amazon EMR Hive catalog to use Amazon Glue Data Catalog, all databases and tables created in the Flink Hive catalog are stored in the Data Catalog.
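
If you want to verify this, a quick check from the Flink SQL client (not part of the original steps) lists what the Glue-backed catalog currently contains:

SHOW CATALOGS;    -- includes glue_catalog alongside the default catalog
SHOW DATABASES;   -- lists flink_cdc_db stored in the Data Catalog
SHOW TABLES;      -- empty until the CDC tables are created in the next step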

5. Create a Flink CDC table, providing the hostname, username, and password of the RDS for MySQL instance you created earlier.

Note that since usernames and passwords for RDS for MySQL will be stored in the Data Catalog as table attributes, you should enable Amazon Glue database/table authorization using Amazon Lake Formation in order to protect your sensitive data.

CREATE TABLE `glue_catalog`.`flink_cdc_db`.`customer_cdc` (
`CUST_ID` double NOT NULL,
`NAME` STRING NOT NULL,
`MKTSEGMENT` STRING NOT NULL,
PRIMARY KEY (`CUST_ID`) NOT ENFORCED
) WITH (
'connector' = 'mysql-cdc',
'hostname' = '<hostname>',
'port' = '3306',
'username' = '<username>',
'password' = '<password>',
'database-name' = 'salesdb',
'table-name' = 'CUSTOMER'
);


CREATE TABLE `glue_catalog`.`flink_cdc_db`.`customer_site_cdc` (
`SITE_ID` double NOT NULL,
`CUST_ID` double NOT NULL,
`ADDRESS` STRING NOT NULL,
`CITY` STRING NOT NULL,
`STATE` STRING NOT NULL,
`COUNTRY` STRING NOT NULL,
`PHONE` STRING NOT NULL,
PRIMARY KEY (`SITE_ID`) NOT ENFORCED
) WITH (
'connector' = 'mysql-cdc',
'hostname' = '<hostname>',
'port' = '3306',
'username' = '<username>',
'password' = '<password>',
'database-name' = 'salesdb',
'table-name' = 'CUSTOMER_SITE'
);


CREATE TABLE `glue_catalog`.`flink_cdc_db`.`sales_order_all_cdc` (
`ORDER_ID` int NOT NULL,
`SITE_ID` double NOT NULL,
`ORDER_DATE` TIMESTAMP NOT NULL,
`SHIP_MODE` STRING NOT NULL
) WITH (
'connector' = 'mysql-cdc',
'hostname' = '<hostname>',
'port' = '3306',
'username' = '<username>',
'password' = '<password>',
'database-name' = 'salesdb',
'table-name' = 'SALES_ORDER_ALL',
'scan.incremental.snapshot.enabled' = 'FALSE'
);


6. Query the tables you just created:

SELECT count(O.ORDER_ID) AS ORDER_COUNT,
C.CUST_ID,
C.NAME,
C.MKTSEGMENT
FROM   customer_cdc C
JOIN customer_site_cdc CS
ON C.CUST_ID = CS.CUST_ID
JOIN sales_order_all_cdc O
ON O.SITE_ID = CS.SITE_ID
GROUP  BY C.CUST_ID,
C.NAME,
C.MKTSEGMENT;


You will get a query result as shown in the screenshot below.


Store processed data in Amazon S3 and metadata in the Data Catalog

As we ingest the relational database data from Amazon RDS for MySQL, the original data may be updated or deleted. To support data updates and deletions, we can choose a data lake technology such as Apache Iceberg or Apache Hudi to store the processed data. As mentioned earlier, Iceberg and Hudi have different catalog management mechanisms, so we show two scenarios of using Flink to read/write Iceberg and Hudi tables with metadata in Amazon Glue Data Catalog.

For formats other than Iceberg and Hudi, we use FileSystem Parquet files to show how Flink's built-in connectors use the Data Catalog.

Read/write Iceberg tables in Flink using metadata from the Glue Data Catalog

The diagram below shows the architecture of this configuration.


1. Create a Flink Iceberg catalog using Data Catalog by specifying the catalog-impl as org.apache.iceberg.aws.glue.GlueCatalog.

For more information on Iceberg's Flink and Data Catalog integration, see Glue Catalog (https://iceberg.apache.org/docs/latest/aws/#glue-catalog).

2. In the Flink SQL client CLI, run the following command, providing your S3 bucket name:

CREATE CATALOG glue_catalog_for_iceberg WITH (
'type'='iceberg',
'warehouse'='s3://<flink-glue-integration-bucket>/flink-glue-for-iceberg/warehouse/',
'catalog-impl'='org.apache.iceberg.aws.glue.GlueCatalog',
'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',
'lock-impl'='org.apache.iceberg.aws.glue.DynamoLockManager',
'lock.table'='FlinkGlue4IcebergLockTable' );


3. Create an Iceberg table to store the processed data:

USE CATALOG glue_catalog_for_iceberg;
CREATE DATABASE IF NOT EXISTS flink_glue_iceberg_db;
USE flink_glue_iceberg_db;
CREATE TABLE `glue_catalog_for_iceberg`.`flink_glue_iceberg_db`.`customer_summary` (
`CUSTOMER_ID` bigint,
`NAME` STRING,
`MKTSEGMENT` STRING,
`COUNTRY` STRING,
`ORDER_COUNT` BIGINT,
PRIMARY KEY (`CUSTOMER_ID`) NOT Enforced
)
WITH (
'format-version'='2',
'write.upsert.enabled'='true');


4. Insert the processed data into Iceberg:

INSERT INTO `glue_catalog_for_iceberg`.`flink_glue_iceberg_db`.`customer_summary`
SELECT CAST(C.CUST_ID AS BIGINT) CUST_ID,
C.NAME,
C.MKTSEGMENT,
CS.COUNTRY,
count(O.ORDER_ID) AS ORDER_COUNT
FROM   `glue_catalog`.`flink_cdc_db`.`customer_cdc` C
JOIN `glue_catalog`.`flink_cdc_db`.`customer_site_cdc` CS
ON C.CUST_ID = CS.CUST_ID
JOIN `glue_catalog`.`flink_cdc_db`.`sales_order_all_cdc` O
ON O.SITE_ID = CS.SITE_ID
GROUP  BY C.CUST_ID,
C.NAME,
C.MKTSEGMENT,
CS.COUNTRY;


Read/write Hudi tables in Flink using metadata from the Glue Data Catalog

The diagram below shows the architecture of this configuration.


Complete the following steps:

1. Create a Hudi catalog that uses the Hive catalog by specifying the mode as hms.

Because we configured Amazon EMR to use Data Catalog when we created the EMR cluster, this Hudi Hive catalog uses Data Catalog behind the scenes. For more information on Hudi's Flink and Data Catalog integration, see Creating a Catalog.

2. In the Flink SQL client CLI, run the following command, providing your S3 bucket name:

CREATE CATALOG glue_catalog_for_hudi WITH (
'type' = 'hudi',
'mode' = 'hms',
'table.external' = 'true',
'default-database' = 'default',
'hive.conf.dir' = '/etc/hive/conf.dist',
'catalog.path' = 's3://<flink-glue-integration-bucket>/flink-glue-for-hudi/warehouse/'
);


3. Create a Hudi table using Data Catalog, providing your S3 bucket name:

USE CATALOG glue_catalog_for_hudi;
CREATE DATABASE IF NOT EXISTS flink_glue_hudi_db;
use flink_glue_hudi_db;
CREATE TABLE `glue_catalog_for_hudi`.`flink_glue_hudi_db`.`customer_summary` (
`CUSTOMER_ID` bigint,
`NAME` STRING,
`MKTSEGMENT` STRING,
`COUNTRY` STRING,
`ORDER_COUNT` BIGINT,
PRIMARY KEY (`CUSTOMER_ID`) NOT Enforced
)
WITH (
'connector' = 'hudi',
'write.tasks' = '4',
'path' = 's3://<flink-glue-integration-bucket>/flink-glue-for-hudi/warehouse/customer_summary',
'table.type' = 'COPY_ON_WRITE',
'read.streaming.enabled' = 'true',
'read.streaming.check-interval' = '1'
);


4. Insert the processed data into Hudi:

INSERT INTO `glue_catalog_for_hudi`.`flink_glue_hudi_db`.`customer_summary`
SELECT CAST(C.CUST_ID AS BIGINT) CUST_ID,
C.NAME,
C.MKTSEGMENT,
CS.COUNTRY,
count(O.ORDER_ID) AS ORDER_COUNT
FROM   `glue_catalog`.`flink_cdc_db`.`customer_cdc` C
JOIN `glue_catalog`.`flink_cdc_db`.`customer_site_cdc` CS
ON C.CUST_ID = CS.CUST_ID
JOIN `glue_catalog`.`flink_cdc_db`.`sales_order_all_cdc` O
ON O.SITE_ID = CS.SITE_ID
GROUP  BY C.CUST_ID,
C.NAME,
C.MKTSEGMENT,
CS.COUNTRY;


Read/write other storage formats in Flink using metadata from the Glue Data Catalog

The diagram below shows the architecture of this configuration.


We already created the Flink Hive catalog in a previous step, so we will reuse it.

1. In the Flink SQL client CLI, run the following command:

USE CATALOG glue_catalog;
CREATE DATABASE IF NOT EXISTS flink_hive_parquet_db;
use flink_hive_parquet_db;


We change the SQL dialect to Hive to create a table using Hive syntax.

2. Create a table using the following SQL, providing your S3 bucket name:

SET table.sql-dialect=hive;


CREATE TABLE `customer_summary` (
`CUSTOMER_ID` bigint,
`NAME` STRING,
`MKTSEGMENT` STRING,
`COUNTRY` STRING,
`ORDER_COUNT` BIGINT
)
STORED AS parquet
LOCATION 's3://<flink-glue-integration-bucket>/flink-glue-for-hive-parquet/warehouse/customer_summary';


Because Parquet files do not support updated rows, we cannot insert the CDC data directly into this table. However, we can use data from Iceberg or Hudi.

3. Use the following code to query the Iceberg table and insert data into the Parquet table:

SET table.sql-dialect=default;
SET execution.runtime-mode = batch;
INSERT INTO `glue_catalog`.`flink_hive_parquet_db`.`customer_summary`
SELECT * from `glue_catalog_for_iceberg`.`flink_glue_iceberg_db`.`customer_summary`;


Confirm that all table metadata is stored in the Data Catalog

You can navigate to the Amazon Glue console to confirm that all tables are stored in the Data Catalog.

1. On the Amazon Glue console, choose Databases in the navigation pane to list all the databases we created.


2. Open a database and verify that all tables exist in the database.


Use data for business analytics with Athena or Amazon EMR Trino

You can use Athena or Amazon EMR Trino to access the resulting data.

Query data with Athena

To access data using Athena, complete the following steps:

1. Open the Athena Query Editor.

2. For Database, select flink_glue_iceberg_db.

You should see the customer_summary table listed.

3. Run the following SQL script to query the Iceberg result table:

select * from customer_summary order by order_count desc limit 10


The query results will look like the following screenshot.


4. For the Hudi table, change the Database to flink_glue_hudi_db and run the same SQL query.


5. For Parquet tables, change Database to flink_hive_parquet_db and run the same SQL query.


Query data with Amazon EMR Trino

To access Iceberg using Amazon EMR Trino, connect to the EMR master node via SSH.

1. Run the following command to start the Trino CLI:

trino-cli --catalog iceberg

Amazon EMR Trino can now query tables in the Amazon Glue Data Catalog.

2. Run the following command to query the results table:

show schemas;
use flink_glue_iceberg_db;
show tables;
select * from customer_summary order by order_count desc limit 10;


The query result looks like the screenshot below.


3. Exit the Trino CLI.

4. Start the Trino CLI using the hive catalog to query the Hudi tables:

trino-cli --catalog hive

5. Run the following command to query the Hudi table:

show schemas;
use flink_glue_hudi_db;
show tables;
select * from customer_summary order by order_count desc limit 10;


Update and delete source records in Amazon RDS for MySQL and verify corresponding changes in data lake tables

We can update and delete some records in the RDS for MySQL database and then verify that the changes are reflected in the Iceberg and Hudi tables.

1. Connect to the RDS for MySQL database and run the following SQL:

update CUSTOMER set NAME = 'updated_name' where CUST_ID=7;


delete from CUSTOMER where CUST_ID=11;


2. Use Athena or Amazon EMR Trino to query the customer_summary table.

Updated and deleted records are reflected in the Iceberg and Hudi tables.
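
For example, a spot check like the following (run in Athena or the Trino CLI against the table names used in this post) should show the updated name for the customer with ID 7 and no row for ID 11:

select * from customer_summary where customer_id in (7, 11);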


Clean up

After completing this exercise, complete the following steps to delete your resources and stop incurring charges:

1. Delete the RDS for MySQL database.

2. Delete the EMR cluster.

3. Delete the database and tables created in Data Catalog.

4. Delete the files in Amazon S3.

Summary

This post shows you how to integrate Apache Flink in Amazon EMR with Amazon Glue Data Catalog. You can use a Flink SQL connector to read/write data in different stores, such as Kafka, CDC, HBase, Amazon S3, Iceberg, or Hudi, and store the metadata in the Data Catalog. The Flink Table API has the same connector and catalog implementation mechanism. In a single session, we can use multiple catalog instances of different types (such as IcebergCatalog and HiveCatalog) and then use them interchangeably in queries. You can also write code with the Flink Table API to develop the same solution that integrates Flink and the Data Catalog.
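
For instance, a minimal sketch of such a cross-catalog query in the Flink SQL client, reusing the catalogs and tables created earlier in this post, could look like the following:

SET execution.runtime-mode = batch;
SELECT p.CUSTOMER_ID, p.ORDER_COUNT, i.ORDER_COUNT AS iceberg_order_count
FROM `glue_catalog`.`flink_hive_parquet_db`.`customer_summary` AS p
JOIN `glue_catalog_for_iceberg`.`flink_glue_iceberg_db`.`customer_summary` AS i
ON p.CUSTOMER_ID = i.CUSTOMER_ID;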

In our solution, we used the RDS for MySQL binary logs directly with Flink CDC. You can also consume the binary logs with MySQL Debezium using Amazon MSK Connect and store the data in Amazon Managed Streaming for Apache Kafka (Amazon MSK). For more information, see Create a low-latency source-to-data lake pipeline using Amazon MSK Connect, Apache Flink, and Apache Hudi (https://aws.amazon.com/blogs/big-data/create-a-low-latency-source-to-data-lake-pipeline-using-amazon-msk-connect-apache-flink-and-apache-hudi/).

With Amazon EMR Flink's unified batch and streaming data processing capabilities, you can ingest and process data through a single compute engine. By integrating Apache Iceberg and Hudi into Amazon EMR, you can build an evolvable and scalable data lake. With Amazon Glue Data Catalog, you can centrally manage all enterprise data catalogs and consume data easily.

Follow the steps in this post to build a unified batch and stream processing solution using Amazon EMR Flink and Amazon Glue Data Catalog. If you have any questions, please leave a comment.

About the authors


Jianwei Li

Senior Analytics Specialist TAM. He advises Amazon Enterprise Support customers on designing and building modern data platforms.


Samrat Deb

Software Development Engineer at Amazon EMR. In his free time, he enjoys exploring new places, different cultures, and cuisines.


Prabhu Joseph Raj 

is a Senior Software Development Engineer at Amazon EMR. He focuses on leading teams building solutions in Apache Hadoop and Apache Flink. In his free time, Prabhu enjoys spending time with his family.
