Executing the TPC-DS benchmark on Hive/Spark (PARQUET format)

In the previous article, "Running and Executing TPC-DS Benchmark Tests on Hive/Spark (ORC and TEXT Formats)", we showed how to use hive-testbench to run TPC-DS benchmark tests on Hive/Spark, and we noted that the project does not support the Parquet format.

To generate test data in Parquet format, we need a different tool. This article uses another open-source project, https://github.com/kcheeeung/hive-benchmark , which closely mirrors hive-testbench and is operated in much the same way. If you are familiar with hive-testbench, this tool should be easy to pick up.
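Because the project mirrors hive-testbench, its workflow can be sketched along hive-testbench's lines. The script names and arguments below (`tpcds-build.sh`, `tpcds-setup.sh`, the `FORMAT SCALE` convention) are borrowed from hive-testbench and are assumptions here; the actual scripts in hive-benchmark may be named differently.

```shell
#!/bin/sh
# Hypothetical sketch of the data-generation workflow, modeled on
# hive-testbench's conventions; the script names are assumptions and
# are not confirmed against the hive-benchmark repository.

# Compose the setup command for a given file format and scale factor (GB).
setup_cmd() {
  echo "./tpcds-setup.sh $1 $2"
}

# Typical usage on the EMR master node would then look like:
#   git clone https://github.com/kcheeeung/hive-benchmark.git
#   cd hive-benchmark
#   ./tpcds-build.sh                # build the TPC-DS data generator
#   ./tpcds-setup.sh parquet 100    # generate 100 GB of Parquet test data
setup_cmd parquet 100
```

In hive-testbench, the second argument is the scale factor in gigabytes; if hive-benchmark follows the same convention, 100 above would generate a 100 GB dataset.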

Note: the Hive/Spark environment used in this article is AWS EMR 6.11. The steps must be performed on the EMR master node, because the scripts rely on command-line tools such as hdfs and beeline. In addition, the Glue Data Catalog must not be enabled on the EMR cluster; otherwise the scripts fail with an error during execution.
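One way to check this condition up front is to look for the Glue metastore client factory in hive-site.xml: when the Glue Data Catalog is enabled, EMR sets `hive.metastore.client.factory.class` to the AWS Glue factory class. A minimal sketch (the helper function and messages are our own, not part of EMR or the benchmark project):

```shell
#!/bin/sh
# Detect whether this EMR cluster is configured to use the Glue Data
# Catalog by inspecting hive-site.xml. The factory class name below is
# the value EMR sets when Glue is enabled; the helper itself is a sketch.

HIVE_SITE="${1:-/etc/hive/conf/hive-site.xml}"

glue_enabled() {
  # True (exit 0) if the Glue metastore client factory is configured.
  grep -q "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" "$1" 2>/dev/null
}

if glue_enabled "$HIVE_SITE"; then
  echo "Glue Data Catalog is enabled; the generation scripts will fail."
else
  echo "Local Hive metastore in use; OK to proceed."
fi
```

Run it on the master node before starting data generation; if Glue is enabled, recreate or reconfigure the cluster to use the local Hive metastore first.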

You can first create the database and generate the test data on an EMR cluster without the Glue Data Catalog enabled, and then …


Origin blog.csdn.net/bluishglc/article/details/132306820