Handling small files with Spark SQL

Our production cluster has only 7 DataNodes, and each DataNode has a file-count threshold of 50w (500,000) blocks, so the whole cluster can hold 7 * 50w = 350w blocks in total, i.e. 350w / 3 replicas ≈ 120w usable blocks!
A table partitioned by time and kept for 10 years comes to roughly 10 * 365 * 12 ≈ 4.4w blocks, which means the cluster has room for only about 25 such tables.
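A quick sanity check of that arithmetic (a sketch; reading the factor of 12 as roughly 12 blocks per daily partition is an assumption, not something the original figures state explicitly):

val usableBlocks = 7L * 500000 / 3     // 7 DataNodes * 50w blocks each, / 3 replicas ≈ 116w
val blocksPerTable = 10L * 365 * 12    // 10 years of daily partitions, ~12 blocks per partition ≈ 4.4w
println(usableBlocks / blocksPerTable) // ≈ 26, i.e. room for only ~25 such tables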
Under normal circumstances the distribution is:
10000 * 20,
1000 * 200,
100 * 2000,
10 * 20000.
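One way to read those figures (an assumption: file size in MB times file count, for the same total data volume) is that the data stays constant while the NameNode metadata explodes, since with a 128 MB HDFS block size every file costs at least one block entry:

val blockMB = 128
val mixes = Seq((10000, 20), (1000, 200), (100, 2000), (10, 20000)) // (size in MB, file count)
for ((sizeMB, count) <- mixes) {
  // every file occupies at least one block entry in the NameNode
  val blocks = count.toLong * math.max(1, math.ceil(sizeMB.toDouble / blockMB).toInt)
  println(f"$count%6d files of $sizeMB%5d MB -> $blocks%6d block entries")
}
// 20 files of 10000 MB need ~1580 entries; 20000 files of 10 MB need 20000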
In actual production there will be lots of small files taking up cluster resources, a headache that must be dealt with properly, as follows:
Method 1: use the REPARTITION hint (Spark 2.4 and later)

spark.sql("create table table1 as select /*+ REPARTITION(4) */ * from table_1 where age >18 ")

Method 2: use Hive's default merge mechanism (with parallelism configured)

insert overwrite table table_1 select * from table_1;
-- for a partitioned table
insert overwrite table table_1 partition (day_time='2019-08-01') select * from table_1 where day_time = '2019-08-01';
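For reference, Hive's automatic merging of small output files is governed by settings like the following (a sketch with commonly cited default values; verify against your Hive version):

SET hive.merge.mapfiles=true;               -- merge small files produced by map-only jobs
SET hive.merge.mapredfiles=true;            -- also merge after map-reduce jobs (off by default)
SET hive.merge.size.per.task=256000000;     -- target size (~256 MB) of each merged file
SET hive.merge.smallfiles.avgsize=16000000; -- merge when the average output file is below ~16 MB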

The actual situation:
For about 100,000,000 rows (roughly 40 GB of data), on the current cluster:
repartition(4) takes about 25 min;
repartition(10) takes about 10 min.
If repartition(4) is used and a workflow computes 3 wide tables in total, the extra cost is 25 * 3 = 75 min.
If repartition(10) is used for the same 3 wide tables, the extra cost is 10 * 3 = 30 min.
In other words, using repartition to reduce small files is not ideal; in practice you still need to write a separate program to compact the small files and run it as a scheduled task once a month!
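A minimal sketch of such a compaction job (assumptions: Spark with Hive support; the table name, file format, and target file count are all illustrative). It rewrites the table into a staging table with a small fixed number of files, because in Spark overwriting a table that is also being read from fails; the compacted table would then be swapped in:

import org.apache.spark.sql.SparkSession

object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-small-files")
      .enableHiveSupport()
      .getOrCreate()

    val table = "table_1"                 // illustrative table name
    spark.table(table)
      .repartition(10)                    // ~10 output files, as in the timings above
      .write
      .mode("overwrite")
      .format("parquet")                  // assumed storage format
      .saveAsTable(table + "_compacted")  // then swap: drop the old table, rename _compacted

    spark.stop()
  }
}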

Source: blog.csdn.net/lhxsir/article/details/99588064