Let's first look at the kinds of memory Spark uses:
1. Storage memory: holds cached/persisted data; its size is predictable.
2. Shuffle (execution) memory: used by computations such as join and groupBy; its size is not predictable.
Before Spark 1.6 these two regions were managed statically; from Spark 1.6 on they share a unified pool and borrow from each other dynamically, with storage initially getting 0.5 of the pool by default.
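The split described above maps to two settings in `spark-defaults.conf`. A minimal sketch (these are the stock Spark 2.x defaults, shown here only to make the 0.5 figure concrete):

```
# Fraction of (heap - 300 MB reserved) shared by storage and execution.
spark.memory.fraction        0.6
# Share of that unified pool initially reserved for storage — the 0.5 default
# mentioned above; execution can borrow from it when storage is underused.
spark.memory.storageFraction 0.5
```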
Tip
Try to avoid writing raw RDD code in production jobs (it performs poorly compared to DataFrame/Dataset).
RDD demonstration (spark version 2.1.1)
We convert it to an RDD, run the job, and see how much memory it occupies.
We can also open the Executors tab in the Spark UI to see the memory usage.
It shows red because I wrote a while loop (to keep the application alive).
RDD optimization
See the official website
https://spark.apache.org/docs/2.4.5/configuration.html#compression-and-serialization
We use Kryo (it only benefits RDDs; Datasets use their own encoders).
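As a sketch, enabling Kryo looks like this in Scala (the `Event` case class here is a hypothetical stand-in for whatever record type your RDD holds):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical record type — replace with the classes your job actually ships.
case class Event(id: Long, name: String)

val conf = new SparkConf()
  .setAppName("kryo-demo")
  // Switch from the default Java serializer to Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write a small numeric ID
  // instead of the full class name with every record.
  .registerKryoClasses(Array(classOf[Event]))

val spark = SparkSession.builder().config(conf).getOrCreate()
```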
We also need to look at the cache (storage) levels available for RDDs:
https://spark.apache.org/docs/2.4.5/rdd-programming-guide.html#which-storage-level-to-choose
Using a serialized cache level,
the cached size dropped from 1.7 GB straight down to 270 MB. That is a big win!
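The serialized cache level used above can be sketched like this (assuming an existing RDD named `rdd`; with Kryo enabled, the serialized bytes are Kryo-encoded):

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY_SER stores each partition as one serialized byte array
// instead of deserialized Java objects — smaller, at some CPU cost.
val cached = rdd.persist(StorageLevel.MEMORY_ONLY_SER)
cached.count()  // materialize the cache, then check the size in the Storage tab
```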
DataFrame and Dataset demonstration
See the official website
https://spark.apache.org/docs/2.4.5/sql-getting-started.html#creating-datasets
A Dataset serializes with its own generated encoder rather than Java or Kryo serialization.
Memory size: 34.2 MB
We can also apply a serialized cache level here, but the change is small:
33.9 MB after optimization.
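A minimal Dataset sketch following the linked getting-started guide (the `Person` case class is the example type from that page); the encoder already stores rows in a compact binary format, which is why serialized caching barely helps:

```scala
import spark.implicits._  // assumes an existing SparkSession named `spark`

case class Person(name: String, age: Long)

// toDS() derives an encoder from the case class; rows are kept in
// Tungsten binary format, so the cache is compact out of the box.
val ds = Seq(Person("Andy", 32)).toDS()
ds.cache().count()  // materialize, then compare sizes in the Storage tab
```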