
(WIP) Update Spark Optimization Practices for latest releases #29

Open

secretazianman opened this issue Jun 11, 2024 · 1 comment

@secretazianman

A few notable changes:

Managed Scaling

  • As of EMR 6.11.0 (Hadoop 3.3.3), EMR scale-down is no longer Spark shuffle- or cache-aware with default settings
    • Set yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-applications = true to restore the old behavior
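
A minimal launch-time sketch, assuming boto3; the cluster name, instance sizing, and IAM roles are placeholders, and only the yarn-site override comes from the note above:

```python
import boto3

emr = boto3.client("emr")

# Placeholder cluster definition; the yarn-site Configuration entry is the point.
response = emr.run_job_flow(
    Name="spark-cluster",                      # hypothetical name
    ReleaseLabel="emr-6.11.0",
    Applications=[{"Name": "Spark"}],
    Configurations=[
        {
            "Classification": "yarn-site",
            "Properties": {
                # Wait for running applications before decommissioning nodes
                "yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-applications": "true"
            },
        }
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",     # placeholder sizing
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```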

Spot Instances

  • Spot Instances + Spark Caching + Intermediate Tables
    • A lost RDD block results in the entire cache being recomputed, which can be expensive. Without caching, lost Spark data can be recomputed incrementally, so it may be better not to cache
    • Consider using On-Demand Instances only if caching is necessary
    • Consider storing intermediate tables in HDFS or S3 instead (see the sketch after this list); weigh the cost of reading/writing/storing intermediate data against re-computing it
    • Caching forces Spark evaluation at that point in time, potentially losing improvements from the SQL optimizer that occur when the DAG is analyzed in its entirety
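
A minimal PySpark sketch of the intermediate-table approach; the S3 paths and the aggregation are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intermediate-to-s3").getOrCreate()

events = spark.read.orc("s3://my-bucket/input/events/")   # hypothetical input

# Expensive intermediate result we would otherwise .cache()
intermediate = events.groupBy("user_id").count()

# Write once, then read back: a lost Spot node costs a re-read, not a recompute
intermediate.write.mode("overwrite").orc("s3://my-bucket/tmp/intermediate/")
intermediate = spark.read.orc("s3://my-bucket/tmp/intermediate/")

# Downstream stages read from S3 instead of replaying the full lineage
result = intermediate.filter("`count` > 10")
result.write.mode("overwrite").orc("s3://my-bucket/output/")
```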

AQE

  • Since EMR 6.6/Spark 3.2, default settings force AQE to run in a legacy mode, "to avoid performance regression when enabling adaptive query execution"

    • Enable AQE by setting spark.sql.adaptive.coalescePartitions.parallelismFirst = false (see the configuration sketch after this list)
    • The query plan should then show "AQEShuffleRead coalesced" when it is working
  • Optimizing AQE

    • Set spark.sql.adaptive.coalescePartitions.initialPartitionNum to a large number, such as 10x what you might set spark.sql.shuffle.partitions to. This gives AQE initial partitions small enough to optimize against the advisoryPartitionSizeInBytes setting.
    • Set spark.sql.adaptive.advisoryPartitionSizeInBytes by analyzing the resulting task memory pressure on the executor; consider increasing the value if memory is underutilized
    • Optimization Example
      • Environment Setup
        • Instance Choice: r6.4xlarge
        • Core Units: 64 units
        • Task Units: 500 units
        • Spark Executor Memory: 32GB
        • Spark Executor Cores: 5
        • spark.sql.adaptive.coalescePartitions.initialPartitionNum: 100,000
        • Dataset: S3 - 523GB - 2,700 files - Orc+Snappy - 4,584,646,650 rows
        • Spark Query performs wide joins
    • Spark Shuffle = 10,000 and AQE Disabled
      • TODO
    • AQE Enabled and advisoryPartitionSizeInBytes=64MB
      • TODO
    • AQE Enabled and advisoryPartitionSizeInBytes=256MB
      • TODO
    • AQE Enabled and advisoryPartitionSizeInBytes=512MB
      • TODO
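
A minimal session-level sketch, assuming PySpark, wiring the settings above together; the dataset path is hypothetical and the numbers mirror the example environment rather than tuned values:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-tuning")
    .config("spark.sql.adaptive.enabled", "true")
    # Opt out of the legacy coalescing behavior (EMR 6.6+/Spark 3.2+ default)
    .config("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false")
    # Start with many small partitions so AQE has room to coalesce
    .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "100000")
    # Target post-coalesce partition size; raise it if executor memory sits idle
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "256MB")
    .getOrCreate()
)

df = spark.read.orc("s3://my-bucket/orc-dataset/")        # hypothetical dataset
df.groupBy("key").count().explain()  # plan should show "AQEShuffleRead coalesced"
```
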
@mattliemAWS
Contributor

Thanks! Your notes around Spark + Spot are super useful. Let me figure out a way to incorporate some of these recommendations.
