A lost RDD block results in the entire cache being recomputed, which can be expensive. Without caching, lost Spark data can be recomputed incrementally, so it may be better not to cache.
Consider using On-Demand instances only if caching is necessary, since a reclaimed Spot instance loses its cached blocks.
Consider storing intermediate tables in HDFS or S3 instead, weighing the cost of reading, writing, and storing the intermediate data against the cost of recomputing it (see the sketch after this list).
Caching causes Spark to evaluate the plan at that point in time, potentially losing SQL-optimizer improvements that only occur when the DAG is analyzed in its entirety.
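
As a rough sketch of the S3 alternative (the bucket, paths, and column names here are hypothetical), an expensive intermediate result can be written out once and read back, so a lost block is re-read from S3 rather than replayed from the full lineage:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intermediate-to-s3").getOrCreate()

# Expensive intermediate result we might otherwise .cache()
intermediate = (
    spark.read.parquet("s3://my-bucket/raw/events/")   # hypothetical input
    .filter("event_type = 'purchase'")
    .groupBy("user_id")
    .count()
)

# Materialize it once in S3...
intermediate.write.mode("overwrite").parquet("s3://my-bucket/tmp/purchases_by_user/")

# ...then read it back for downstream stages. A lost block is re-read
# from S3 instead of being recomputed from the full lineage.
purchases_by_user = spark.read.parquet("s3://my-bucket/tmp/purchases_by_user/")
```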
AQE
Since EMR 6.6 / Spark 3.2, the default settings force AQE's partition coalescing to run in a legacy mode, "to avoid performance regression when enabling adaptive query execution".
Enable the non-legacy behavior by setting spark.sql.adaptive.coalescePartitions.parallelismFirst = false (AQE itself is already on by default via spark.sql.adaptive.enabled).
If it is working, the query plan should now show "AQEShuffleRead coalesced".
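
A minimal sketch of setting this on a SparkSession (the same keys can also go in spark-defaults.conf or an EMR configuration classification):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-coalesce")
    # AQE is on by default in Spark 3.2+; set explicitly for clarity
    .config("spark.sql.adaptive.enabled", "true")
    # Disable the legacy "parallelism first" behavior so coalescing
    # respects spark.sql.adaptive.advisoryPartitionSizeInBytes
    .config("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false")
    .getOrCreate()
)

# After a shuffle, df.explain() output should contain "AQEShuffleRead coalesced"
```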
Optimizing AQE
Set spark.sql.adaptive.coalescePartitions.initialPartitionNum to a large number, such as 10x what you would set spark.sql.shuffle.partitions to. This gives AQE initial partitions small enough to coalesce toward the advisoryPartitionSizeInBytes target.
Tune spark.sql.adaptive.advisoryPartitionSizeInBytes by analyzing the resulting task memory pressure on the executors; consider increasing the value if memory is underutilized.
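
Putting the two knobs together, a hedged sketch (the concrete numbers are illustrative, not recommendations):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-tuning")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false")
    # Start with many small partitions so AQE has room to coalesce;
    # roughly 10x a typical spark.sql.shuffle.partitions value
    .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "2000")
    # Target size per coalesced partition; raise it if executor memory
    # is underutilized after checking task-level metrics
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "256m")
    .getOrCreate()
)
```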
A few notable changes:
- Managed Scaling
- Spot Instances
- AQE