Use a stub to store Spark StageInfo #1525

amahussein · 2025-02-04T16:58:49Z

Signed-off-by: Ahmed Hussein (amahussein) [email protected]

This commit uses a smaller class, StageInfoStub, to store Spark's StageInfo. The StageInfo class is common to all Spark implementations, but its constructor takes additional fields in some versions.

Currently we only use a subset of the class fields. The remaining fields are overhead or redundant storage, especially the accumulables and taskMetrics stored for each stage.
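The stub idea can be illustrated with a minimal sketch. It does not reproduce the PR's actual StageInfoStub: `FullStageInfo` is a hypothetical stand-in for Spark's StageInfo, `StageInfoStubSketch` is a hypothetical name, and the set of retained fields is an assumption chosen for illustration.

```scala
// Hypothetical stand-in for Spark's StageInfo. The real class has more fields,
// and its constructor differs between Spark versions.
final case class FullStageInfo(
    stageId: Int,
    attemptId: Int,
    name: String,
    numTasks: Int,
    details: String,
    submissionTime: Option[Long],
    completionTime: Option[Long],
    failureReason: Option[String],
    accumulables: Map[Long, String], // potentially large, dropped by the stub
    taskMetrics: Array[Byte])        // placeholder for heavyweight metrics

// The stub keeps only the fields the tool is assumed to read.
final case class StageInfoStubSketch(
    stageId: Int,
    attemptId: Int,
    name: String,
    numTasks: Int,
    submissionTime: Option[Long],
    completionTime: Option[Long],
    failureReason: Option[String])

object StageInfoStubSketch {
  // Copies the retained fields and drops the rest, so the large collections
  // are no longer referenced once the original object is released.
  def fromStageInfo(info: FullStageInfo): StageInfoStubSketch =
    StageInfoStubSketch(info.stageId, info.attemptId, info.name, info.numTasks,
      info.submissionTime, info.completionTime, info.failureReason)
}
```

Because the stub exposes one fixed constructor, callers no longer depend on the version-specific shape of the original class.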

To evaluate the memory optimization, a new Checkpoint mechanism was added to allow gathering information at separate stages of the execution.
The checkpoint design and implementation can be further improved and extended to build a performance profile to compare different tradeoffs.

This pull request introduces a new runtime checkpointing feature for performance benchmarking and debugging in the RAPIDS tools. The main changes include adding new classes for runtime checkpoints, updating build properties, and modifying existing classes to integrate the new checkpointing functionality.

Memory evaluation:

To enable the runtime injection, set the build property benchmarks.checkpoints to dev, either by changing the pom.xml file or by passing the property as an argument to the mvn command.
With this property set, the free memory is dumped after the eventlog is processed.
An initial test on a sample eventlog showed a saving of approximately 7 MB. This result depends on the number of completed stages in the eventlog and on the accumulables associated with each one.

Code Changes

Main core changes

core/src/main/scala/org/apache/spark/sql/rapids/tool/util/stubs/StageInfoStub.scala: Added a new class StageInfoStub to provide a consistent interface for StageInfo across different Spark versions.
core/src/main/scala/org/apache/spark/sql/rapids/tool/store/StageModel.scala: Updated StageModel to use StageInfoStub for compatibility with different Spark versions.

Build and Configuration Updates:

core/pom.xml: Added a new property benchmarks.checkpoints to manage the checkpointing feature.
core/src/main/resources/configs/build.properties: Updated build properties to include benchmarks.checkpoints.

Runtime Checkpointing Feature:

core/src/main/scala/org/apache/spark/rapids/tool/benchmarks/DevRuntimeCheckpoint.scala: Added a new class DevRuntimeCheckpoint to insert memory markers and print memory information during runtime for performance metrics.
core/src/main/scala/org/apache/spark/rapids/tool/benchmarks/NoOpRuntimeCheckpoint.scala: Added a new class NoOpRuntimeCheckpoint as a default no-operation implementation for checkpoints.
core/src/main/scala/org/apache/spark/rapids/tool/benchmarks/RuntimeCheckpointTrait.scala: Introduced a new trait RuntimeCheckpointTrait defining the API for inserting runtime checkpoints.
core/src/main/scala/org/apache/spark/rapids/tool/benchmarks/RuntimeInjector.scala: Added a new object RuntimeInjector to manage and insert runtime checkpoints based on build properties.
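The relationship between the trait, its two implementations, and the injector can be sketched roughly as follows. All names ending in `Sketch`, the `insertMemoryMarker` signature, and the `select` method are hypothetical; the only details taken from the PR are that the value dev enables the checkpoints and that the no-op implementation is the default.

```scala
// Hypothetical API: a single hook for inserting a labelled memory marker.
trait CheckpointSketch {
  def insertMemoryMarker(label: String): Unit
}

// Default implementation: does nothing, so regular runs pay no cost.
class NoOpCheckpointSketch extends CheckpointSketch {
  override def insertMemoryMarker(label: String): Unit = ()
}

// Dev implementation: prints the free heap at the point of the marker.
class DevCheckpointSketch extends CheckpointSketch {
  override def insertMemoryMarker(label: String): Unit = {
    val freeMb = Runtime.getRuntime.freeMemory() / (1024L * 1024L)
    println(s"[$label] free heap: $freeMb MB")
  }
}

object InjectorSketch {
  // Chooses the implementation from the value of the build property.
  // A null or unknown value falls back to the no-op implementation.
  def select(checkpointsProperty: String): CheckpointSketch = {
    if ("dev".equalsIgnoreCase(checkpointsProperty)) {
      new DevCheckpointSketch
    } else { // loads the no-op implementation by default
      new NoOpCheckpointSketch
    }
  }
}
```

A caller would resolve the implementation once, for example `val cp = InjectorSketch.select("dev")`, and then call `cp.insertMemoryMarker("after event processing")` at the points of interest.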

Integration of Checkpoints:

core/src/main/scala/org/apache/spark/sql/rapids/tool/AppBase.scala: Integrated RuntimeInjector to insert a memory marker after processing events.
core/src/main/scala/org/apache/spark/sql/rapids/tool/util/RuntimeUtil.scala: Added a method getJVMHeapInfo to retrieve JVM heap information.
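As a rough illustration of what retrieving JVM heap information can look like, the sketch below reads the figures exposed by `java.lang.Runtime`. The object name, method name, and returned map keys are assumptions and may differ from the PR's getJVMHeapInfo.

```scala
object HeapInfoSketch {
  // Returns heap figures in bytes, as reported by java.lang.Runtime.
  def jvmHeapInfo(): Map[String, Long] = {
    val rt = Runtime.getRuntime
    // Read each value once so the derived "used" figure is consistent
    // with the "total" and "free" entries in the same map.
    val total = rt.totalMemory()
    val free = rt.freeMemory()
    Map(
      "max" -> rt.maxMemory(), // upper bound the heap may grow to
      "total" -> total,        // heap currently reserved by the JVM
      "free" -> free,          // unused part of the reserved heap
      "used" -> (total - free))
  }
}
```

Comparing the "free" or "used" value at the same marker with and without an optimization gives a coarse estimate of the memory saved. Garbage collection timing affects these numbers, so they are indicative only.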

Fixes NVIDIA#1524

parthosa

Thanks @amahussein. A minor typo. This framework for logging memory used looks great.

core/src/main/scala/org/apache/spark/rapids/tool/benchmarks/DevRuntimeCheckpoint.scala

core/src/main/scala/org/apache/spark/sql/rapids/tool/AppBase.scala

amahussein

Thanks @parthosa

core/src/main/scala/org/apache/spark/sql/rapids/tool/AppBase.scala

core/src/main/scala/org/apache/spark/rapids/tool/benchmarks/DevRuntimeCheckpoint.scala

parthosa

Thanks @amahussein. LGTME.

sayedbilalbari · 2025-02-05T18:51:29Z

core/src/main/scala/org/apache/spark/sql/rapids/tool/store/StageModel.scala

-  private def initStageInfo(newStageInfo: StageInfo): StageInfo = {
-    newStageInfo
+  private def initStageInfo(newStageInfo: StageInfo): StageInfoStub = {
+    StageInfoStub.fromStageInfo(newStageInfo)


@amahussein Currently we reassign the StageInfo object and update the StageModel class with the incoming StageInfo object.
Now that we use a Stub and create a new object, can we reuse the existing Stub object when StageModel is updated and just update its variables? Currently we allocate a new Stub in all cases.

amahussein

Thanks @sayedbilalbari

amahussein · 2025-02-05T21:36:46Z

core/src/main/scala/org/apache/spark/sql/rapids/tool/store/StageModel.scala

-  private def initStageInfo(newStageInfo: StageInfo): StageInfo = {
-    newStageInfo
+  private def initStageInfo(newStageInfo: StageInfo): StageInfoStub = {
+    StageInfoStub.fromStageInfo(newStageInfo)


Hmm, yes, it is possible to update only the fields that changed.
The reasoning for allocating a new object instead:

This update should happen only once, when the stage is completed, so it is not a frequent event.

Updating individual fields could lead to bugs. When this object is extended in the future, developers would have to make sure they handle the fields correctly (which ones can be updated and which cannot).

Allocating a new object in that case keeps the code simpler, and easier to maintain going forward.
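The tradeoff discussed above can be sketched as follows. The types are hypothetical, trimmed-down stand-ins rather than the PR's StageModel and StageInfoStub; the point is only that the model swaps in a fresh immutable stub on update instead of mutating selected fields.

```scala
// Hypothetical, trimmed-down types for illustration only.
final case class IncomingStageInfo(
    stageId: Int,
    completionTime: Option[Long],
    failureReason: Option[String])

final case class StubSketch(
    stageId: Int,
    completionTime: Option[Long],
    failureReason: Option[String])

class StageModelSketch(initial: IncomingStageInfo) {
  // The only mutable state is the reference to the current stub.
  private var stub: StubSketch = toStub(initial)

  private def toStub(info: IncomingStageInfo): StubSketch =
    StubSketch(info.stageId, info.completionTime, info.failureReason)

  // Called when the stage completes: replace the whole stub. Fields added
  // to the stub later are picked up by toStub without touching this method.
  def updateInfo(info: IncomingStageInfo): Unit = {
    stub = toStub(info)
  }

  def completionTime: Option[Long] = stub.completionTime
}
```

With this shape there is no per-field update logic to keep in sync, at the cost of one extra small allocation per stage update.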

cindyyuanjiang

Thanks @amahussein! A very minor nit.

cindyyuanjiang · 2025-02-06T08:49:57Z

core/src/main/scala/org/apache/spark/rapids/tool/benchmarks/RuntimeInjector.scala

+    } else { // loads the noOp implementation by default
+      new NoOpRuntimeCheckpoint
+    }
+  }


nit: new line between 2 defs

amahussein added the core_tools and performance labels Feb 4, 2025

amahussein requested review from parthosa and sayedbilalbari February 4, 2025 16:58

amahussein self-assigned this Feb 4, 2025

amahussein requested a review from cindyyuanjiang February 4, 2025 17:04

parthosa reviewed Feb 5, 2025

core/src/main/scala/org/apache/spark/rapids/tool/benchmarks/DevRuntimeCheckpoint.scala (outdated)

core/src/main/scala/org/apache/spark/sql/rapids/tool/AppBase.scala

Fix typo in DevRuntimeCheckpoint

8f2122a

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

amahussein commented Feb 5, 2025

core/src/main/scala/org/apache/spark/sql/rapids/tool/AppBase.scala

core/src/main/scala/org/apache/spark/rapids/tool/benchmarks/DevRuntimeCheckpoint.scala (outdated)

parthosa approved these changes Feb 5, 2025

sayedbilalbari reviewed Feb 5, 2025

amahussein commented Feb 5, 2025

cindyyuanjiang approved these changes Feb 6, 2025
amahussein merged commit 14255f4 into NVIDIA:dev Feb 6, 2025
13 checks passed

amahussein deleted the rapids-tools-1524 branch February 6, 2025 14:19
