Use a stub to store Spark StageInfo #1525

Merged
merged 2 commits into from
Feb 6, 2025

amahussein
Collaborator

Signed-off-by: Ahmed Hussein (amahussein) [email protected]

Fixes #1524

This commit uses a smaller class, StageInfoStub, to store Spark's StageInfo. StageInfo is common to all the Spark implementations, but its constructor takes more fields in some versions than in others.

Currently we only use a subset of the class fields. The remaining fields are overhead or redundant storage, especially the accumulables and taskMetrics stored for each stage.
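
A minimal sketch of the stub idea, not the code from this PR. `FullStageInfo` is a hypothetical stand-in for Spark's StageInfo so the example runs without Spark on the classpath, and the fields kept by the stub are assumptions chosen for illustration. Only the names `StageInfoStub` and `fromStageInfo` come from the diff in this conversation.

```scala
// Hypothetical stand-in for Spark's StageInfo. The real class has more
// fields, and its constructor differs across Spark versions.
final case class FullStageInfo(
    stageId: Int,
    attemptId: Int,
    name: String,
    numTasks: Int,
    details: String,
    submissionTime: Option[Long],
    completionTime: Option[Long],
    failureReason: Option[String],
    accumulables: Map[Long, String], // can be large, held per stage
    taskMetrics: Map[String, Long])  // can be large, held per stage

// Keeps only the fields assumed to be read later. The field list is
// illustrative, not the one used in the PR.
final case class StageInfoStub(
    stageId: Int,
    attemptId: Int,
    name: String,
    numTasks: Int,
    details: String,
    submissionTime: Option[Long],
    completionTime: Option[Long],
    failureReason: Option[String])

object StageInfoStub {
  // Copies the retained fields and drops accumulables and taskMetrics,
  // so the full object can be garbage collected once the stub is stored.
  def fromStageInfo(info: FullStageInfo): StageInfoStub =
    StageInfoStub(
      info.stageId,
      info.attemptId,
      info.name,
      info.numTasks,
      info.details,
      info.submissionTime,
      info.completionTime,
      info.failureReason)
}
```

Because the stub holds no reference to the original object, the per-stage accumulables and metrics are not retained once the conversion is done.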

To evaluate the memory optimization, a new Checkpoint mechanism was added to allow gathering information at separate stages of the execution.
The checkpoint design and implementation can be further improved and extended to build a performance profile to compare different tradeoffs.

This pull request introduces a new runtime checkpointing feature for performance benchmarking and debugging in the RAPIDS tools. The main changes include adding new classes for runtime checkpoints, updating build properties, and modifying existing classes to integrate the new checkpointing functionality.

Memory evaluation:

  • To enable the runtime injection, set the build property benchmarks.checkpoints to dev, either by changing the pom.xml file or by passing the property as an argument to the mvn command.
  • With this property set, the free memory is dumped after the eventlog is processed.
  • An initial test on a sample eventlog showed a saving of approximately 7MB. This result depends on the number of completed stages in the eventlog and the accumulables associated with each one.
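
The checkpoint selection described above could look roughly like the following sketch. Only `NoOpRuntimeCheckpoint`, the property name benchmarks.checkpoints, and the value dev come from this PR; the trait shape, `DevRuntimeCheckpoint`, `insertMemoryMarker`, and `forMode` are hypothetical names. How the build property reaches the running code is not shown in this conversation, so the sketch takes the mode as a plain string. On the command line, Maven's standard `-D` syntax would be `-Dbenchmarks.checkpoints=dev`.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical interface: a labelled point in the run where memory is sampled.
trait RuntimeCheckpoint {
  def insertMemoryMarker(label: String): Unit
}

// Default implementation: does nothing, so normal runs pay no measurement cost.
class NoOpRuntimeCheckpoint extends RuntimeCheckpoint {
  override def insertMemoryMarker(label: String): Unit = ()
}

// Hypothetical dev implementation: records and prints the JVM's free memory.
class DevRuntimeCheckpoint extends RuntimeCheckpoint {
  private val markers = ArrayBuffer.empty[(String, Long)]

  override def insertMemoryMarker(label: String): Unit = {
    System.gc() // only a hint; the JVM may ignore it
    val freeBytes = Runtime.getRuntime.freeMemory()
    markers += ((label, freeBytes))
    println(f"[$label] free memory: ${freeBytes / (1024.0 * 1024.0)}%.2f MB")
  }

  // Markers recorded so far, in insertion order.
  def recorded: Seq[(String, Long)] = markers.toSeq
}

object RuntimeCheckpoint {
  // Assumed selection rule: "dev" enables measurement, anything else is a no-op.
  def forMode(mode: String): RuntimeCheckpoint =
    if (mode == "dev") new DevRuntimeCheckpoint
    else new NoOpRuntimeCheckpoint
}
```

Free-memory readings depend on garbage collection timing, so a difference between two runs is best treated as an estimate rather than an exact saving.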

Code Changes

Main core changes

Build and Configuration Updates:

Runtime Checkpointing Feature:

Integration of Checkpoints:

@amahussein added the core_tools (Scope the core module (scala)) and performance (performance and scalability of tools) labels Feb 4, 2025
@amahussein amahussein self-assigned this Feb 4, 2025
Collaborator

@parthosa parthosa left a comment

Thanks @amahussein. A minor typo. This framework for logging memory used looks great.

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Collaborator Author

@amahussein amahussein left a comment

Thanks @parthosa

Collaborator

@parthosa parthosa left a comment

Thanks @amahussein. LGTM.

-  private def initStageInfo(newStageInfo: StageInfo): StageInfo = {
-    newStageInfo
+  private def initStageInfo(newStageInfo: StageInfo): StageInfoStub = {
+    StageInfoStub.fromStageInfo(newStageInfo)
Collaborator

@sayedbilalbari sayedbilalbari Feb 5, 2025

@amahussein Currently we reassign the StageInfo object, updating the StageModel class with the incoming StageInfo object.
Now that we use a stub and create a new object, could we reuse the existing stub object when StageModel is updated and just update its variables? Currently we allocate a new stub in all cases.

Collaborator Author

mmm, yeah, it is possible to update only the fields that changed.
The reasoning for allocating a new object instead:

  • This update should only happen once, when the stage is completed, so it is not a very frequent event.
  • Updating only some fields could lead to bugs. When we extend this object in the future, the dev will have to make sure that they handle the fields correctly (which ones can be updated and which ones cannot).
  • Allocating a new object makes the code simpler, especially to maintain moving forward.
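
The tradeoff discussed in this thread can be sketched as follows. These are simplified, hypothetical shapes: the real StageModel and StageInfoStub have more members, and `updateInfo`, `info`, and `MutableStageRecord` are names invented for the example.

```scala
// Trimmed, hypothetical stub with only the fields needed for the example.
final case class StageInfoStub(
    stageId: Int,
    completionTime: Option[Long],
    failureReason: Option[String])

// Approach described in the reply: on update, swap in a freshly built
// immutable stub. Nothing has to track which fields are allowed to change.
class StageModel(initial: StageInfoStub) {
  private var stageInfo: StageInfoStub = initial

  def updateInfo(newInfo: StageInfoStub): Unit = {
    stageInfo = newInfo
  }

  def info: StageInfoStub = stageInfo
}

// Alternative raised in the review: keep one object and mutate its fields.
// It avoids an allocation, but every field added later needs a matching
// line in updateFrom, otherwise it silently keeps a stale value.
class MutableStageRecord(val stageId: Int) {
  var completionTime: Option[Long] = None
  var failureReason: Option[String] = None

  def updateFrom(s: StageInfoStub): Unit = {
    completionTime = s.completionTime
    failureReason = s.failureReason
  }
}
```

Since the update is expected once per completed stage, the extra allocation in the first approach is small compared with the maintenance risk of the second.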

Collaborator

@cindyyuanjiang cindyyuanjiang left a comment

Thanks @amahussein! A very minor nit.

    } else { // loads the noOp implementation by default
      new NoOpRuntimeCheckpoint
    }
  }
Collaborator

nit: new line between 2 defs

@amahussein amahussein merged commit 14255f4 into NVIDIA:dev Feb 6, 2025
13 checks passed
@amahussein amahussein deleted the rapids-tools-1524 branch February 6, 2025 14:19
Labels: core_tools (Scope the core module (scala)), performance (performance and scalability of tools)
Linked issue: [BUG] Use a stub to store Spark StageInfo
4 participants