Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recommend G6 instead of G5 on EMR #1523

Merged
merged 4 commits into from
Feb 5, 2025

Conversation

sayedbilalbari
Copy link
Collaborator

@sayedbilalbari sayedbilalbari commented Feb 3, 2025

Fixes #1512

Description of the changes -

  1. This PR changes the instances definition for EMR to use and recommend G5 GPUs instead of G6
  2. Databricks_AWS will still recommend g5 till the availability of G6 instances is consistent

Changes -

  1. Platform.scala - this file defines all the available platform classes. Have updated EMR and Databricks to use the correct instance definition map
  2. No difference core/memory wise between G5 and G6 instances. The only change is the type of GPU( A10 vs L4).
  3. Hence the only change is the name from the instance type definition.

Sayed Bilal Bari and others added 2 commits February 3, 2025 11:42
Copy link
Collaborator

@parthosa parthosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sayedbilalbari. Minor comment.

Copy link
Collaborator

@parthosa parthosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bilalbari. LGTME.

Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sayedbilalbari for the fix!

Is there any change needed to be done on the Python side?
@parthosa Is there a scenario where the python needs to use the G6 information to build the cluster info if the cluster-property is provided in the CLI argument?

@amahussein amahussein added the core_tools Scope the core module (scala) label Feb 4, 2025
@amahussein
Copy link
Collaborator

Platform.py - this file defines all the available platform classes. Have updated EMR and Databricks to use the correct instance definiation map

typo in the PR description Platform.py -> Platform.scala

@sayedbilalbari
Copy link
Collaborator Author

@amahussein I verified if there was a change needed on the python side. My observations and you can correct me if I missed something -

  1. Cluster recommendations on python tools side is not being used and now part of AutoTuner - Remove calculation of gpu cluster recommendation from python tool when cluster argument is passed #1278
  2. As part of this PR, gpu cluster inference was deprecated on user_tools side
  3. Also, user_tools initializes the cluster definitions from the json file, which already contains the definitions for g6 instances
  4. I verified this by checking that this was being initialized ToolsContext object which contained definition for g6.
  5. Qualification did not end up using it. But even if it is being used in the profiling, there should be no issue.

@parthosa Please can you verify.

@parthosa
Copy link
Collaborator

parthosa commented Feb 4, 2025

Is there any change needed to be done on the Python side?
@parthosa Is there a scenario where the python needs to use the G6 information to build the cluster info if the cluster-property is provided in the CLI argument?

@amahussein No changes are required on the Python side. The user provided source cluster logic is unaffected, as these changes only impact the recommended cluster.

Cluster recommendations on python tools side is not being used and now part of AutoTuner

That is correct @sayedbilalbari

Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sayedbilalbari !
LGTME!

@amahussein amahussein merged commit fcb4e94 into NVIDIA:dev Feb 5, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core_tools Scope the core module (scala)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Recommend g6 instead of g5 instances on AWS
3 participants