Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for configurable qualx label column #1528

Merged
merged 5 commits into from
Feb 7, 2025

Conversation

leewyang
Copy link
Collaborator

@leewyang leewyang commented Feb 5, 2025

This PR adds support for a configurable label column for the qualx xgboost model.

The default value is Duration (wall-clock), which is the current behavior. Note that we subsequently derive Duration_speedup (from the CPU and GPU Duration values) which becomes the actual label for the model.

This adds support for a new label: duration_sum (sum of task durations).

This does not include any pre-trained models with the duration_sum target, since this is mostly intended for custom model use-cases where Duration might be insufficient or undesired.

Changes:

  1. Adds a new config.py module to host configurable variables, including a new QUALX_LABEL env variable to define which label to use.
  2. Adds support for duration_sum target, e.g. removing label columns from features, recomputing appDuration, etc.
  3. Fixes minor bugs when encountering empty dataframes.

Test

I have confirmed that the models produced before and after this commit are identical (when using the default Duration).

Following CMDs have been tested:
spark_rapids prediction

Internal Usage:
python qualx_main.py preprocess
python qualx_main.py train
python qualx_main.py predict
python qualx_main.py evaluate

@leewyang leewyang added the user_tools Scope the wrapper module running CSP, QualX, and reports (python) label Feb 5, 2025
@leewyang leewyang self-assigned this Feb 5, 2025
Copy link
Collaborator

@parthosa parthosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @leewyang. LGTME. Could get additional review from @eordentlich

Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @leewyang !
Few comment related to the env variables naming and usage.

Comment on lines +17 to +23
Environment variables:
- QUALX_CACHE_DIR: cache directory for saving Profiler output.
- QUALX_DATA_DIR: data directory containing eventlogs, primarily used in dataset JSON files.
- QUALX_DIR: root directory for Qualx execution, primarily used in dataset JSON files to locate
dataset-specific plugins.
- QUALX_LABEL: targeted label column for XGBoost model.
- SPARK_RAPIDS_TOOLS_JAR: path to Spark RAPIDS Tools JAR file.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • In the user-tools wrapper we used a pattern across all environment variables RAPIDS_USER_TOOLS_*. Shall we apply the same concept for QualX related ones?
  • For, QUALX_CACHE_DIR: there is cache-directory used by the tools wrapper. Can we use the same value for both to reduce the number of variables needed by the tools? the tools uses env variable RAPIDS_USER_TOOLS_CACHE_FOLDER and it has default variable to /var/tmp/spark_rapids_user_tools_cache.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amahussein I think there are a lot of scripts/tools that use these at the moment, so I'd leave renaming for another time. My hope is that this new config.py file will make it easier to refactor/rename in the future (while keeping changes minimal for now).

Comment on lines +28 to +30
def get_cache_dir() -> str:
"""Get cache directory to save Profiler output."""
return os.environ.get('QUALX_CACHE_DIR', 'qualx_cache')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use the utility methods to get/set the env variables.
For RAPIDS_USER_TOOLS environments, it will take care of adding the prefix.

@classmethod
def find_full_rapids_tools_env_key(cls, actual_key: str) -> str:
return f'RAPIDS_USER_TOOLS_{actual_key}'
@classmethod
def get_sys_env_var(cls, k: str, def_val=None) -> Optional[str]:
return os.environ.get(k, def_val)
@classmethod
def get_rapids_tools_env(cls, k: str, def_val=None):
val = cls.get_sys_env_var(cls.find_full_rapids_tools_env_key(k), def_val)
return val
@classmethod
def set_rapids_tools_env(cls, k: str, val):
os.environ[cls.find_full_rapids_tools_env_key(k)] = str(val)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

'fraction_supported',
'description',
]
if 'split' in cpu_aug_tbl:
select_columns.append('split')

if label in cpu_aug_tbl:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be an error at this point if not true?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, think I was just being overly-cautious here, will update the code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this code is used for non-training prediction too so label might not actually be in the table in that case.

expected_model_features.remove(label)
if label == 'duration_sum':
# for 'duration_sum' label, also remove 'duration_mean' since it's related to 'duration_sum'
expected_model_features.remove('duration_mean')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping duration_mean could give opportunity for non-linear speedup estimate based on duration_mean. Not sure should be removed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duration_mean is directly computed from duration_sum / numTasks_sum, so I was trying to avoid leaking any (duration_sum) label information in the training features.

That said, the true label would be the ratio of CPU/GPU duration_sum, so it might be ok to leave.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the true label has GPU duration. Anything with CPU info is ok and doesn't leak. Though may be useless/unnecessary.

@@ -448,6 +452,10 @@ def combine_tables(table_name: str) -> pd.DataFrame:
fallback_reason=f'Empty feature tables found after preprocessing: {empty_tables_str}')
return pd.DataFrame()

if get_label() == 'duration_sum':
# override appDuration with sum(duration_sum) across all stages
app_tbl['appDuration'] = job_stage_agg_tbl['duration_sum'].astype(float).sum()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does duration_sum only include sql ids? I guess no way to get duration for sum for non sql parts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, although I'm seeing some weirdnesses w/ this number so taking another look.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this code to correctly aggregate by appId, but the duration_sum is still only derived from whatever is available in the job_stage_agg_tbl.

@leewyang leewyang requested a review from eordentlich February 6, 2025 23:22
'fraction_supported',
'description',
]
if 'split' in cpu_aug_tbl:
select_columns.append('split')

if label not in cpu_aug_tbl:
raise ValueError(f'{label} column not found in input data')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my original comment on this was wrong since in the case of prediction (i.e. no training) there wouldn't be a label comment.

expected_model_features.remove(label)
if label == 'duration_sum':
# for 'duration_sum' label, also remove 'duration_mean' since it's related to 'duration_sum'
expected_model_features.remove('duration_mean')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the true label has GPU duration. Anything with CPU info is ok and doesn't leak. Though may be useless/unnecessary.

@leewyang leewyang requested a review from eordentlich February 7, 2025 19:03
Copy link
Collaborator

@eordentlich eordentlich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@leewyang leewyang merged commit c8fedd7 into NVIDIA:dev Feb 7, 2025
13 checks passed
@leewyang leewyang deleted the qualx_label_cfg branch February 7, 2025 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
user_tools Scope the wrapper module running CSP, QualX, and reports (python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants