Add support for configurable qualx label column #1528
Conversation
Signed-off-by: Lee Yang <[email protected]>
Thanks @leewyang. LGTM. Could we get an additional review from @eordentlich?
Thanks @leewyang!
A few comments related to the env variable naming and usage.
Environment variables:
- QUALX_CACHE_DIR: cache directory for saving Profiler output.
- QUALX_DATA_DIR: data directory containing eventlogs, primarily used in dataset JSON files.
- QUALX_DIR: root directory for Qualx execution, primarily used in dataset JSON files to locate dataset-specific plugins.
- QUALX_LABEL: targeted label column for XGBoost model.
- SPARK_RAPIDS_TOOLS_JAR: path to Spark RAPIDS Tools JAR file.
- In the user-tools wrapper we used a common pattern across all environment variables: RAPIDS_USER_TOOLS_*. Shall we apply the same concept to the QualX-related ones?
- For QUALX_CACHE_DIR: there is a cache directory already used by the tools wrapper. Can we use the same value for both to reduce the number of variables needed by the tools? The tools use the env variable RAPIDS_USER_TOOLS_CACHE_FOLDER, which defaults to /var/tmp/spark_rapids_user_tools_cache.
@amahussein I think there are a lot of scripts/tools that use these at the moment, so I'd leave renaming for another time. My hope is that this new config.py file will make it easier to refactor/rename in the future (while keeping changes minimal for now).
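For readers following along, here is a minimal sketch of the configurable-label accessor this PR describes; the get_label name and the Duration default come from the PR itself, but treat the exact signature and docstring as illustrative.

import os


def get_label() -> str:
    """Label column targeted by the XGBoost model, overridable via QUALX_LABEL."""
    # Defaults to wall-clock 'Duration' (the pre-existing behavior); set
    # QUALX_LABEL=duration_sum to target the sum of task durations instead.
    return os.environ.get('QUALX_LABEL', 'Duration')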
def get_cache_dir() -> str:
    """Get cache directory to save Profiler output."""
    return os.environ.get('QUALX_CACHE_DIR', 'qualx_cache')
We can use the utility methods to get/set the env variables. For RAPIDS_USER_TOOLS environment variables, they take care of adding the prefix.
spark-rapids-tools/user_tools/src/spark_rapids_pytools/common/utilities.py, lines 103 to 118 in 14255f4:
@classmethod
def find_full_rapids_tools_env_key(cls, actual_key: str) -> str:
    return f'RAPIDS_USER_TOOLS_{actual_key}'

@classmethod
def get_sys_env_var(cls, k: str, def_val=None) -> Optional[str]:
    return os.environ.get(k, def_val)

@classmethod
def get_rapids_tools_env(cls, k: str, def_val=None):
    val = cls.get_sys_env_var(cls.find_full_rapids_tools_env_key(k), def_val)
    return val

@classmethod
def set_rapids_tools_env(cls, k: str, val):
    os.environ[cls.find_full_rapids_tools_env_key(k)] = str(val)
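As an illustration of that suggestion, get_cache_dir could delegate to these helpers. The Utils class name and the CACHE_FOLDER key are assumptions inferred from the snippet and comment above, not code from this PR.

# Hypothetical sketch: reuse the wrapper's env helpers so the
# RAPIDS_USER_TOOLS_ prefix and defaults are handled in one place.
from spark_rapids_pytools.common.utilities import Utils


def get_cache_dir() -> str:
    """Get cache directory to save Profiler output."""
    # Resolves RAPIDS_USER_TOOLS_CACHE_FOLDER, falling back to the qualx default.
    return Utils.get_rapids_tools_env('CACHE_FOLDER', 'qualx_cache')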
Same comment as above.
    'fraction_supported',
    'description',
]
if 'split' in cpu_aug_tbl:
    select_columns.append('split')

if label in cpu_aug_tbl:
Should this be an error at this point if not true?
Good point, I think I was just being overly cautious here; will update the code.
Actually, this code is used for non-training prediction too, so the label might not actually be in the table in that case.
expected_model_features.remove(label)
if label == 'duration_sum':
    # for 'duration_sum' label, also remove 'duration_mean' since it's related to 'duration_sum'
    expected_model_features.remove('duration_mean')
Keeping duration_mean could give an opportunity for a non-linear speedup estimate based on duration_mean. Not sure it should be removed.
duration_mean is directly computed from duration_sum / numTasks_sum, so I was trying to avoid leaking any (duration_sum) label information in the training features.
That said, the true label would be the ratio of CPU/GPU duration_sum, so it might be ok to leave.
Yes the true label has GPU duration. Anything with CPU info is ok and doesn't leak. Though may be useless/unnecessary.
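To make the leakage concern concrete, here is a small illustrative check; the DataFrame contents are made up, and only the column relationship comes from the discussion above.

import pandas as pd

# duration_mean is fully determined by duration_sum and numTasks_sum, so a model
# given duration_mean (plus task counts) can recover a duration_sum-based label.
df = pd.DataFrame({'duration_sum': [1200.0, 300.0], 'numTasks_sum': [4, 3]})
df['duration_mean'] = df['duration_sum'] / df['numTasks_sum']
assert (df['duration_mean'] * df['numTasks_sum']).equals(df['duration_sum'])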
@@ -448,6 +452,10 @@ def combine_tables(table_name: str) -> pd.DataFrame:
        fallback_reason=f'Empty feature tables found after preprocessing: {empty_tables_str}')
    return pd.DataFrame()

if get_label() == 'duration_sum':
    # override appDuration with sum(duration_sum) across all stages
    app_tbl['appDuration'] = job_stage_agg_tbl['duration_sum'].astype(float).sum()
Does duration_sum only include SQL ids? I guess there is no way to get duration_sum for the non-SQL parts?
Yes, although I'm seeing some weirdness with this number, so I'm taking another look.
Fixed this code to correctly aggregate by appId, but the duration_sum is still only derived from whatever is available in the job_stage_agg_tbl.
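A rough sketch of that per-appId aggregation is below; the appId and duration_sum column names follow the discussion above, the toy tables are made up, and the actual implementation in the PR may differ.

import pandas as pd

# Toy tables standing in for job_stage_agg_tbl / app_tbl from the diff above.
job_stage_agg_tbl = pd.DataFrame({'appId': ['app-1', 'app-1', 'app-2'],
                                  'duration_sum': [100.0, 50.0, 30.0]})
app_tbl = pd.DataFrame({'appId': ['app-1', 'app-2']})

# Sum task durations per application and use that as each app's appDuration,
# rather than a single global sum across all apps.
per_app = job_stage_agg_tbl.groupby('appId')['duration_sum'].sum()
app_tbl['appDuration'] = app_tbl['appId'].map(per_app).astype(float)
# app-1 -> 150.0, app-2 -> 30.0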
…e_row_with_default_speedup Signed-off-by: Lee Yang <[email protected]>
Signed-off-by: Lee Yang <[email protected]>
Signed-off-by: Lee Yang <[email protected]>
    'fraction_supported',
    'description',
]
if 'split' in cpu_aug_tbl:
    select_columns.append('split')

if label not in cpu_aug_tbl:
    raise ValueError(f'{label} column not found in input data')
I think my original comment on this was wrong, since in the case of prediction (i.e. no training) there wouldn't be a label column.
expected_model_features.remove(label)
if label == 'duration_sum':
    # for 'duration_sum' label, also remove 'duration_mean' since it's related to 'duration_sum'
    expected_model_features.remove('duration_mean')
Yes the true label has GPU duration. Anything with CPU info is ok and doesn't leak. Though may be useless/unnecessary.
Signed-off-by: Lee Yang <[email protected]>
👍
This PR adds support for a configurable label column for the qualx xgboost model. The default value is Duration (wall-clock), which is the current behavior. Note that we subsequently derive Duration_speedup (from the CPU and GPU Duration values), which becomes the actual label for the model. This adds support for a new label: duration_sum (sum of task durations). This does not include any pre-trained models with the duration_sum target, since this is mostly intended for custom model use-cases where Duration might be insufficient or undesired.

Changes:
- config.py module to host configurable variables, including a new QUALX_LABEL env variable to define which label to use.
- Support for the duration_sum target, e.g. removing label columns from features, recomputing appDuration, etc.

Test
I have confirmed that the models produced before and after this commit are identical (when using the default Duration).

The following CMDs have been tested:
spark_rapids prediction

Internal usage:
python qualx_main.py preprocess
python qualx_main.py train
python qualx_main.py predict
python qualx_main.py evaluate
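As a usage sketch of the new variable (standard library only; extra CLI arguments such as dataset paths are omitted here and would be needed in practice):

import os
import subprocess

# Train against the sum of task durations instead of the default wall-clock Duration.
os.environ['QUALX_LABEL'] = 'duration_sum'
subprocess.run(['python', 'qualx_main.py', 'train'], check=True)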