Add support for configurable qualx label column #1528
Conversation
Signed-off-by: Lee Yang <[email protected]>
Thanks @leewyang. LGTM. Could we get an additional review from @eordentlich?
Thanks @leewyang!
A few comments related to the env variable naming and usage.
Environment variables:
- QUALX_CACHE_DIR: cache directory for saving Profiler output.
- QUALX_DATA_DIR: data directory containing eventlogs, primarily used in dataset JSON files.
- QUALX_DIR: root directory for Qualx execution, primarily used in dataset JSON files to locate dataset-specific plugins.
- QUALX_LABEL: targeted label column for XGBoost model.
- SPARK_RAPIDS_TOOLS_JAR: path to Spark RAPIDS Tools JAR file.
- In the user-tools wrapper we used a common pattern across all environment variables: RAPIDS_USER_TOOLS_*. Shall we apply the same concept to the QualX-related ones?
- For QUALX_CACHE_DIR: there is a cache directory already used by the tools wrapper. Can we use the same value for both to reduce the number of variables needed by the tools? The tools use the env variable RAPIDS_USER_TOOLS_CACHE_FOLDER, which defaults to /var/tmp/spark_rapids_user_tools_cache.
@amahussein I think there are a lot of scripts/tools that use these at the moment, so I'd leave renaming for another time. My hope is that this new config.py file will make it easier to refactor/rename in the future (while keeping changes minimal for now).
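For readers following along, here is a minimal sketch of the configurable-label accessor this PR describes; the get_label name and the Duration default come from the PR itself, but treat the exact signature and docstring as illustrative.

import os


def get_label() -> str:
    """Label column targeted by the XGBoost model, overridable via QUALX_LABEL."""
    # Defaults to wall-clock 'Duration' (the pre-existing behavior); set
    # QUALX_LABEL=duration_sum to target the sum of task durations instead.
    return os.environ.get('QUALX_LABEL', 'Duration')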
def get_cache_dir() -> str:
    """Get cache directory to save Profiler output."""
    return os.environ.get('QUALX_CACHE_DIR', 'qualx_cache')
We can use the utility methods to get/set the env variables. For RAPIDS_USER_TOOLS environment variables, they take care of adding the prefix.
spark-rapids-tools/user_tools/src/spark_rapids_pytools/common/utilities.py, lines 103 to 118 in 14255f4:
@classmethod
def find_full_rapids_tools_env_key(cls, actual_key: str) -> str:
    return f'RAPIDS_USER_TOOLS_{actual_key}'

@classmethod
def get_sys_env_var(cls, k: str, def_val=None) -> Optional[str]:
    return os.environ.get(k, def_val)

@classmethod
def get_rapids_tools_env(cls, k: str, def_val=None):
    val = cls.get_sys_env_var(cls.find_full_rapids_tools_env_key(k), def_val)
    return val

@classmethod
def set_rapids_tools_env(cls, k: str, val):
    os.environ[cls.find_full_rapids_tools_env_key(k)] = str(val)
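As an illustration of that suggestion, get_cache_dir could delegate to these helpers. The Utils class name and the CACHE_FOLDER key are assumptions inferred from the snippet and comment above, not code from this PR.

# Hypothetical sketch: reuse the wrapper's env helpers so the
# RAPIDS_USER_TOOLS_ prefix and defaults are handled in one place.
from spark_rapids_pytools.common.utilities import Utils


def get_cache_dir() -> str:
    """Get cache directory to save Profiler output."""
    # Resolves RAPIDS_USER_TOOLS_CACHE_FOLDER, falling back to the qualx default.
    return Utils.get_rapids_tools_env('CACHE_FOLDER', 'qualx_cache')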
Same comment as above.
    'fraction_supported',
    'description',
]
if 'split' in cpu_aug_tbl:
    select_columns.append('split')

if label in cpu_aug_tbl:
Should this be an error at this point if not true?
Good point, I think I was just being overly cautious here; will update the code.
Actually, this code is used for non-training prediction too, so the label might not actually be in the table in that case.
expected_model_features.remove(label)
if label == 'duration_sum':
    # for 'duration_sum' label, also remove 'duration_mean' since it's related to 'duration_sum'
    expected_model_features.remove('duration_mean')
Keeping duration_mean could give an opportunity for a non-linear speedup estimate based on duration_mean. Not sure it should be removed.
duration_mean is directly computed from duration_sum / numTasks_sum, so I was trying to avoid leaking any (duration_sum) label information in the training features.
That said, the true label would be the ratio of CPU/GPU duration_sum, so it might be ok to leave.
Yes the true label has GPU duration. Anything with CPU info is ok and doesn't leak. Though may be useless/unnecessary.
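To make the leakage concern concrete, here is a small illustrative check; the DataFrame contents are made up, and only the column relationship comes from the discussion above.

import pandas as pd

# duration_mean is fully determined by duration_sum and numTasks_sum, so a model
# given duration_mean (plus task counts) can recover a duration_sum-based label.
df = pd.DataFrame({'duration_sum': [1200.0, 300.0], 'numTasks_sum': [4, 3]})
df['duration_mean'] = df['duration_sum'] / df['numTasks_sum']
assert (df['duration_mean'] * df['numTasks_sum']).equals(df['duration_sum'])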
@@ -448,6 +452,10 @@ def combine_tables(table_name: str) -> pd.DataFrame:
        fallback_reason=f'Empty feature tables found after preprocessing: {empty_tables_str}')
    return pd.DataFrame()

if get_label() == 'duration_sum':
    # override appDuration with sum(duration_sum) across all stages
    app_tbl['appDuration'] = job_stage_agg_tbl['duration_sum'].astype(float).sum()
Does duration_sum only include SQL ids? I guess there is no way to get duration_sum for the non-SQL parts?
Yes, although I'm seeing some weirdness with this number, so I'm taking another look.
Fixed this code to correctly aggregate by appId, but the duration_sum is still only derived from whatever is available in the job_stage_agg_tbl.
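A rough sketch of that per-appId aggregation is below; the appId and duration_sum column names follow the discussion above, the toy tables are made up, and the actual implementation in the PR may differ.

import pandas as pd

# Toy tables standing in for job_stage_agg_tbl / app_tbl from the diff above.
job_stage_agg_tbl = pd.DataFrame({'appId': ['app-1', 'app-1', 'app-2'],
                                  'duration_sum': [100.0, 50.0, 30.0]})
app_tbl = pd.DataFrame({'appId': ['app-1', 'app-2']})

# Sum task durations per application and use that as each app's appDuration,
# rather than a single global sum across all apps.
per_app = job_stage_agg_tbl.groupby('appId')['duration_sum'].sum()
app_tbl['appDuration'] = app_tbl['appId'].map(per_app).astype(float)
# app-1 -> 150.0, app-2 -> 30.0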
…e_row_with_default_speedup Signed-off-by: Lee Yang <[email protected]>
Signed-off-by: Lee Yang <[email protected]>
Signed-off-by: Lee Yang <[email protected]>
    'fraction_supported',
    'description',
]
if 'split' in cpu_aug_tbl:
    select_columns.append('split')

if label not in cpu_aug_tbl:
    raise ValueError(f'{label} column not found in input data')
I think my original comment on this was wrong, since in the case of prediction (i.e. no training) there wouldn't be a label column.
expected_model_features.remove(label)
if label == 'duration_sum':
    # for 'duration_sum' label, also remove 'duration_mean' since it's related to 'duration_sum'
    expected_model_features.remove('duration_mean')
Yes the true label has GPU duration. Anything with CPU info is ok and doesn't leak. Though may be useless/unnecessary.
Signed-off-by: Lee Yang <[email protected]>
👍
This PR adds support for a configurable label column for the qualx xgboost model. The default value is Duration (wall-clock), which is the current behavior. Note that we subsequently derive Duration_speedup (from the CPU and GPU Duration values), which becomes the actual label for the model. This adds support for a new label: duration_sum (sum of task durations). This does not include any pre-trained models with the duration_sum target, since this is mostly intended for custom model use-cases where Duration might be insufficient or undesired.

Changes:
- config.py module to host configurable variables, including a new QUALX_LABEL env variable to define which label to use.
- Support for the duration_sum target, e.g. removing label columns from features, recomputing appDuration, etc.

Test
I have confirmed that the models produced before and after this commit are identical (when using the default Duration).

The following CMDs have been tested:
spark_rapids prediction

Internal usage:
python qualx_main.py preprocess
python qualx_main.py train
python qualx_main.py predict
python qualx_main.py evaluate
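As a usage sketch of the new variable (standard library only; extra CLI arguments such as dataset paths are omitted here and would be needed in practice):

import os
import subprocess

# Train against the sum of task durations instead of the default wall-clock Duration.
os.environ['QUALX_LABEL'] = 'duration_sum'
subprocess.run(['python', 'qualx_main.py', 'train'], check=True)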