Fix (scaling/standalone): better switch from runtime stats to param #1099
Conversation
The side effect of this happens in the case where the user wants to switch multiple times between training/evaluation mode very early in the training process.
Could the accuracy difference in the LLM tests be caused by this? I think the tests run at some very small seqlen (2?).
Otherwise, LGTM!
Check comment about self.init_done, otherwise LGTM!
@@ -375,6 +375,12 @@ def __init__(
        self.restrict_scaling_pre = restrict_scaling_impl.restrict_init_module()
        self.restrict_threshold_pre = restrict_threshold_impl.restrict_init_module()

    def init_scale(self):
        if self.counter <= self.collect_stats_steps:
This avoids double re-init, since we're modifying the value of value in place.
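For context, here is a minimal sketch of the pattern being discussed. The attribute names (value, buffer, counter, collect_stats_steps, init_done) mirror the conversation and are illustrative, not the actual Brevitas implementation: the statistics buffer is copied into the learnable parameter in place, and an init_done flag prevents the copy from running twice.

```python
import torch
import torch.nn as nn


class RuntimeStatsToParamSketch(nn.Module):
    """Illustrative sketch only; names mirror the discussion, not the real code."""

    def __init__(self, collect_stats_steps: int):
        super().__init__()
        self.collect_stats_steps = collect_stats_steps
        self.counter = 0
        self.init_done = False
        self.register_buffer('buffer', torch.tensor(1.0))
        self.value = nn.Parameter(torch.tensor(1.0))

    def init_scale(self):
        # Only relevant when stats collection was interrupted early;
        # otherwise the parameter is filled at the end of collection.
        if not self.init_done and self.counter <= self.collect_stats_steps:
            with torch.no_grad():
                # The in-place copy keeps the same Parameter object, so existing
                # optimizer references stay valid, and the init_done flag makes
                # a second call a no-op (no double re-init).
                self.value.copy_(self.buffer)
            self.init_done = True
```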
Reason for this PR
Currently, if we switch from training to eval before stats_collection_steps is done, we never update the value parameter to store the buffer value. This has a few side effects: among them, eval keeps relying on the runtime-stats buffer rather than value, causing some of the issues mentioned above.
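To make the failure mode concrete, here is a self-contained toy (with hypothetical names; it is not the Brevitas module) that follows the same pattern of a runtime-stats buffer in training mode and a value parameter in eval mode:

```python
import torch
import torch.nn as nn


class ScaleBeforeFix(nn.Module):
    """Toy reproduction of the pre-fix behaviour (names are illustrative)."""

    def __init__(self, collect_stats_steps: int = 100):
        super().__init__()
        self.collect_stats_steps = collect_stats_steps
        self.counter = 0
        self.register_buffer('buffer', torch.tensor(1.0))
        self.value = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):
        if self.training and self.counter < self.collect_stats_steps:
            # Training: collect running statistics into the buffer.
            with torch.no_grad():
                self.buffer.copy_(0.9 * self.buffer + 0.1 * x.abs().max())
            self.counter += 1
            return x / self.buffer
        # Eval: uses `value`, which is only filled from the buffer once
        # collection finishes -- and never, if eval is entered early.
        return x / self.value


scale = ScaleBeforeFix(collect_stats_steps=100)
x = 10.0 * torch.randn(32)

scale.train()
for _ in range(3):      # switch to eval long before the 100 stats steps are done
    scale(x)

scale.eval()
print(scale.buffer)     # reflects the collected statistics
print(scale.value)      # still the initial value: the stats were never copied over
```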
Changes Made in this PR
At eval time, during the first iteration the buffer is always converted to a param. The side effect of this happens in the case where the user wants to switch multiple times between training/evaluation mode very early in the training process. Although it is common to switch between training/eval to check loss on the validation set, this is usually done after enough iterations that the buffer has already been converted to a parameter anyway.
I'll admit that it could be marked as a breaking change for these edge cases.
This has been removed in a more recent commit; all fixed, I believe there are no more breaking changes at this point.
After calibration, we forcefully convert the buffer to parameters.
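A sketch of the resulting behaviour, reusing the hypothetical names from the toy above: the first eval-mode (or post-calibration) forward forces the buffer into the parameter before it is used, so an early train/eval switch no longer leaves value stale.

```python
import torch
import torch.nn as nn


class ScaleAfterFix(nn.Module):
    """Toy sketch of the fixed behaviour (names are illustrative)."""

    def __init__(self, collect_stats_steps: int = 100):
        super().__init__()
        self.collect_stats_steps = collect_stats_steps
        self.counter = 0
        self.init_done = False
        self.register_buffer('buffer', torch.tensor(1.0))
        self.value = nn.Parameter(torch.tensor(1.0))

    def init_scale(self):
        # Convert the stats buffer into the parameter exactly once.
        if not self.init_done:
            with torch.no_grad():
                self.value.copy_(self.buffer)
            self.init_done = True

    def forward(self, x):
        if self.training and self.counter < self.collect_stats_steps:
            with torch.no_grad():
                self.buffer.copy_(0.9 * self.buffer + 0.1 * x.abs().max())
            self.counter += 1
            return x / self.buffer
        # First eval (or post-calibration) iteration: force the conversion
        # so `value` holds whatever statistics were collected so far.
        self.init_scale()
        return x / self.value
```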
Testing Summary
Risk Highlight
Checklist
Pull request is targeted at the dev branch.