Fix (scaling/standalone): better switch from runtime stats to param #1099

Merged 9 commits on Jan 7, 2025

@Giuseppe5 Giuseppe5 commented Nov 21, 2024

Reason for this PR

Currently, if we switch from training to eval before collect_stats_steps is reached, the value parameter is never updated to store the buffer value. This has a few side effects:

  • When applying learned round, we might keep the model in eval mode but still accumulate gradients. If the value parameter is not being used, no gradients are accumulated for it.
  • When exporting the state_dict, value is not exported.
  • When doing PTQ calibration, the current setup never converts the buffer to its corresponding value parameter, which causes some of the issues mentioned above.
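
The failure mode described above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyStatsScale` is a hypothetical class that uses plain floats instead of tensors, a running mean of the abs-max as the statistic, and a placeholder initial `value` of 1.0. It is not the Brevitas implementation.

```python
class ToyStatsScale:
    """Toy stand-in for a scale initialised from runtime statistics.

    Hypothetical: plain floats instead of tensors, not the Brevitas code.
    """

    def __init__(self, collect_stats_steps):
        self.collect_stats_steps = collect_stats_steps
        self.counter = 0      # number of stats-collection steps seen so far
        self.training = True
        self.buffer = 0.0     # running mean of the observed statistic
        self.value = 1.0      # placeholder for the learnable parameter

    def forward(self, x):
        if self.training and self.counter < self.collect_stats_steps:
            # Stats phase: fold the abs-max of this batch into the running mean.
            stat = max(abs(v) for v in x)
            self.buffer = (self.buffer * self.counter + stat) / (self.counter + 1)
            self.counter += 1
            return self.buffer
        if self.training and self.counter == self.collect_stats_steps:
            # Switch point: copy the buffer into the parameter exactly once.
            self.value = self.buffer
            self.counter += 1
        if self.counter <= self.collect_stats_steps:
            # Assumed pre-fix eval path: read the buffer, leave `value` alone.
            return self.buffer
        return self.value


scale = ToyStatsScale(collect_stats_steps=5)
scale.forward([1.0, -2.0])   # buffer becomes 2.0
scale.forward([3.0, -4.0])   # buffer becomes 3.0
scale.training = False       # switch to eval after 2 of 5 steps
out = scale.forward([10.0])
print(out, scale.value)      # 3.0 1.0
```

In this toy, the eval output comes from the buffer while `value` keeps its placeholder, so anything that relies on `value` (gradients, export) would not see the collected statistic.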

Changes Made in this PR

At eval time, during the first iteration, the buffer is always converted to the parameter.
The side effect shows up when the user switches between training and evaluation mode multiple times very early in the training process. Although it is common to switch between training and eval to check the loss on the validation set, this is usually done after enough iterations that the buffer has already been converted to the parameter anyway.
I admit that it could be marked as a breaking change for this edge case.

This has been removed in a more recent commit. I believe there are no more breaking changes at this point.

All fixed, no more breaking changes.
After calibration, we forcefully convert the buffer to parameters.

Testing Summary

Risk Highlight

  • This PR includes code from another work (please detail).
  • This PR contains API-breaking changes.
  • This PR depends on work in another PR (please provide links/details).
  • This PR introduces new dependencies (please detail).
  • There are coverage gaps not covered by tests.
  • Documentation updates required in subsequent PR.

Checklist

  • Code comments added to any hard-to-understand areas, if applicable.
  • Changes generate no new warnings.
  • Updated any relevant tests, if applicable.
  • No conflicts with destination dev branch.
  • I reviewed my own code changes.
  • Initial CI/CD passing.
  • 1+ reviews given, and any review issues addressed and approved.
  • Post-review full CI/CD passing.

@Giuseppe5 Giuseppe5 requested review from nickfraser and removed request for nickfraser November 21, 2024 14:57

@nickfraser nickfraser left a comment

The side effect of this happens in the case the user would want to switch multiple times between training/evaluation mode very early on in the training process.

Could the accuracy difference in the LLM tests be caused by this? I think the tests run at some very small seqlen (2?)

Otherwise, LGTM!

@Giuseppe5 Giuseppe5 requested a review from nickfraser December 16, 2024 17:04

@nickfraser nickfraser left a comment

Check comment about self.init_done, otherwise LGTM!

src/brevitas/core/scaling/standalone.py (outdated, resolved)
@Giuseppe5 Giuseppe5 requested a review from nickfraser January 6, 2025 12:39
@@ -375,6 +375,12 @@ def __init__(
        self.restrict_scaling_pre = restrict_scaling_impl.restrict_init_module()
        self.restrict_threshold_pre = restrict_threshold_impl.restrict_init_module()

    def init_scale(self):
        if self.counter <= self.collect_stats_steps:

This avoids a double re-init, since we're modifying the value of value in place.
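
A minimal sketch of why the guard matters, assuming the parameter is updated in place by multiplying it with the buffer and that the counter is moved past `collect_stats_steps` afterwards. Both are assumptions made for illustration: the hunk above only shows the guard itself, and `ToyScale` is a hypothetical class using plain floats rather than the Brevitas module.

```python
class ToyScale:
    """Hypothetical toy; plain floats, not the Brevitas module."""

    def __init__(self, collect_stats_steps):
        self.collect_stats_steps = collect_stats_steps
        self.counter = 0
        self.buffer = 0.0   # running mean collected during calibration
        self.value = 1.0    # stand-in for the parameter, starts at identity

    def collect(self, stat):
        # Fold one observed statistic into the running mean.
        self.buffer = (self.buffer * self.counter + stat) / (self.counter + 1)
        self.counter += 1

    def init_scale(self):
        # Same guard as in the hunk above; the body is assumed.
        if self.counter <= self.collect_stats_steps:
            self.value *= self.buffer  # in-place update of the parameter
            # Assumption: move the counter past the threshold so the guard
            # fails on any later call.
            self.counter = self.collect_stats_steps + 1


scale = ToyScale(collect_stats_steps=10)
for stat in (2.0, 4.0):   # calibration stops early, after 2 of 10 steps
    scale.collect(stat)
scale.init_scale()        # forced conversion after calibration
scale.init_scale()        # second call is a no-op because of the guard
print(scale.value)        # 3.0
```

Without the guard, the second call in this toy would multiply `value` by the buffer again and yield 9.0 instead of 3.0.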

@Giuseppe5 Giuseppe5 requested review from nickfraser and removed request for nickfraser January 6, 2025 17:45
@Giuseppe5 Giuseppe5 merged commit 726ea3c into Xilinx:dev Jan 7, 2025
384 of 396 checks passed