Fix (scaling/standalone): better switch from runtime stats to param #1099
Conversation
The side effect of this happens in the case where the user wants to switch multiple times between training/evaluation mode very early in the training process.
Could the accuracy difference in the LLM tests be caused by this? I think the tests run at some very small seqlen (2?).
Otherwise, LGTM!
Check comment about self.init_done, otherwise LGTM!
@@ -375,6 +375,12 @@ def __init__(
        self.restrict_scaling_pre = restrict_scaling_impl.restrict_init_module()
        self.restrict_threshold_pre = restrict_threshold_impl.restrict_init_module()

    def init_scale(self):
        if self.counter <= self.collect_stats_steps:
This avoids double re-init, since we're modifying the value of value in place.
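For context, here is a minimal sketch of the pattern being discussed. The attribute names (value, buffer, counter, collect_stats_steps, init_done) mirror the conversation and are illustrative, not the actual Brevitas implementation: the statistics buffer is copied into the learnable parameter in place, and an init_done flag prevents the copy from running twice.

```python
import torch
import torch.nn as nn


class RuntimeStatsToParamSketch(nn.Module):
    """Illustrative sketch only; names mirror the discussion, not the real code."""

    def __init__(self, collect_stats_steps: int):
        super().__init__()
        self.collect_stats_steps = collect_stats_steps
        self.counter = 0
        self.init_done = False
        self.register_buffer('buffer', torch.tensor(1.0))
        self.value = nn.Parameter(torch.tensor(1.0))

    def init_scale(self):
        # Only relevant when stats collection was interrupted early;
        # otherwise the parameter is filled at the end of collection.
        if not self.init_done and self.counter <= self.collect_stats_steps:
            with torch.no_grad():
                # The in-place copy keeps the same Parameter object, so existing
                # optimizer references stay valid, and the init_done flag makes
                # a second call a no-op (no double re-init).
                self.value.copy_(self.buffer)
            self.init_done = True
```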
Reason for this PR
Currently, if we switch from training to eval before stats_collection_steps is done, we never update the value parameter to store the buffer value. This has a few side effects: among them, eval keeps relying on the runtime-stats buffer rather than value, causing some of the issues mentioned above.
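To make the failure mode concrete, here is a self-contained toy (with hypothetical names; it is not the Brevitas module) that follows the same pattern of a runtime-stats buffer in training mode and a value parameter in eval mode:

```python
import torch
import torch.nn as nn


class ScaleBeforeFix(nn.Module):
    """Toy reproduction of the pre-fix behaviour (names are illustrative)."""

    def __init__(self, collect_stats_steps: int = 100):
        super().__init__()
        self.collect_stats_steps = collect_stats_steps
        self.counter = 0
        self.register_buffer('buffer', torch.tensor(1.0))
        self.value = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):
        if self.training and self.counter < self.collect_stats_steps:
            # Training: collect running statistics into the buffer.
            with torch.no_grad():
                self.buffer.copy_(0.9 * self.buffer + 0.1 * x.abs().max())
            self.counter += 1
            return x / self.buffer
        # Eval: uses `value`, which is only filled from the buffer once
        # collection finishes -- and never, if eval is entered early.
        return x / self.value


scale = ScaleBeforeFix(collect_stats_steps=100)
x = 10.0 * torch.randn(32)

scale.train()
for _ in range(3):      # switch to eval long before the 100 stats steps are done
    scale(x)

scale.eval()
print(scale.buffer)     # reflects the collected statistics
print(scale.value)      # still the initial value: the stats were never copied over
```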
Changes Made in this PR
At eval time, during the first iteration the buffer is always converted to a param. The side effect of this happens in the case where the user wants to switch multiple times between training/evaluation mode very early in the training process. Although it is common to switch between training/eval to check loss on the validation set, this is usually done after enough iterations that the buffer has already been converted to a parameter anyway.
I'll admit that it could be marked as a breaking change for these edge cases.
This has been removed in a more recent commit; all fixed, I believe there are no more breaking changes at this point.
After calibration, we forcefully convert the buffer to parameters.
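A sketch of the resulting behaviour, reusing the hypothetical names from the toy above: the first eval-mode (or post-calibration) forward forces the buffer into the parameter before it is used, so an early train/eval switch no longer leaves value stale.

```python
import torch
import torch.nn as nn


class ScaleAfterFix(nn.Module):
    """Toy sketch of the fixed behaviour (names are illustrative)."""

    def __init__(self, collect_stats_steps: int = 100):
        super().__init__()
        self.collect_stats_steps = collect_stats_steps
        self.counter = 0
        self.init_done = False
        self.register_buffer('buffer', torch.tensor(1.0))
        self.value = nn.Parameter(torch.tensor(1.0))

    def init_scale(self):
        # Convert the stats buffer into the parameter exactly once.
        if not self.init_done:
            with torch.no_grad():
                self.value.copy_(self.buffer)
            self.init_done = True

    def forward(self, x):
        if self.training and self.counter < self.collect_stats_steps:
            with torch.no_grad():
                self.buffer.copy_(0.9 * self.buffer + 0.1 * x.abs().max())
            self.counter += 1
            return x / self.buffer
        # First eval (or post-calibration) iteration: force the conversion
        # so `value` holds whatever statistics were collected so far.
        self.init_scale()
        return x / self.value
```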
Testing Summary
Risk Highlight
Checklist
Pull request is targeted at the dev branch.