A first sample version of FloatQuant
#159
base: feature/float_quant
…on can be found in the `Examples`. `±inf` values are clipped to `±max_val`. `±NaN` values are mapped to `±NaN`. Zero is always representable. I tested with subnormals (meaning subnormals of the output representation) and the quantizer represented them with no loss (I did not test this part extensively, though). I tested the function against the Brevitas `FloatQuant` implementation: they do not always match. For example, I think `0.3125` should be representable (`x == xq`) by a float quantizer with 4 bits for the mantissa, 4 bits for the exponent, 0 bias and 1 bit for the sign. The Brevitas `FloatQuant` implementation quantizes it to `0.25`. I am not sure which result I should consider correct in this case.
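To make the `0.3125` case concrete, here is a minimal sketch that enumerates the non-negative values of a small float format. It is not the code from this PR and not the Brevitas implementation. It assumes an IEEE-754-style convention (stored exponent 0 encodes subnormals, no exponent code reserved for inf/NaN), and the helper names `representable_values` and `neighbours` are hypothetical. Under that assumption, whether `0.3125` is representable depends on how the bias is interpreted: with bias 0 it falls exactly halfway between `0.25` and `0.375`, where a round-half-to-even rule would pick `0.25`; with bias 7 it is representable exactly.

```python
from fractions import Fraction


def representable_values(exponent_bits, mantissa_bits, bias):
    """Enumerate the non-negative values of an IEEE-754-style minifloat.

    Assumed convention (not confirmed by the PR): stored exponent 0 encodes
    subnormals, and no exponent code is reserved for inf/NaN.
    """
    values = set()
    for e in range(2 ** exponent_bits):
        for m in range(2 ** mantissa_bits):
            frac = Fraction(m, 2 ** mantissa_bits)
            if e == 0:
                # Subnormal: no implicit leading one, exponent fixed at 1 - bias.
                values.add(Fraction(2) ** (1 - bias) * frac)
            else:
                # Normal: implicit leading one.
                values.add(Fraction(2) ** (e - bias) * (1 + frac))
    return sorted(values)


def neighbours(x, values):
    """Return the closest representable values below and above x (inclusive)."""
    lower = max(v for v in values if v <= x)
    upper = min(v for v in values if v >= x)
    return lower, upper


x = Fraction(5, 16)  # 0.3125
for bias in (0, 7):
    vals = representable_values(4, 4, bias)
    lower, upper = neighbours(x, vals)
    print(bias, x in vals, float(lower), float(upper))
# prints:
# 0 False 0.25 0.375
# 7 True 0.3125 0.3125
```

Exact rationals are used so that the membership test is not affected by floating-point rounding in the host language.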
Brevitas developer here, thanks for this concrete example - I will look into it ASAP!
Hi @nickfraser
Yes, please do - if you can provide minimal examples, this will make it much easier for us as well 🙏. Note, if you have proposed solutions, feel free to make PRs as well (pointing to …
Co-authored-by: Nicolo Ghielmetti <[email protected]>
… provided. Some other tests have been added
exponent_bias=None,
max_val=None,
rounding_mode="ROUND",
lt_subnorm_to_zero=False,
@maltanar, this name is terrible! Please help me find a better one 🤦‍♂️
Please consider that the function should be tested more extensively.
Sample `FloatQuant` function implemented. A sample use of the function can be found in the `Examples`. `±inf` values are clipped to `±max_val`. `±NaN` values are mapped to `NaN`. Zero is always representable. I tested with subnormals (meaning subnormals of the output representation) and the quantizer represented them with no loss (I did not test this part extensively, though). I tested the function against the Brevitas `FloatQuant` implementation: they do not always match. For example, I think `0.3125` should be representable (`x == xq`) by a float quantizer with 4 bits for the mantissa, 4 bits for the exponent, 0 bias and 1 bit for the sign. The Brevitas `FloatQuant` implementation quantizes it to `0.25`. I am not sure which result I should consider correct in this case.
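The special-value handling described above (infinities clipped to `±max_val`, NaN passed through, zero preserved) can be sketched for a single scalar as follows. This is only an illustration of the stated behaviour, not the function from this PR: the helper name `clamp_special` is hypothetical, and the `max_val` used in the checks is an arbitrary number.

```python
import math


def clamp_special(x, max_val):
    """Clip x to [-max_val, max_val]; NaN is returned unchanged."""
    if math.isnan(x):
        return x
    return max(-max_val, min(max_val, x))


max_val = 240.0  # arbitrary value, chosen only for this illustration
print(clamp_special(math.inf, max_val))
print(clamp_special(-math.inf, max_val))
print(clamp_special(0.0, max_val))
print(math.isnan(clamp_special(math.nan, max_val)))
```

The NaN check comes first because comparisons against NaN are always false, so relying on `min`/`max` alone would make the result depend on argument order.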