Streamlining of Scaled Dot-Product Attention #12

Merged
20 commits merged into dev on Feb 6, 2025

Conversation

@iksnagreb commented Jan 20, 2025

iksnagreb added 17 commits April 3, 2024 15:12
Flips the order of the AbsorbSignBiasIntoMultiThreshold and
MoveScalarLinearPastInvariants streamlining transforms to prefer
absorbing adds into multi-thresholds instead of propagating them
downwards. This should prevent the accumulation of scalar adds in front
of two-input matmuls in scaled dot-product attention operators (they
cannot be moved past the matmul operation in that case).
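For context, a minimal sketch of the flipped ordering as it would appear in a streamlining sequence; the import paths follow the usual qonnx/FINN module layout and the model file name is hypothetical:

```python
from qonnx.core.modelwrapper import ModelWrapper
from qonnx.transformation.infer_shapes import InferShapes
from finn.transformation.streamline.absorb import AbsorbSignBiasIntoMultiThreshold
from finn.transformation.streamline.reorder import MoveScalarLinearPastInvariants

model = ModelWrapper("attention.onnx")  # hypothetical input model

# Absorb scalar adds into multi-thresholds before trying to move scalar
# linear ops downwards, so adds do not pile up in front of two-input MatMuls.
model = model.transform(AbsorbSignBiasIntoMultiThreshold())
model = model.transform(MoveScalarLinearPastInvariants())
# Re-run shape inference after the graph has been rewritten.
model = model.transform(InferShapes())
```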
The MoveScalarMulPastMatMul transformation can now handle matmul
operations with both inputs preceded by a scalar multiplication.

This change is required for streamlining scaled dot-product attention
operations, which are essentially two-input matmuls.
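The rewrite rests on a simple identity for scalar scales; a small numpy check (not part of the PR) illustrates it:

```python
import numpy as np

# For scalars a and b: (a * A) @ (b * B) == (a * b) * (A @ B),
# so both scalar Muls can be moved past the two-input MatMul and fused
# into a single scalar Mul behind it.
a, b = 0.125, 0.5
A = np.random.rand(4, 8).astype(np.float32)  # e.g. query input
B = np.random.rand(8, 4).astype(np.float32)  # e.g. transposed key input

before = (a * A) @ (b * B)   # scalar Muls in front of the MatMul
after = (a * b) * (A @ B)    # single scalar Mul moved past the MatMul
assert np.allclose(before, after)
```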
The assertions were too restrictive, causing the program to terminate in
cases where the streamlining simply encounters nodes to which the
transforms are not applicable: just skip those nodes instead.

Only the two transforms currently affecting the streamlining of scaled
dot-product attention have been changed.
This is pretty much a copy and paste of the existing test case, just
replacing the MatMul initializer with a second top input followed by a
scalar Mul.
Folding quantized initializers into add-like nodes did not respect the
order of inputs to the add node correctly. This is fixed by testing which
of the two possible orders is present and selecting the input indices
accordingly.

Shape inference following the transformation is fixed by deleting the
annotations instead of propagating them incorrectly. Deleting the shape
annotations should not hurt, as these are redone by running shape
inference after each transformation anyway.
Add is commutative and thus the export does not always generate the
initializer as the second input. However, this transformation always
assumed that order, failing via an assertion if the inputs were simply
ordered differently. The transformation now handles both possible input
orderings.
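Both of these fixes boil down to the same check. A minimal sketch, assuming the qonnx ModelWrapper API (get_initializer returns None for dynamic inputs) and using illustrative names:

```python
def select_add_input_indices(model, add_node):
    # Determine which Add input carries the initializer and which is the
    # dynamic input, instead of assuming a fixed order.
    if model.get_initializer(add_node.input[1]) is not None:
        return 0, 1  # (dynamic input index, initializer index)
    if model.get_initializer(add_node.input[0]) is not None:
        return 1, 0
    return None  # no initializer input: transformation not applicable
```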
This is required for streamlining packed input projections of multi-head
scaled dot-product attention. Adds support for Squeeze and Unsqueeze as
well. Skips moving fork-node producers, as this is not handled correctly;
however, the same effect can be attained by applying the
MoveLinearPastFork transformation first (see the sketch below).
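A short sketch of the suggested ordering, assuming the transform being extended here is MoveScalarLinearPastInvariants and the usual FINN import path (both are assumptions, not confirmed by the diff shown on this page):

```python
from finn.transformation.streamline.reorder import (
    MoveLinearPastFork,
    MoveScalarLinearPastInvariants,
)

# Distribute the linear op over the fork first; afterwards the producer is
# no longer a fork node and can be moved past Reshape/Transpose/(Un)Squeeze.
model = model.transform(MoveLinearPastFork())
model = model.transform(MoveScalarLinearPastInvariants())
```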
Explicitly rejects absorbing into fork-nodes. Previously, this probably
would have failed, silently resulting in a wrong model; it is unclear
whether this happened in any practically relevant models.
This is probably still rather sketchy, but at least it tries to check
the data layout annotation. For now this seems to be enough for getting
the thresholds of multi-head attention right, provided qonnx properly
annotates the 3D layouts.
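A rough sketch of such a layout check, assuming qonnx's tensor layout annotation API and its data layout constants (NWC/NCW for 3D tensors); the function name is illustrative:

```python
import qonnx.core.data_layout as DataLayout

def channel_axis_3d(model, tensor_name):
    # Derive the channel axis of a 3D tensor from its layout annotation,
    # which determines how per-channel thresholds must be broadcast.
    layout = model.get_tensor_layout(tensor_name)
    if layout == DataLayout.NWC:
        return 2
    if layout == DataLayout.NCW:
        return 1
    return None  # unknown or missing annotation: reject the transform
```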
@iksnagreb self-assigned this Jan 28, 2025
@iksnagreb requested a review from fpjentzsch on January 28, 2025
@iksnagreb marked this pull request as ready for review on January 28, 2025
@fpjentzsch

@iksnagreb Could you fix the conflicts (especially the one in qonnx_activation_handlers.py) by merging dev into this?

@iksnagreb (Author)

Conflicts should be resolved now

@iksnagreb merged commit 1e3085f into dev on Feb 6, 2025
0 of 2 checks passed