Transformer in FINN: Scaled Dot-Product Attention #13

iksnagreb · 2025-01-20T15:19:29Z

Adds support for multi-head scaled dot-product attention, i.e., the core operation of a Transformer, to FINN. This includes compiler integration of hardware operators for the attention mechanism and for multi-head splitting and merging, as well as related graph transformations. Heavily depends on the related streamlining of scaled dot-product attention: #12
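For orientation, the operation this PR targets can be sketched in plain floating-point Python. This is only a reference sketch of the textbook formula softmax(Q K^T / sqrt(d)) V with head splitting and merging along the embedding dimension. It is not the quantized HLS operator and not the FINN custom op, and all function names below are hypothetical.

```python
import math


def softmax(row):
    # Subtract the row maximum before exponentiating for numerical stability
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    # Plain list-of-lists matrix product: (n x k) @ (k x m) -> (n x m)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def attention(q, k, v):
    # Single head: softmax(Q K^T / sqrt(d)) V, with d the per-head embedding size
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)


def split_heads(x, heads):
    # Split the embedding (last) dimension into `heads` equally sized chunks
    d = len(x[0]) // heads
    return [[row[h * d:(h + 1) * d] for row in x] for h in range(heads)]


def merge_heads(xs):
    # Concatenate the per-head outputs along the embedding dimension
    return [sum((x[i] for x in xs), []) for i in range(len(xs[0]))]


def multi_head_attention(q, k, v, heads):
    per_head = zip(split_heads(q, heads), split_heads(k, heads), split_heads(v, heads))
    return merge_heads([attention(qh, kh, vh) for qh, kh, vh in per_head])
```

Splitting followed by merging is the identity on the data layout, which is why the two can be treated as separate operators around the per-head attention.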

Add attention-hlslib dependency to fetch-repos.sh, see https://github.com/iksnagreb/attention-hlslib
Figure out how to integrate the Brevitas modifications...
There are probably some undocumented fixes/modifications lying around on some other branches...

To support a complete Transformer, the following PRs must be merged:

WIP: Merge branch for testing the integration of all the Transformer related PRs until they are fully merged into dev: https://github.com/eki-project/finn-plus/tree/transformer

Note: There is currently no method for optimizing the accumulator width of either the HLSCustomOp or the Python simulation. To make the tests pass, both must therefore be set manually to the maximum possible accumulator bit width. Applying the MinimizeAccumulatorWidth transform would cause the HLS and Python operator behavior to diverge.
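As a rough illustration of what "maximum possible accumulator bit width" means, the following sketch computes a generic worst-case bound for a dot product of two's-complement operands. It assumes both operands are signed; the function name and the example numbers are hypothetical, and this is neither the attention operator's own rule nor the MinimizeAccumulatorWidth transform.

```python
def worst_case_acc_bits(a_bits, b_bits, length):
    # Largest magnitude of one product of signed operands: 2^(a-1) * 2^(b-1),
    # reached when both operands take their most negative value
    max_product = (1 << (a_bits - 1)) * (1 << (b_bits - 1))
    # Worst case for the whole dot product: every term hits that magnitude
    max_sum = length * max_product
    # Bits to hold +max_sum as a magnitude, plus one sign bit
    return max_sum.bit_length() + 1


# Example: 4-bit x 4-bit operands accumulated over 8 elements
acc_bits = worst_case_acc_bits(4, 4, 8)
```

Setting both the HLS operator and the Python simulation to such a bound keeps them from overflowing differently, at the cost of wider accumulators than the actual data may need.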

Note: This does not yet control the memory used by the internal threshold operations, nor the resource type used for implementing the floating-point operations within the softmax. Both are still handled by the tools' automatic strategy.

Instead of manually squeezing all shapes, explicit Squeeze and Unsqueeze operations are inserted into the graph before deleting and redoing all shape annotations from scratch. This should be more robust and keeps the interface (data layout) the model exposes to the outside. Im2Col operations are wrapped in Unsqueeze-Squeeze operators to shield them from squeezing, as Im2Col always operates on 4-dimensional layouts.
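The effect on shapes can be sketched with plain tuples. The helpers below only mimic the shape semantics of the ONNX Squeeze and Unsqueeze operators for non-negative axes; they are hypothetical stand-ins, not the graph transformation from this PR, and the example shapes are made up.

```python
def squeeze(shape, axes=None):
    # Without axes, drop every dimension of size 1
    if axes is None:
        return tuple(d for d in shape if d != 1)
    for a in axes:
        if shape[a] != 1:
            raise ValueError(f"cannot squeeze axis {a} of size {shape[a]}")
    return tuple(d for i, d in enumerate(shape) if i not in set(axes))


def unsqueeze(shape, axes):
    # Axes refer to positions in the output shape (non-negative axes only)
    out = list(shape)
    for a in sorted(axes):
        out.insert(a, 1)
    return tuple(out)


# A squeezed model carries 3-dimensional (H, W, C) data between operators
squeezed = squeeze((1, 8, 8, 3))
# Wrapping a 4-dimensional operator: restore the batch axis before it ...
im2col_input = unsqueeze(squeezed, [0])
# ... and remove it again afterwards so downstream shapes stay squeezed
restored = squeeze(im2col_input, [0])
```

The wrapped operator still sees a 4-dimensional input, while every other node in the graph works on the squeezed layout.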

iksnagreb added 30 commits April 3, 2024 15:21

Start sketching out the scaled dot-product attention custom op

9e7a475

Currently this is not a HLSCustomOp, but a QONNX CustomOp. Implemented are first operator attributes, ONNX graph/model construction and a rather improvised python mode node execution for debugging.

[Attention] Add __init__ method to custom op

7f97332

[Attention] Add datatype and shape queries to custom op

e77ad2b

[Attention] Add stream/bit-width queries to custom op

c95b397

[Attention] Add refactored node attributes matching HLS op template

4a0e98e

[Attention] Adapt the custom op to the new folding concept

c3ea73e

[Attention] Fix get_ap_int_max_w output and mask stream width

602f1ca

[Attention] Start filling some of the HLSCustomOp abstract methods

ad17b1b

[Attention] Fill out includes and defines for C++ code generation

0de1bce

[Attention] Add IP generation C++ source generation step to test

de9dc73

[Attention] Add some interface pragmas for C++ code generation

f21a47c

[Attention] Add stream declarations for C++ simulation code generation

8e94cfe

[Attention] Add attention function body to C++ code generation

03ddfb2

[Attention] Add C++ simulation code feeding the input streams from files

295ab25

[Attention] Add C++ simulation code saving the output stream to file

b6a26e1

[Attention] Add missing "" to generated C++ strings

5d800e7

[Attention] Add missing bit width cases to get_ap_int_max_w

906a8c5

Some clean up and "# noqa" to calm the IDE

acaa9b2

[Attention] Get C++ simulation to compile and prepare inputs

b41575d

[Attention] Move dummy model wrapper construction out of custom op

a718bf6

[Attention] Refactor the cppsim unit test using thresholds in python sim

189a415

This causes the C++ simulation to fail as multithreshold activations are not implemented on the HLS side yet.

[Attention] Switch to the HLS function-call operator style

b00c64a

[Attention] Refactor towards thresholds HLS code generation

094f920

[Attention] Generate HLS code for all three activation thresholds

5d2836a

Note: The threshold parameters are generated and included but not connected to the attention operator yet. The attention operator uses uninitialized thresholds of the same type and shape.

[Attention] Initialize the attention operator using generated thresholds

76e5e0e

[Attention] Numpy softmax matching overflow behavior of the HLS operator

b152f23

[Attention] Satisfy attention output type constraint

8bd5a20

[Attention] Increase test bitwidth to see some more interesting behavior

65de26d

[Attention] Remove python mode node execution

ce1e19b

iksnagreb and others added 22 commits April 26, 2024 15:54

[Streamline] Fix eager access to potentially empty successors list

15246c8

[Attention] Implement get_exp_cycles for attention-related HWCustomOps

5eda0f6

Add support for ReplicateStream_hls as a PE-operation to SetFolding

0b00f69

[Attention] Add method to get the number of folded inputs

7fba682

[Attention] Make use of resource type attributes for embedded thresholds

174c098

[Attention] Add resource attribute for the attention mask in const mode

4f7072b

[Attention] Refactor RAM_STYLES dictionary

2b9d94b

[Attention] Redirect RTL simulation of attention to Python execution

b5bd0ff

This is a temporary solution to get at least node-by-node RTL simulation of models working by simply skipping the attention operator.

[Attention] Add missing constant mask mode to input shape query

aa742c7

[Attention] Fix Resource::URAM typo

2bf164a

[Attention] Add data layout checks to InferMultiHeads transformation

95f29b0

Fix SplitMultiHeads shape inference if shape is None

ca6cc33

The inferred shape is not taken from the model graph but from the node attributes specifying the shape.

[Streamline] Allow RemoveIdentityReshape for fork-nodes

9f90cce

[Streamline] Prevent MoveTransposePastEltwise from transposing scalars

5548b49

Add V80 to Alveo part_map

28255c3

add V80 similar to other Versal Parts

d2e89df

add corrected spacing

ba0261f

[Builder] Relax requirements to derive fpga part for specific board

65a83b2

Merge pull request Xilinx#1262 from jsmonson/feature/add-alveo-v80-to…

88e207e

…-part-map Add V80 to Alveo part_map

Merge remote-tracking branch 'xilinx/dev' into feature/attention

a98f594

iksnagreb requested review from DeepCowProductions, fpjentzsch and bwintermann January 20, 2025 16:43

iksnagreb added 4 commits January 20, 2025 17:49

[Deps] Add attention-hlslib dependency to fetch-repos.sh

15963e0

Make Squeeze interact properly with Im2Col, Split and initializers

6d56c61

[Streamline] Fix MoveTransposePastEltwise permutation

50544ef

[Deps] Update attention-hlslib dependency

6cee1ec

iksnagreb self-assigned this Jan 28, 2025

iksnagreb commented Jan 20, 2025 • edited by fpjentzsch