Hello,
First, nice job on the work; I think this is a really interesting paper, with a lot of potential to enable further theoretical investigations into deep attention mechanisms.
Looking into the code, I noticed that the final block of SOFT uses a normal softmax attention layer. Is there a reason for this? Also, did you notice any quantitative or qualitative differences between the attention heatmaps produced by this regular softmax layer and those produced by the other, approximated softmax-free attention layers?
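Just to make sure we are talking about the same two kinds of layers, here is a rough illustrative sketch of the contrast I mean. It is not the repo's actual code: the exact Gaussian-kernel scaling, the tied query/key projection, and the omitted Nyström low-rank step are my assumptions based on my reading of the paper.

```python
import torch

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention
    # (what I see in the final block).
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v

def softmax_free_attention(q, k, v):
    # Softmax-free attention as I understand it from the paper:
    # the attention matrix is a Gaussian kernel over query/key distances.
    # Shown here in full (quadratic) form; the Nyström-style low-rank
    # approximation used in SOFT is omitted for brevity.
    d = q.shape[-1]
    dist_sq = torch.cdist(q, k) ** 2              # pairwise squared distances
    attn = torch.exp(-dist_sq / (2 * d ** 0.5))   # unnormalised Gaussian kernel
    return attn @ v

# Toy usage: one head, 4 tokens, dim 8.
# As far as I understand, SOFT ties queries and keys (q == k).
q = k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)
print(softmax_attention(q, k, v).shape, softmax_free_attention(q, k, v).shape)
```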
Thanks in advance for your time and work.