Hello,
First, nice job on the work; I think this is a really interesting paper, with a lot of potential to enable further theoretical investigations into deep attention mechanisms.
Looking into the code, I noticed that the final block of SOFT uses a normal softmax attention layer. Is there a reason for this? Also, did you notice any quantitative or qualitative differences between the attention heatmaps produced by this regular softmax layer and those produced by the other, approximated softmax-free attention layers?
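Just to make sure we are talking about the same two kinds of layers, here is a rough illustrative sketch of the contrast I mean. It is not the repo's actual code: the exact Gaussian-kernel scaling, the tied query/key projection, and the omitted Nyström low-rank step are my assumptions based on my reading of the paper.

```python
import torch

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention
    # (what I see in the final block).
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v

def softmax_free_attention(q, k, v):
    # Softmax-free attention as I understand it from the paper:
    # the attention matrix is a Gaussian kernel over query/key distances.
    # Shown here in full (quadratic) form; the Nyström-style low-rank
    # approximation used in SOFT is omitted for brevity.
    d = q.shape[-1]
    dist_sq = torch.cdist(q, k) ** 2              # pairwise squared distances
    attn = torch.exp(-dist_sq / (2 * d ** 0.5))   # unnormalised Gaussian kernel
    return attn @ v

# Toy usage: one head, 4 tokens, dim 8.
# As far as I understand, SOFT ties queries and keys (q == k).
q = k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)
print(softmax_attention(q, k, v).shape, softmax_free_attention(q, k, v).shape)
```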
Thanks in advance for your time and work.