
Commit

add more docs
lucidrains committed Nov 25, 2020
1 parent 62b7885 commit a8d6892
Showing 2 changed files with 23 additions and 1 deletion.
24 changes: 23 additions & 1 deletion README.md
@@ -363,6 +363,28 @@ model = TransformerWrapper(
)
```

### Attention on Attention for Image Captioning

https://arxiv.org/abs/1908.06954

This paper proposes adding a gated linear unit at the end of the attention layer, gated further by the original queries. Although the technique is not widely used outside of visual question answering, I suspect it should lead to improvements, given the success of the feedforward GLU variant.

```python
import torch
from x_transformers import TransformerWrapper, Decoder, Encoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Encoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_on_attn = True  # gate output of attention layer, by queries
    )
)
```
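
For intuition, here is a minimal sketch of the gating step described in the paper: the attended values are concatenated with the original queries, projected to an "information" vector and a gate, and the sigmoid gate filters the information. Names and shapes below are illustrative assumptions, not the exact implementation inside `x-transformers`.

```python
import torch
from torch import nn

class AttentionOnAttention(nn.Module):
    # sketch of the AoA gating from the paper; illustrative, not the library's internals
    def __init__(self, dim):
        super().__init__()
        # one projection yields both the "information" vector and the gate
        self.to_info_and_gate = nn.Linear(dim * 2, dim * 2)

    def forward(self, queries, attended):
        # queries:  (batch, seq, dim) - original attention queries
        # attended: (batch, seq, dim) - standard attention output for those queries
        x = torch.cat((queries, attended), dim = -1)
        info, gate = self.to_info_and_gate(x).chunk(2, dim = -1)
        # sigmoid gate, conditioned on the queries, filters the attended information
        return info * gate.sigmoid()

aoa = AttentionOnAttention(dim = 512)
out = aoa(torch.randn(1, 1024, 512), torch.randn(1, 1024, 512)) # (1, 1024, 512)
```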

## Todo

To be explained and documented
@@ -375,7 +375,7 @@ To be explained and documented
- [x] ~~topk attention - Explicit Sparse Attention~~
- [x] ~~entmax15 instead of softmax - Adaptively Sparse Transformers~~
- [x] ~~mixing head information - Noam's Talking Heads~~
- [x] gating multi-head attention output - Attention on Attention
- [x] ~~gating multi-head attention output - Attention on Attention~~
- [x] simplified relative positional encoding bias - T5
- [x] sandwich transformer - Reordering Sublayers
- [x] encoder with downsampling and unet-like residual - Funnel Transformer
Binary file added images/attention-on-attention.png
