
Commit: add talking heads

lucidrains committed Nov 25, 2020
1 parent 0db1861 commit 62b7885
Showing 2 changed files with 45 additions and 3 deletions.
README.md: 48 changes (45 additions & 3 deletions)

@@ -316,7 +316,49 @@ model = TransformerWrapper(
        dim = 512,
        depth = 6,
        heads = 8,
-       sparse_topk = 8 # keep only the top 8 values before attention (softmax)
+       attn_sparse_topk = 8 # keep only the top 8 values before attention (softmax)
    )
)
```
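
For intuition, the top-k sparsification above amounts to thresholding each query's pre-softmax logits at its k-th largest value, as in the Explicit Sparse Attention paper referenced in the checklist below. A minimal sketch of that idea follows; the function name and tensor shapes are illustrative assumptions, not the library's internals.

```python
import torch

# hypothetical helper illustrating top-k sparse attention over raw logits
def topk_sparse_attend(scores, k = 8):
    # scores: (batch, heads, query_len, key_len) pre-softmax attention logits
    top_values, _ = scores.topk(k, dim = -1)
    kth_value = top_values[..., -1:]                  # k-th largest logit per query
    masked = scores.masked_fill(scores < kth_value, float('-inf'))
    return masked.softmax(dim = -1)                   # masked keys get exactly zero weight

attn = topk_sparse_attend(torch.randn(2, 8, 64, 64), k = 8)
```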

Alternatively, if you would like to use `entmax15` in place of softmax, you can do so with a single setting, as shown below.

```python
import torch
from x_transformers import TransformerWrapper, Decoder, Encoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_use_entmax15 = True  # use entmax15 for attention step
    )
)
```
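
For a feel of what this flag changes, `entmax15` is a drop-in replacement for the attention softmax that can assign exactly zero weight to low-scoring keys. The snippet below applies it directly to raw attention logits; it assumes the standalone `entmax` package (`pip install entmax`), and the tensor shapes are illustrative.

```python
import torch
from entmax import entmax15  # assumes the standalone `entmax` package is installed

# dummy pre-softmax attention logits: (batch, heads, query_len, key_len)
scores = torch.randn(1, 8, 16, 16)

# entmax15 replaces softmax over the key dimension and yields sparse attention maps
attn = entmax15(scores, dim = -1)

print(attn.sum(dim = -1)[0, 0, 0])        # each row still sums to 1
print((attn == 0).float().mean().item())  # fraction of keys receiving exactly zero weight
```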

### Talking-Heads Attention

<img src="./images/talking-heads.png" width="500px"></img>

https://arxiv.org/abs/2003.02436

A Noam Shazeer paper that proposes mixing information between the attention heads both before and after the softmax. This comes at the cost of extra memory and compute.

```python
import torch
from x_transformers import TransformerWrapper, Decoder, Encoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_talking_heads = True  # turn on information exchange between attention heads
    )
)
```
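
Concretely, the head mixing boils down to two small learned heads-by-heads matrices: one blends the attention logits across heads before the softmax, the other blends the attention weights after it. Below is a minimal, self-contained sketch of just that step; the class and parameter names are illustrative assumptions, not the library's internal API.

```python
import torch
from torch import nn, einsum

class TalkingHeadsMix(nn.Module):
    # sketch: learned linear maps across the heads dimension, applied to the
    # attention logits before the softmax and to the attention weights after it
    def __init__(self, heads):
        super().__init__()
        self.pre_softmax_proj = nn.Parameter(torch.randn(heads, heads))
        self.post_softmax_proj = nn.Parameter(torch.randn(heads, heads))

    def forward(self, scores):
        # scores: (batch, heads, query_len, key_len) pre-softmax attention logits
        scores = einsum('b h i j, h k -> b k i j', scores, self.pre_softmax_proj)
        attn = scores.softmax(dim = -1)
        attn = einsum('b h i j, h k -> b k i j', attn, self.post_softmax_proj)
        return attn

mix = TalkingHeadsMix(heads = 8)
out = mix(torch.randn(2, 8, 16, 16))  # (2, 8, 16, 16)
```

In this sketch the only new parameters are the two heads-by-heads matrices, so the extra cost comes from the two additional einsums over the full attention map.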
@@ -331,8 +373,8 @@ To be explained and documented
- [x] ~~feedforward gated linear variant - Noam's GLU Variants~~
- [x] ~~rezero - Rezero is all you need~~
- [x] ~~topk attention - Explicit Sparse Attention~~
- - [x] entmax15 instead of softmax - Adaptively Sparse Transformers
- - [x] mixing head information - Noam's Talking Heads
+ - [x] ~~entmax15 instead of softmax - Adaptively Sparse Transformers~~
+ - [x] ~~mixing head information - Noam's Talking Heads~~
- [x] gating multi-head attention output - Attention on Attention
- [x] simplified relative positional encoding bias - T5
- [x] sandwich transformer - Reordering Sublayers
Binary file added images/talking-heads.png