For full details about the model, see the paper Adaptive Attention Span for Transformers.
To reproduce our baselines, follow these instructions:
Set up your environment:

- Install PyTorch version 1.2+ (one possible setup is sketched below)
- `cd adaptive-span`
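For reference, a minimal setup might look like the following. This is a sketch, not official install instructions: the repository URL is a placeholder assumption, and you should pick the PyTorch build matching your CUDA version.

```bash
# Clone the repository (URL is an assumption; use the repo this README belongs to)
git clone https://github.com/your-org/adaptive-span.git
cd adaptive-span

# Install PyTorch 1.2+ (choose the wheel matching your CUDA setup for GPU support)
pip install "torch>=1.2"
```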
Download `trnews-64.zip` from the v1.0 release of mukayese-datasets, then move and extract it:
```bash
mv /your/download/path/trnews-64.zip data/enwik8/
cd data/enwik8/
unzip trnews-64.zip
mv trnews-64.test.raw test.txt
mv trnews-64.train.raw train.txt
mv trnews-64.val.raw valid.txt
python prep_enwik8.py
cd ../../
```
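Optionally, a quick sanity check that the three splits landed where the training script expects them (the exact byte counts depend on the release, so all that matters here is that the files are non-empty):

```bash
# All three files should exist and be non-empty
wc -c data/enwik8/train.txt data/enwik8/valid.txt data/enwik8/test.txt
```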
Run training and evaluation:

```bash
./experiments/trnews-64.sh
```
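Since the run is long, you may want it to survive a dropped SSH session. One common way (the log file name is arbitrary):

```bash
# Run in the background, immune to hangups, and capture all output to a log
nohup ./experiments/trnews-64.sh > trnews-64.log 2>&1 &
tail -f trnews-64.log
```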
This will take roughly 72 hours on an instance with a single V100 GPU.
Results:

- Number of parameters: ~38.46M
- val: 1.020 bpc
- test: 1.024 bpc
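For intuition, bits-per-character relates to character-level perplexity by ppl = 2^bpc (a standard identity, not something the scripts print), so the test score above corresponds to a per-character perplexity of about 2.03:

```bash
# 2 ** bpc = character-level perplexity; 2 ** 1.024 ≈ 2.03
python -c "print(2 ** 1.024)"
```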
Download `trwiki-67.zip` from the v1.0 release of mukayese-datasets, then move and extract it:
```bash
mv /your/download/path/trwiki-67.zip data/wikitext-103/
cd data/wikitext-103/
unzip trwiki-67.zip
rm trwiki-67.zip
mv trwiki-67.train.tokens train.txt
mv trwiki-67.valid.tokens valid.txt
mv trwiki-67.test.tokens test.txt
cd ../../
```
Run training and evaluation:

```bash
./experiments/trwiki-67.sh
```
This will take roughly 72 hours on an instance with a single V100 GPU.
Results:

- Number of parameters: ~92.31M
- val: 15.09 ppl
- test: 14.64 ppl
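Similarly, word-level perplexity is the exponential of the average cross-entropy in nats, so the test score above corresponds to a loss of roughly 2.68 nats per token (again a standard identity, shown only for intuition):

```bash
# ln(ppl) = average cross-entropy in nats; ln(14.64) ≈ 2.68
python -c "import math; print(math.log(14.64))"
```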