BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Model Name | Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training | Pre-Train Data | Batch Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BART | Encoder-Decoder (Transformer) | Reconstruction loss: standard cross-entropy between the decoder output and the original document, while the encoder sees the corrupted document (several corruption variants are explored; see below) | Same BPE encoding as GPT-2 | Same as GPT? Or RoBERTa? | Same as GPT? Or RoBERTa? | Same as GPT? Or RoBERTa? | Same as the original Transformer | GeLU | | | 160 GB of data, similar to Liu et al. 2019 (RoBERTa) | batch_size=8K (for large-scale experiments) |
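One way to sanity-check the uncertain cells above is to inspect the released checkpoints. A minimal sketch, assuming the Hugging Face `transformers` library and the `facebook/bart-large` checkpoint (neither is part of the paper itself):

```python
from transformers import BartConfig, BartTokenizer

config = BartConfig.from_pretrained("facebook/bart-large")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

print(config.vocab_size)            # byte-level BPE vocab shared with GPT-2/RoBERTa (~50K)
print(config.activation_function)   # "gelu"
print(config.encoder_layers, config.decoder_layers)  # 12-layer encoder, 12-layer decoder for BART-large
print(tokenizer.tokenize("Denoising sequence-to-sequence pre-training"))
```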
Basically, the authors set out to combine the bi-directional encoder (BERT) and the auto-regressive decoder (GPT) in one model. Their hypothesis is that because BERT is trained to predict randomly masked tokens using bi-directional context, it cannot easily be used for text generation; conversely, GPT is designed for text generation but can only condition on leftward context, which limits it on other understanding tasks. To set up the problem, a noising function is used to corrupt the original text, and the model learns to reconstruct the original. The authors explore several different noising functions. BART performs as well as RoBERTa on GLUE and reaches SOTA on several generation tasks. Further, a new fine-tuning scheme for machine translation is introduced, in which additional encoder layers are stacked on top of BART.
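As a concrete, hedged illustration of the reconstruction objective: the encoder receives the corrupted document, and the loss is the decoder's token-level cross-entropy against the original document. The sketch below assumes the Hugging Face `transformers` API and the `facebook/bart-base` checkpoint, which the paper itself does not use.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

original  = "The quick brown fox jumps over the lazy dog."
corrupted = "The quick <mask> jumps over the lazy dog."   # e.g. text infilling

# Encoder input is the corrupted document; labels are the original document.
inputs = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids

out = model(input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            labels=labels)
out.loss.backward()   # one pre-training step (optimizer update) would follow
```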
Document corruption: the following noising functions are applied to the encoder's input documents (a sketch of each appears after this list):
- Token masking: Random tokens are sampled and replaced with MASK token.
- Token deletion: Random tokens are deleted.
- Text infilling: Spans of text with lengths drawn from a Poisson distribution are sampled, and each span is replaced with a single MASK token.
- Sentence permutation: The document is split into sentences, and the sentences are shuffled into a random order.
- Document rotation: A token is chosen uniformly at random, and the document is rotated so that it begins with that token.
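A lightweight sketch of these five noising functions over a whitespace-tokenized document. This is an illustrative re-implementation based on the paper's description (span lengths for infilling drawn from a Poisson distribution), not the authors' code; the rates and λ value are placeholder choices.

```python
import random
import numpy as np

MASK = "<mask>"

def token_masking(tokens, p=0.15):
    # Replace randomly sampled tokens with the mask symbol.
    return [MASK if random.random() < p else t for t in tokens]

def token_deletion(tokens, p=0.15):
    # Delete randomly sampled tokens; the model must infer which positions are missing.
    return [t for t in tokens if random.random() >= p]

def text_infilling(tokens, p=0.15, lam=3.0):
    # Replace spans (lengths drawn from Poisson(lam), possibly 0) with a single mask each.
    # A 0-length span corresponds to inserting a mask without removing any tokens.
    out, i = [], 0
    while i < len(tokens):
        if random.random() < p:
            span = np.random.poisson(lam)
            out.append(MASK)   # one mask token regardless of span length
            i += span          # skip `span` original tokens (0 => pure insertion)
        else:
            out.append(tokens[i])
            i += 1
    return out

def sentence_permutation(sentences):
    # Shuffle the document's sentences into a random order.
    shuffled = list(sentences)
    random.shuffle(shuffled)
    return shuffled

def document_rotation(tokens):
    # Pick a token uniformly at random and rotate the document to start there.
    k = random.randrange(len(tokens))
    return tokens[k:] + tokens[:k]

doc = "BART maps corrupted documents back to the originals .".split()
print(text_infilling(doc))
print(document_rotation(doc))
```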
The picture says it all: BART is essentially a composition of BERT (bidirectional encoder) and GPT (autoregressive decoder).
(from original paper)
This figure shows the different ways the input document is corrupted before being fed to the encoder.
(from original paper)
BART's fine-tuning procedure differs between classification and neural machine translation: for the latter, an additional randomly initialized encoder is stacked in place of BART's embedding layer.
(from original paper)
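For classification, the same uncorrupted input is fed to both the encoder and the decoder, and a representation from the final decoder state is classified. A minimal fine-tuning sketch, assuming Hugging Face's `BartForSequenceClassification` convenience head (not the paper's code); the machine-translation setup, where a new source encoder replaces BART's embedding layer, is not shown here.

```python
import torch
from transformers import BartForSequenceClassification, BartTokenizer

model = BartForSequenceClassification.from_pretrained("facebook/bart-base", num_labels=2)
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

batch = tokenizer(
    ["BART combines a bidirectional encoder with an autoregressive decoder."],
    return_tensors="pt",
)
labels = torch.tensor([1])

out = model(**batch, labels=labels)   # the same text is routed to encoder and decoder internally
out.loss.backward()                   # an optimizer step would follow during fine-tuning
print(out.logits.shape)               # (batch_size, num_labels)
```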