Even though large pre-trained multilingual models (e.g. mBERT, XLM-R) have led to significant performance gains on a wide range of cross-lingual NLP tasks, success on many downstream tasks still relies on the availability of sufficient annotated data. Traditional fine-tuning of pre-trained models using only a few target samples can cause over-fitting, which is quite limiting since most of the world's languages are under-resourced. In this work, we investigate cross-lingual adaptation using a simple nearest-neighbour few-shot (<15 samples) inference technique for classification tasks. We experiment with a total of 16 distinct languages across two NLP tasks: XNLI and PAWS-X. Our approach consistently improves over traditional fine-tuning using only a handful of labeled samples in target locales, and we also demonstrate its generalization capability across tasks.
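To give a feel for the idea, here is a minimal sketch of nearest-neighbour few-shot classification. This is an illustration, not the repository's implementation: the paper's approach uses XLM-R sentence embeddings, while the sketch below stands in plain NumPy vectors, and the function name `nn_classify` and the toy support set are hypothetical.

```python
import numpy as np

def nn_classify(query, support_embs, support_labels):
    """Label a query by its nearest support example under cosine similarity."""
    q = query / np.linalg.norm(query)
    s = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    sims = s @ q  # cosine similarity of query to each support embedding
    return support_labels[int(np.argmax(sims))]

# Toy example: a tiny (<15 samples) labeled support set in the target locale.
support = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
labels = np.array(["entailment", "entailment", "contradiction"])
print(nn_classify(np.array([0.1, 0.9]), support, labels))  # -> contradiction
```

Because inference only compares the query against the handful of labeled target samples, no gradient updates are needed, which is what avoids the over-fitting that plagues fine-tuning on so few examples.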
This repository contains the code for *Nearest Neighbour Few-Shot Learning for Cross-lingual Classification*, along with instructions for running the Nearest Neighbour Few-Shot Learning approach.
We provide a `*.yml` file of our conda environment. Create the environment with:

```bash
conda env create -f scripts/few-shot.yml
```
To download the XNLI and PAWS-X datasets, run:

```bash
bash scripts/download_data.sh
```
Model | Description | Dataset | Checkpoints
---|---|---|---
XLM-R large | Full model fine-tuning with English data | XNLI | Please Contact 1/2
XLM-R large | Full model fine-tuning with English data | PAWS-X | Please Contact 1/2
Inside the project, create a folder named `dumped`:

```bash
mkdir -p dumped
```

Move the downloaded pretrained models to `dumped`:

```bash
mv pawsx-xlmr-baseline-fp16 dumped/
mv xnli-xlmr-baseline-fp16 dumped/
```
To reproduce Table 1 results, run:

```bash
bash scripts/exp_scripts/xnli/FewShotBenchmark/xlmr-baseline-few-shot-benchmark.sh
```

To reproduce Table 2 results, run:

```bash
bash scripts/exp_scripts/pawsx/FewShotBenchmark/xlmr-baseline-few-shot-benchmark.sh
```

To reproduce Table 3 results, run:

```bash
bash scripts/exp_scripts/pawsx/FewShotBenchmark/xlmr-baseline-cross-task-xnli-benchmark.sh
```
To evaluate models trained with multiple seeds across multiple GPUs, run:

```bash
bash scripts/run.sh
```
You can accumulate the results with:

```bash
python scripts/extract_answer.py --folder_path dumped/xnli-xlmr-baseline-cross-lingual-transfer --shot 5 --lang en es --pkl_res_file_name "task-xnli-src_lang-en-tgt_lang-{}-lr_rate-0.0000075-shot-{}-seed-{}"
python scripts/extract_answer.py --folder_path dumped/pawsx-xlmr-baseline-cross-lingual-transfer --shot 5 --lang "en" "de" "fr" --pkl_res_file_name "task-pawsx-src_lang-en-tgt_lang-{}-lr_rate-0.0000075-shot-{}-seed-{}"
python scripts/extract_answer.py --folder_path dumped/pawsx-xlmr-cross-task-fewshot-benchmark --shot 5 --lang "en" "de" "fr" --pkl_res_file_name "task-pawsx-src_lang-en-tgt_lang-{}-lr_rate-0.0000075-shot-{}-seed-{}"
```
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.
When using this repository, please cite the following work:
```bibtex
@misc{bari2021nearest,
      title={Nearest Neighbour Few-Shot Learning for Cross-lingual Classification},
      author={M Saiful Bari and Batool Haider and Saab Mansour},
      year={2021},
      eprint={2109.02221},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```