This repository contains the adversarial training code for our paper *Efficient Adversarial Training in LLMs with Continuous Attacks* (https://arxiv.org/abs/2405.15589).
Our models can be found on Huggingface.

Note: Zephyr-CAPO refuses a lot of harmless requests. We will upload a new model with a more sensible robustness-utility trade-off soon.
- Clone this repository with `git clone git@github.com:sophie-xhonneux/Continuous-AdvTrain.git`
- Install the requirements with `pip install -r requirements.txt`
- Create a config in `config/path` (see `example_path.yaml` and the sketch after this list)
- Run the code with `python src/run_experiments.py --config-name=adv_train_ul path=example_path`
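
A rough sketch of the config step is below. The file name `my_paths.yaml` and the exact location of `example_path.yaml` are assumptions for illustration; consult the `config/path` folder for the actual layout and fields.

```bash
# Copy the provided example path config and adapt it to your machine
# (assumes example_path.yaml lives under config/path/)
cp config/path/example_path.yaml config/path/my_paths.yaml

# Edit my_paths.yaml so its entries point at your local data, model,
# and output directories, then select it at run time via path=my_paths
python src/run_experiments.py --config-name=adv_train_ul path=my_paths
```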
You can also run the IPO experiments by replacing `adv_train_ul` with `adv_train_ipo`. Moreover, Hydra allows you to override any hyperparameter from the command line (e.g. add `adversarial.eps=0.075`), or you can create a new config file under the `config` folder. See the paper for the exact hyperparameters.
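
For example, the pieces above can be combined into a single command. The override value here is only illustrative; see the paper for the hyperparameters actually used.

```bash
# IPO variant with a Hydra command-line override for the attack budget
python src/run_experiments.py --config-name=adv_train_ipo path=example_path adversarial.eps=0.075
```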
The data in the `data` folder is from the HarmBench repository (https://github.com/centerforaisafety/HarmBench), with the exception of a couple of files we created as part of this paper.
If you use this code, please cite our paper:
```
@misc{xhonneux2024efficient,
      title={Efficient Adversarial Training in LLMs with Continuous Attacks},
      author={Sophie Xhonneux and Alessandro Sordoni and Stephan Günnemann and Gauthier Gidel and Leo Schwinn},
      year={2024},
      eprint={2405.15589},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```