This repository includes code and results for replicating the paper "Refusal in Language Models Is Mediated by a Single Direction". The code mostly follows the original codebase with some modifications to build on NNsight library.
All the required dependencies and packages are listed in environment.yml
. To create a virtual enviornment from it, run the following command:
conda env create --name envname --file=environment.yml
To reproduce the main results, run the following command:
python -m refusal_direction.run --model_path MODEL_PATH
Alternatively, you can create a config.yaml
file and use --config_file path/to/config.yaml
instead.
The main pipeline includes the following steps:
- Data preprocessing
- Generate candidate directions
- Select a direction
- Run and save completions on evaluation datasets with different intervention
- Evaluate cross entropy loss on harmless data
- Evaluate model coherence on general language benchmarks
You can also resume from any of the steps listed above. For example, to resume from Step 4 to evaluate model loss, you can run the command:
python -m refusal_direction.run --config_file path/to/config.yaml --resume_from_step 4
Finally, to compute the refusal scores (substring match) and safety scores (Llama Guard 2) on the generated completions (from Step 3), run the command as follows:
python -m refusal_direction.run --config_file path/to/config.yaml --run_jailbreak_eval
Note: For the model coherence evaluation, the code is not provided in the paper's GitHub repository. We currently have implemented MMLU, ARC, and TruthfulQA tasks. We observe some accuracy differences between our implementation and results reported in the original paper. There may be some implementation differences.
The evaluation in the original paper mainly focuses on evaluating the refusal direction. Additionally, we include evaluation on the magnitude of the refusal vector, which assesses how well the scalar projection on the refusal vector can reflect the degree of refusal scores in the model outputs. We perform evaluation on the harmful/harmless test split and also on a more challenging dataset, XSTest.
To run this evaluation, you can use the following command:
python -m refusal_direction.run --config_file path/to/config.yaml --run_magnitude_eval
Further analysis and results can be found in analysis/magnitude_evaluation.ipynb
.
- Refusal in language models is mediated by a single direction. (Arditi et al., 2024) [Github repo]
- NNsight library: https://nnsight.net/
- XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models (Röttger et al., NAACL 2024) [GitHub repo]