Replication of Refusal in Language Models Is Mediated by a Single Direction

This repository includes code and results for replicating the paper "Refusal in Language Models Is Mediated by a Single Direction". The code mostly follows the original codebase with some modifications to build on NNsight library.

Setup

All the required dependencies and packages are listed in environment.yml. To create a virtual enviornment from it, run the following command:

conda env create --name envname --file=environment.yml

Running the Code

To reproduce the main results, run the following command:

python -m refusal_direction.run --model_path MODEL_PATH

Alternatively, you can create a config.yaml file and use --config_file path/to/config.yaml instead.

The main pipeline includes the following steps:

Data preprocessing
Generate candidate directions
Select a direction
Run and save completions on evaluation datasets with different intervention
Evaluate cross entropy loss on harmless data
Evaluate model coherence on general language benchmarks

You can also resume from any of the steps listed above. For example, to resume from Step 4 to evaluate model loss, you can run the command:

python -m refusal_direction.run --config_file path/to/config.yaml --resume_from_step 4

Finally, to compute the refusal scores (substring match) and safety scores (Llama Guard 2) on the generated completions (from Step 3), run the command as follows:

python -m refusal_direction.run --config_file path/to/config.yaml --run_jailbreak_eval

Note: For the model coherence evaluation, the code is not provided in the paper's GitHub repository. We currently have implemented MMLU, ARC, and TruthfulQA tasks. We observe some accuracy differences between our implementation and results reported in the original paper. There may be some implementation differences.

Additional Evaluation

The evaluation in the original paper mainly focuses on evaluating the refusal direction. Additionally, we include evaluation on the magnitude of the refusal vector, which assesses how well the scalar projection on the refusal vector can reflect the degree of refusal scores in the model outputs. We perform evaluation on the harmful/harmless test split and also on a more challenging dataset, XSTest.

To run this evaluation, you can use the following command:

python -m refusal_direction.run --config_file path/to/config.yaml --run_magnitude_eval

Further analysis and results can be found in analysis/magnitude_evaluation.ipynb.

References

Refusal in language models is mediated by a single direction. (Arditi et al., 2024) [Github repo]
NNsight library: https://nnsight.net/
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models (Röttger et al., NAACL 2024) [GitHub repo]

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
analysis		analysis
refusal_direction		refusal_direction
runs		runs
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Replication of Refusal in Language Models Is Mediated by a Single Direction

Setup

Running the Code

Additional Evaluation

References

About

Releases

Packages

Languages

License

hannahxchen/refusal-direction-replication

Folders and files

Latest commit

History

Repository files navigation

Replication of Refusal in Language Models Is Mediated by a Single Direction

Setup

Running the Code

Additional Evaluation

References

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages