Skip to content

hannahxchen/refusal-direction-replication

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Replication of Refusal in Language Models Is Mediated by a Single Direction

This repository includes code and results for replicating the paper "Refusal in Language Models Is Mediated by a Single Direction". The code mostly follows the original codebase with some modifications to build on NNsight library.

Setup

All the required dependencies and packages are listed in environment.yml. To create a virtual enviornment from it, run the following command:

conda env create --name envname --file=environment.yml

Running the Code

To reproduce the main results, run the following command:

python -m refusal_direction.run --model_path MODEL_PATH

Alternatively, you can create a config.yaml file and use --config_file path/to/config.yaml instead.

The main pipeline includes the following steps:

  1. Data preprocessing
  2. Generate candidate directions
  3. Select a direction
  4. Run and save completions on evaluation datasets with different intervention
  5. Evaluate cross entropy loss on harmless data
  6. Evaluate model coherence on general language benchmarks

You can also resume from any of the steps listed above. For example, to resume from Step 4 to evaluate model loss, you can run the command:

python -m refusal_direction.run --config_file path/to/config.yaml --resume_from_step 4

Finally, to compute the refusal scores (substring match) and safety scores (Llama Guard 2) on the generated completions (from Step 3), run the command as follows:

python -m refusal_direction.run --config_file path/to/config.yaml --run_jailbreak_eval

Note: For the model coherence evaluation, the code is not provided in the paper's GitHub repository. We currently have implemented MMLU, ARC, and TruthfulQA tasks. We observe some accuracy differences between our implementation and results reported in the original paper. There may be some implementation differences.

Additional Evaluation

The evaluation in the original paper mainly focuses on evaluating the refusal direction. Additionally, we include evaluation on the magnitude of the refusal vector, which assesses how well the scalar projection on the refusal vector can reflect the degree of refusal scores in the model outputs. We perform evaluation on the harmful/harmless test split and also on a more challenging dataset, XSTest.

To run this evaluation, you can use the following command:

python -m refusal_direction.run --config_file path/to/config.yaml --run_magnitude_eval

Further analysis and results can be found in analysis/magnitude_evaluation.ipynb.

References

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published