A novel neural network gradient regularization scheme for adversarial robustness and interpretability. It works by encouraging the gradients of the network's logits with respect to the input to (1) have small magnitudes, as in Jacobian regularization, and (2) align with supplied attribution maps. Our findings indicate that the more the supplied attribution maps highlight robust features, the more robust the resulting network becomes.
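The training objective this describes can be sketched roughly as follows. This is a minimal PyTorch sketch under stated assumptions, not the package's actual implementation: the function name `gradient_regularized_loss`, the weights `lambda_mag` and `lambda_align`, the `attribution` argument, and the use of the correct-class logit as the gradient target are all illustrative.

```python
import torch
import torch.nn.functional as F

def gradient_regularized_loss(model, x, y, attribution,
                              lambda_mag=0.1, lambda_align=1.0):
    """Hypothetical sketch of the objective described above: task loss
    plus (1) a penalty on input-gradient magnitude and (2) a term
    aligning the input gradient with a supplied attribution map."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)

    # Gradient of the correct-class logit w.r.t. the input
    # (a cheap single-row proxy for the full input-logit Jacobian).
    selected = logits.gather(1, y.unsqueeze(1)).sum()
    grad = torch.autograd.grad(selected, x, create_graph=True)[0]

    # (1) Magnitude penalty, in the spirit of Jacobian regularization:
    # mean squared L2 norm of the per-example input gradient.
    mag_penalty = grad.pow(2).sum(dim=tuple(range(1, grad.dim()))).mean()

    # (2) Alignment penalty: push the input gradient toward the
    # supplied attribution map via cosine similarity.
    g = grad.flatten(1)
    a = attribution.flatten(1)
    align_penalty = (1 - F.cosine_similarity(g, a, dim=1)).mean()

    return task_loss + lambda_mag * mag_penalty + lambda_align * align_penalty
```

Targeting the correct-class logit keeps the cost to one extra backward pass per step; penalizing the full logit Jacobian, as in Jacobian regularization proper, would be a heavier alternative.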
Read our published paper for more information on the method.
To install the `ir_utils` package, run:

```
$ python setup.py install
```