This fork contains the solution attempt by Marius and me during the Apart Alignment Jam; see here for our report.
Use mechanistic interpretability tools to reverse engineer an MNIST CNN and send me a program for the labeling function it was trained on.
Hint 1: The labels are binary.
Hint 2: The network gets 95.58% accuracy on the test set.
Hint 3: The labeling function can be described in words in one sentence.
Hint 4: This image may be helpful.
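A common first step for a challenge like this is to capture the CNN's intermediate activations and study what each layer responds to. The sketch below shows the forward-hook pattern in PyTorch; since the challenge's actual model file isn't referenced here, a small stand-in CNN (architecture assumed, untrained) takes its place.

```python
import torch
import torch.nn as nn

class StandInCNN(nn.Module):
    """Placeholder for the challenge's MNIST CNN (architecture is an assumption)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(16 * 7 * 7, 2)  # binary labels (Hint 1)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        return self.fc(x.flatten(1))

model = StandInCNN().eval()

# Register hooks that record each conv layer's output activations.
activations = {}
def save_to(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model.conv1.register_forward_hook(save_to("conv1"))
model.conv2.register_forward_hook(save_to("conv2"))

# Run a batch of (here random) 28x28 "MNIST" images through the model.
x = torch.randn(4, 1, 28, 28)
logits = model(x)

for name, act in activations.items():
    print(name, tuple(act.shape))
```

With the real challenge checkpoint loaded in place of `StandInCNN`, the recorded activation maps can be visualized per channel to look for features that correlate with the binary label.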
Use mechanistic interpretability tools to reverse engineer a transformer and send me a program for the labeling function it was trained on.
Hint 1: The labels are binary.
Hint 2: The network is trained on 50% of the examples and gets 97.27% accuracy on the held-out half.
Hint 3: Here are the ground truth and learned labels. Notice how the mistakes the network makes are all near curvy parts of the decision boundary...
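For the transformer, inspecting attention patterns is a standard starting point for reverse engineering what the model computes. The challenge model itself isn't provided here, so the sketch below uses an untrained single attention layer purely to show the mechanics; the dimensions are placeholder assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads, seq_len = 32, 4, 10  # assumed dimensions, not the challenge model's
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# A random input sequence stands in for the embedded challenge inputs.
x = torch.randn(1, seq_len, d_model)

# need_weights=True also returns the attention matrix (averaged over heads by
# default; pass average_attn_weights=False to keep per-head patterns).
out, weights = attn(x, x, x, need_weights=True)

print("output:", tuple(out.shape))         # (batch, seq, d_model)
print("attention:", tuple(weights.shape))  # (batch, seq, seq)
# Each query position's attention distribution sums to 1.
print("row sums:", weights.sum(dim=-1))
```

Plotting the per-head attention matrices for real challenge inputs, especially for the misclassified points near the curvy parts of the decision boundary, can hint at which input positions the model relies on.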
If you send me code for one of the two labeling functions along with a justified mechanistic interpretability explanation for it (e.g. in the form of a Colab notebook), the prize is a $750 donation to a high-impact charity of your choice, so the total prize pool is $1,500 across both challenges. Thanks to Neel Nanda for contributing $500!