This fork contains the solution attempt by Marius and me during the Apart Alignment Jam; see here for our report.
Use mechanistic interpretability tools to reverse engineer an MNIST CNN and send me a program for the labeling function it was trained on.
Hint 1: The labels are binary.
Hint 2: The network gets 95.58% accuracy on the test set.
Hint 3: The labeling function can be described in words in one sentence.
Hint 4: This image may be helpful.
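A common first step for a challenge like this is to capture the CNN's intermediate activations and study what each layer responds to. The sketch below shows the forward-hook pattern in PyTorch; since the challenge's actual model file isn't referenced here, a small stand-in CNN (architecture assumed, untrained) takes its place.

```python
import torch
import torch.nn as nn

class StandInCNN(nn.Module):
    """Placeholder for the challenge's MNIST CNN (architecture is an assumption)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(16 * 7 * 7, 2)  # binary labels (Hint 1)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        return self.fc(x.flatten(1))

model = StandInCNN().eval()

# Register hooks that record each conv layer's output activations.
activations = {}
def save_to(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model.conv1.register_forward_hook(save_to("conv1"))
model.conv2.register_forward_hook(save_to("conv2"))

# Run a batch of (here random) 28x28 "MNIST" images through the model.
x = torch.randn(4, 1, 28, 28)
logits = model(x)

for name, act in activations.items():
    print(name, tuple(act.shape))
```

With the real challenge checkpoint loaded in place of `StandInCNN`, the recorded activation maps can be visualized per channel to look for features that correlate with the binary label.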
Use mechanistic interpretability tools to reverse engineer a transformer and send me a program for the labeling function it was trained on.
Hint 1: The labels are binary.
Hint 2: The network is trained on 50% of the examples and gets 97.27% accuracy on the held-out half.
Hint 3: Here are the ground truth and learned labels. Notice how the mistakes the network makes are all near curvy parts of the decision boundary...
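For the transformer, inspecting attention patterns is a standard starting point for reverse engineering what the model computes. The challenge model itself isn't provided here, so the sketch below uses an untrained single attention layer purely to show the mechanics; the dimensions are placeholder assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads, seq_len = 32, 4, 10  # assumed dimensions, not the challenge model's
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# A random input sequence stands in for the embedded challenge inputs.
x = torch.randn(1, seq_len, d_model)

# need_weights=True also returns the attention matrix (averaged over heads by
# default; pass average_attn_weights=False to keep per-head patterns).
out, weights = attn(x, x, x, need_weights=True)

print("output:", tuple(out.shape))         # (batch, seq, d_model)
print("attention:", tuple(weights.shape))  # (batch, seq, seq)
# Each query position's attention distribution sums to 1.
print("row sums:", weights.sum(dim=-1))
```

Plotting the per-head attention matrices for real challenge inputs, especially for the misclassified points near the curvy parts of the decision boundary, can hint at which input positions the model relies on.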
If you send me code for one of the two labeling functions along with a justified mechanistic interpretability explanation for it (e.g. in the form of a Colab notebook), the prize is a $750 donation to a high-impact charity of your choice, so the total prize pool is $1,500 across both challenges. Thanks to Neel Nanda for contributing $500!