diff --git a/_posts/2025-04-28-localization.md b/_posts/2025-04-28-localization.md new file mode 100644 index 000000000..62016823a --- /dev/null +++ b/_posts/2025-04-28-localization.md @@ -0,0 +1,454 @@ +--- +layout: distill +title: Does Editing Provide Evidence for Localization? +description: A basic aspiration for interpretability research in large language models is to localize semantically meaningful behaviors to particular components within the LLM. There are various heuristics for finding candidate locations within the LLM. Once a candidate localization is found, it can be assessed by editing the internal representations at the corresponding localization and checking whether this induces model behavior that is consistent with the semantic interpretation of the localization. The question we address here is, how strong is the evidence provided by such edits? To assess localization, we want to assess the effect of the optimal intervention at a particular location. The key new technical tool is a way of adapting LLM alignment techniques to find such optimal localized edits. With this tool in hand, we give an example where the edit-based evidence for localization appears strong, but where localization clearly fails. Indeed, we find that optimal edits at random localizations can be as effective as aligning the full model. In aggregate, our results suggest that merely observing that localized edits induce targeted changes in behavior provides little to no evidence that these locations actually encode the target behavior. +date: 2025-04-28 +future: true +htmlwidgets: true +hidden: false + +authors: + - name: Anonymous + +bibliography: 2025-04-28-localization.bib + +toc: + - name: Introduction + - name: Backgrounds and results from ITI + - name: Editing Localized Heads Modifies the Output as Expected + - name: Finding "optimal" interventions + - name: Optimal interventions at localized heads are nearly optimal, but so are random heads + - name: Intervening on a single head is just as effective + - name: Are the Probing-Localized Heads Anything Special? + - name: Discussion + - name: Experiment Details + +_styles: > + .fake-img { + background: #bbb; + border: 1px solid rgba(0, 0, 0, 0.1); + box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); + margin-bottom: 12px; + } + .fake-img p { + font-family: monospace; + color: white; + text-align: left; + margin: 12px 0; + text-align: center; + font-size: 16px; + } +--- + + +## Introduction +

+A basic goal of interpretability research for large language models is to map semantically meaningful behavior to particular subcomponents of the model. +"Semantically meaningful" encompasses a wide range of things, e.g., "when asked for directions to the Eiffel Tower, the model gives directions to Paris", "the model responds truthfully", or "the model will refuse to respond". The aim is to find, e.g., neurons, circuits, or regions of representation space that control these behaviors. If we could find such localizations, we could use them as building blocks for understanding complex model behaviors. +Many interpretability approaches can be understood in terms of the following idealized template (sketched in code after the list): +

+ +

+1. We use some heuristic to find a candidate location in the model that is conjectured to be responsible for a particular behavior. +

+ +

+2. We then run the model with some set of inputs, and collect the model's internal representations for each input. +

+ +

+3. Then, we edit each of these representations at the candidate location, and generate new outputs according to the edited representations. +

+ +

+4. If the edit changes the model's behavior in the manner that would be expected from changing the target behavior, we take this as evidence in support of localization. +
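+To make the template concrete, here is a minimal sketch of steps 2-4 using a forward hook in PyTorch. The model, the choice of layer, and the `edit_vector` below are hypothetical placeholders for illustration, not the setup studied in this post.
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Step 1 (assumed done): some heuristic picked layer 6 as the candidate location.
+tok = AutoTokenizer.from_pretrained("gpt2")
+model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
+layer = model.transformer.h[6]                         # conjectured location (placeholder)
+edit_vector = 0.1 * torch.randn(model.config.n_embd)   # placeholder edit direction
+
+def edit_hook(module, inputs, output):
+    # Step 3: add the edit to this layer's output (first element = hidden states).
+    hidden = output[0] + edit_vector.to(output[0].dtype)
+    return (hidden,) + output[1:]
+
+ids = tok("Q: Where is the Eiffel Tower? A:", return_tensors="pt").input_ids
+baseline = model.generate(ids, max_new_tokens=20)      # step 2: unedited behavior
+handle = layer.register_forward_hook(edit_hook)
+edited = model.generate(ids, max_new_tokens=20)        # step 3: edited behavior
+handle.remove()
+# Step 4: compare tok.decode(baseline[0]) with tok.decode(edited[0]).
+```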

+ +

+For example, if editing a particular location in the network shifts the model to give truthful answers, we may take this as evidence that the location meaningfully encodes truthfulness in some sense. Or, if editing a location causes the model to act as though the Eiffel Tower is in Rome, we may take this as evidence that the location encodes the concept of the Eiffel Tower. The basic question in this paper is: how strong is this evidence? That is, to what extent can we conclude that a particular location in the model is responsible for a particular behavior based on the success of editing at that location? +

+ +

+Our core contribution is an example where editing-based evidence appears very strong, but where localization clearly fails. The example replicates the setup of Inference-Time Intervention (ITI), where the target concept is truthfulness, and the localization is in a small subset of 16 attention heads. Following ITI, we use logit-linear probing to identify candidate heads. We then search for the optimal localized edit to apply at these heads. Remarkably, we find that the optimal edit induces truthfulness behavior that is essentially as good as finetuning the entire model to be truthful. That is, the localized edit is as effective as can possibly be expected. Intuitively, this appears to be strong evidence that the locations found by the heuristic (probing) are indeed closely linked to the target concept (truthfulness). However, we then show that this evidence is misleading. We find that applying optimal edits to random heads is just as effective as applying them to the localized heads. Accordingly, the edit-based evidence provides no support for the localization hypothesis. +

+ +

+A possible out here is that 16 attention heads is too many, leaving us with significant leeway to induce any behavior we want with editing. We further strengthen the example by showing that it is possible to find a single head in the model where editing at that head is as effective as finetuning the entire model. This appears to be the strongest edit-based evidence for localization possible. However, we show that there are in fact multiple such heads. That is, there is simply no single privileged location that can be identified as responsible for the target behavior. +

+ +

+Our results suggest that the evidence provided by editing is weak, and that the success of editing at a particular location is not a reliable indicator of the location's importance for the target behavior. This seems to significantly constrain what can be learned from interpretability methods. It also points to the need for a more rigorous development of such techniques, including both precise statements of what the goals are, and well-grounded standards for evidence that these goals have been met. +

+ +

+The technical development in this paper relies on finding the optimal intervention at a specified location. To that end, we develop a method for localizing LoRA-type finetuning to specific locations. This then allows us to frame the search for optimal edits as a finetuning-type optimization problem. This method may also be of independent interest. +

+ + +## Backgrounds and Results from ITI + +

+We replicate the setup of ITI. +

+ +

+### Dataset and Model Architecture

+ +

+We use TruthfulQA as our dataset. It contains 817 questions that humans might answer incorrectly due to misconceptions. Each question contains an average of 3.2 truthful answers and 4.1 false answers. We use 60% of the questions for training, and the rest for validation and testing. +

+ +

+We use an Alpaca-7B model that is finetuned from the Llama-7B base model. The model consists of $L = 32$ layers, each consisting of a Multi-head Attention (MHA) layer and a Multilayer Perceptron (MLP) layer. We focus on the MHA layers, each of which has $H = 32$ attention heads, with each head having dimension $D = 128$ (so the hidden dimension is $DH = 4096$). +

+ +

+Ignoring MLP and layer normalization, the computation at layer $l$ can be written as: +

+ +$$ +\mathbf{o}_h^l := \text{Attn}_h^l(\mathbf{r}^l) \in \mathbb{R}^D +$$ + +$$ +\mathbf{o}^l := [(\mathbf{o}_1^l)^T, \ldots, (\mathbf{o}_H^l)^T]^T \in \mathbb{R}^{DH} +$$ + +$$ +W^l := [W_1^l, \ldots, W_H^l] \in \mathbb{R}^{DH \times DH} +$$ + +$$ +\mathbf{r}^{l+1} := \mathbf{r}^l + W^l \mathbf{o}^l = \mathbf{r}^l + \sum_{h=1}^H W_h^l \mathbf{o}_h^l \in \mathbb{R}^{DH} +$$ +

+where $\mathbf{r}^l \in \mathbb{R}^{DH}$ is the residual stream before layer $l$, $\text{Attn}_h^l$ is the $h$-th attention module at layer $l$, with $\mathbf{o}_h^l$ being its output. $\mathbf{o}^l$ is the concatenation of the head outputs, and $W^l$ is the project-out matrix, which applies $H$ independent linear transformations to the corresponding head outputs. Finally, $\mathbf{r}^{l+1}$ is the residual stream output after layer $l$. +
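+As a quick sanity check on this notation, the following sketch verifies numerically that applying the project-out matrix to the concatenated head outputs equals the sum of the per-head projections (random tensors stand in for real activations).
+
+```python
+import torch
+
+D, H = 128, 32                                        # head dimension and number of heads
+o_heads = [torch.randn(D, dtype=torch.float64) for _ in range(H)]   # per-head outputs o_h^l
+o_cat = torch.cat(o_heads)                            # concatenated o^l in R^{DH}
+W = torch.randn(D * H, D * H, dtype=torch.float64)    # project-out matrix W^l
+W_heads = W.split(D, dim=1)                           # column blocks W_h^l in R^{DH x D}
+
+lhs = W @ o_cat                                       # W^l o^l
+rhs = sum(W_h @ o_h for W_h, o_h in zip(W_heads, o_heads))  # sum_h W_h^l o_h^l
+assert torch.allclose(lhs, rhs)
+```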

+ +

+### Localization and Intervention Using Activation Statistics

+ +

+To localize, we collect representations for positive and negative examples, and use probing to find where the truthfulness concept is represented. To intervene, we find the direction that best separates the activations of positive and negative examples, and add this direction to the representation. +

+ +

+Each example is of the form $(x, y, x_{\text{random}})$, concatenating a question $x$, a corresponding answer $y$, and another random question $x_{\text{random}}$. For positive examples, we use a truthful response $y = y_{+}$, and for negative examples, we use an untruthful response $y = y_{-}$. To collect the representations, we feed the positive and negative examples through the model, and collect the activations of the attention heads, $\{\mathbf{o}_h^l\}_{h \in [H], l \in [L]}$, at the last token. +

+ +

+For each of the $L \times H$ head locations, we train a logistic regression probe on the $D$-dimensional activations to predict whether an example is positive or negative. We then pick the $K = 16$ attention heads with the highest probing accuracies as the localized heads. +

+ +

+For each selected head at $(l, h)$, we find the direction $\mathbf{u}_h^l$ that is "best" at separating the activations of positive and negative examples. There are several variants, but according to ITI, the best option is the mass mean shift, which is the difference between the average positive and negative activations. We then estimate the standard deviation $\sigma_h^l$ of the activations along this direction, and use the scaled direction $\boldsymbol{\theta}_h^l := \sigma_h^l \mathbf{u}_h^l$ as the intervention vector, which is added to the corresponding head output at each decoding step during inference. +
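+A minimal sketch of this probing-and-direction step for a single head, assuming the activations and labels have already been collected into arrays. The variable names are ours, not from the ITI code, and in practice the probing accuracy is measured on a held-out split.
+
+```python
+import numpy as np
+from sklearn.linear_model import LogisticRegression
+
+def probe_and_direction(acts, labels):
+    """acts: (N, D) head activations at the last token; labels: (N,) 1 = truthful, 0 = untruthful."""
+    probe = LogisticRegression(max_iter=1000).fit(acts, labels)
+    acc = probe.score(acts, labels)                   # probing accuracy used to rank heads
+    u = acts[labels == 1].mean(0) - acts[labels == 0].mean(0)   # mass mean shift
+    u = u / np.linalg.norm(u)
+    sigma = np.std(acts @ u)                          # std of activations along the direction
+    theta = sigma * u                                 # intervention vector for this head
+    return acc, theta
+
+# Toy example with random data standing in for real activations (D = 128).
+rng = np.random.default_rng(0)
+acts, labels = rng.normal(size=(200, 128)), rng.integers(0, 2, size=200)
+acc, theta = probe_and_direction(acts, labels)
+```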

+ +

+More specifically, the applied intervention is: +

+ +$$ +\mathbf{r}^{l+1}_{\text{ITI}} := \mathbf{r}^l + W^l ( \mathbf{o}^l + \alpha \boldsymbol{\theta}^l) +$$ + +$$ += \mathbf{r}^{l+1}_{\text{orig}} + \alpha W^l \boldsymbol{\theta}^l = \mathbf{r}^{l+1}_{\text{orig}} + \alpha \sum_{h=1}^H W_h^l \boldsymbol{\theta}_h^l +$$ +

+where $\boldsymbol{\theta}^l$ is the concatenation of the intervention vectors across all heads at layer $l$ (with $\boldsymbol{\theta}_h^l = \mathbf{0}$ for heads that are not selected), and $\alpha$ is the intervention strength. This intervention is repeated for each next-token prediction autoregressively until the whole answer is completed. +
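+Operationally, the intervention adds the constant vector $\alpha W^l \boldsymbol{\theta}^l$ to the attention block's output at every decoding step. Below is a hedged sketch using forward hooks; the `model.model.layers[l].self_attn.o_proj` path assumes a LLaMA-style implementation and is not code from ITI.
+
+```python
+import torch
+
+def add_iti_hooks(model, thetas, alpha):
+    """thetas maps layer index l -> concatenated intervention vector theta^l of shape (D*H,),
+    which is zero at heads that are not being intervened on."""
+    handles = []
+    for l, theta in thetas.items():
+        o_proj = model.model.layers[l].self_attn.o_proj   # the project-out matrix W^l
+        with torch.no_grad():
+            shift = alpha * (o_proj.weight @ theta)       # alpha * W^l theta^l
+
+        def hook(module, inputs, output, shift=shift):
+            # Fires on every forward pass, hence autoregressively for each new token.
+            return output + shift.to(output.dtype)
+
+        handles.append(o_proj.register_forward_hook(hook))
+    return handles   # call handle.remove() on each to restore the original model
+```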

+ +

+### Evaluation Metrics

+ +

+Since the goal is to assess the model's generation quality, it is natural to use the truthfulness and informativeness of generations as the evaluation metrics. Following ITI, we use GPT-judge models to evaluate the model's generations for truthfulness and informativeness, and use Info*Truth (the product of the scalar truthfulness and informativeness scores) as the main metric. +

+ +

+We also report the other metrics from the ITI paper: the KL divergence of the model's next-token prediction distribution post- versus pre-intervention, and multiple-choice accuracy (MC), which is determined by comparing the conditional probabilities of candidate answers given the question. +
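+For concreteness, here is a sketch of how these metrics can be computed once the per-generation judgments and the relevant probabilities are available; the actual judging and scoring pipeline follows ITI, so treat the helpers below as illustrative.
+
+```python
+import numpy as np
+
+def info_truth_score(truth_labels, info_labels):
+    """Info*Truth: product of the average truthfulness and average informativeness."""
+    return float(np.mean(truth_labels)) * float(np.mean(info_labels))
+
+def next_token_kl(p_pre, p_post, eps=1e-12):
+    """KL divergence between pre- and post-intervention next-token distributions."""
+    p = np.asarray(p_pre, dtype=float) + eps
+    q = np.asarray(p_post, dtype=float) + eps
+    p, q = p / p.sum(), q / q.sum()
+    return float(np.sum(p * np.log(p / q)))
+
+def mc_correct(logp_true_answers, logp_false_answers):
+    """MC-style check: is the most probable true answer ranked above every false answer?"""
+    return max(logp_true_answers) > max(logp_false_answers)
+```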

+ +

+## Editing Localized Heads Modifies the Output as Expected

+ +

+In ITI, the authors find that editing the 16 localized heads (out of a total of 1024 heads) successfully steers model generations to be more truthful while still being informative. They also find that intervening on all attention heads doesn't make model generations more truthful than intervening only at the localized heads. This seems to suggest that the truthfulness concept is indeed encoded in the localized heads. +

+ +

+We now strengthen this evidence further. As a comparison, we check whether ITI-style interventions at random heads can also make model generations more truthful. More specifically: +

+ +

+1. Randomly select 16 heads, and compute intervention vectors $\theta$'s accordingly. +

+ +

+2. Apply varying intervention strengths $\alpha$, collect model generations, and compute scores for truthfulness and informativeness using GPT-judge across all intervention strengths. +

+ +

+3. Repeat this procedure 16 times. +

+ +

+We find that interventions at the localized heads are more effective than interventions at random heads. In the figure below we report the Info*Truth score (the average truthfulness score times the average informativeness score). Using the localized heads gives significantly higher Info*Truth scores than using random heads (p-value $1.6 \times 10^{-8}$). In fact, using random heads often has no noticeable effect on truthfulness at all, as shown in the truth-info plot and the KL-MC trade-off plot below. +

+ +
+
+ {% include figure.html path="assets/img/2025-04-28-localization/hist_iti.png" title="Info*Truth Scores" class="img-fluid" %} +
+ Info*Truth Scores +
+
+
+ {% include figure.html path="assets/img/2025-04-28-localization/iti_truth_info.png" title="Truth vs Info Scores" class="img-fluid" %} +
+ Truth vs Info Scores +
+
+
+ {% include figure.html path="assets/img/2025-04-28-localization/iti_kl_mc.png" title="KL vs MC Scores" class="img-fluid" %} +
+ KL vs MC Scores +
+
+
+
+Localized heads perform much better than random heads when using ITI interventions. We observe better Info*Truth scores, a better truth-info tradeoff, and a better MC-KL tradeoff.
+ +

+This appears to add further evidence that the localized heads are "special" for the truthfulness concept. However, this strong association could be because the intervention and the localization are "correlated", since both use statistics of the same activations (determined by the design of the data, etc.). For example, for heads with very low probing accuracy, the estimated intervention vectors could be very noisy, and thus the interventions could be less effective. +

+ +

+## Finding "Optimal" Interventions

+ +

+To test whether a particular behavior is localized to a specific location, we would like to assess the effect of the optimal intervention at that location. In the case of our running example, we want the localized edit to the representation space that does the best job of steering the model's generations to be more truthful while maintaining informativeness. The questions are then: what is the best we could hope to achieve (i.e., what is "optimal")? And (how) can we find a localized edit that achieves it? +

+ +

+### Fitting the Alignment Objective Gives Optimal Interventions

+ +
+{% include figure.html path="assets/img/2025-04-28-localization/stronger_evidence_for_loc.png" class="img-fluid" %} +
+IPO interventions achieve much better performance than ITI interventions. Using IPO interventions at the localized heads gives a nearly optimal info-truth tradeoff as well.
+
+ +

+The key observation is that the dataset used to construct positive and negative examples can be restructured as paired "preference" data $\{(x_i, y_i^+, y_i^-)\}_i$, where $x_i$ is the question, $y_i^+$ is the truthful answer, and $y_i^-$ is the untruthful answer. Since the goal is to make model generations more truthful, we can directly adopt contrastive alignment methods for biasing the model towards the truthful answers. In this case, we use the IPO learning objective, where the goal is to upweight probabilities for $y_i^+$ and downweight probabilities for $y_i^-$ (up to some threshold): +

+ +

+$ +\text{argmin}_{\phi} \sum_i \left[\log \left( \frac{\pi_{\phi}(y_i^{+} | x_i)}{\pi_0(y_i^{+} | x_i)} / \frac{\pi_{\phi}(y_i^{-} | x_i)}{\pi_0(y_i^{-} | x_i)} \right) - \frac{\tau^{-1}}{2}\right]^2 +$ +

+ +

+where $\pi_\phi(\cdot \vert x)$ is the finetuned model's generation probability, $\pi_0(\cdot \vert x)$ is the original model's generation probability, and $\tau$ controls the threshold. Ideally, the optimized $\pi_{\phi^*}(\cdot \vert x)$ should generate responses that are more truthful than the original model, while minimally affecting off-target aspects of the generation (in this case, the informativeness of the responses). +
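+A minimal sketch of this objective as a per-batch loss over summed answer log-probabilities (a standard way to implement IPO; the tensor names are ours).
+
+```python
+import torch
+
+def ipo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, tau=0.3):
+    """Each input is a (batch,) tensor of summed token log-probs of y+ / y- under the
+    finetuned policy (logp_*) or the frozen reference model (ref_logp_*)."""
+    # h = log[ (pi_phi(y+|x) / pi_0(y+|x)) / (pi_phi(y-|x) / pi_0(y-|x)) ]
+    h = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
+    # IPO regresses h onto the margin tau^{-1} / 2.
+    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()
+
+# Example call with made-up log-probabilities for a single preference pair.
+loss = ipo_loss(torch.tensor([-12.3]), torch.tensor([-15.0]),
+                torch.tensor([-13.1]), torch.tensor([-14.2]))
+```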

+ +

+To test the effectiveness of IPO alignment, we finetune the weights of the project-out matrices $W^l$ defined above using (rank-1) LoRA. The finetuned model gives a nearly perfect trade-off between truthfulness and informativeness, far better than the ITI interventions (see the figure above). This also suggests that the ITI heuristic is very far from optimal, and contrasts with the ITI finding that intervening on all heads doesn't make model generations more truthful. +

+ + +

+We treat this result as the overall best performance that we can achieve with interventions. We want to see whether optimal interventions at the localized heads can achieve the same performance, and whether optimal interventions at random heads can as well. +

+ +

+### Connecting Weight Updates to Representation Editing

+ +

+The connection to IPO lets us search for the best possible update to the model's weights. However, we are interested in localized edits to model representations. To continue, we need to connect weight edits to representation edits. +

+ +

+#### Rank-1 LoRA

+ +

+Directly applying rank-1 LoRA to $W^l$, we can view the effect of adding the LoRA weight update as an edit to the representation as follows: +

+ +

+$ +\mathbf{r}_{\text{LoRA}}^{l+1} := \mathbf{r}^l + (W^l + \mathbf{b}^l (\mathbf{a}^l)^T)\mathbf{o}^l = \mathbf{r}_{\text{orig}}^{l+1} + \left< \mathbf{a}^l, \mathbf{o}^l\right> \mathbf{b}^l +$ +

+ +

+where $\mathbf{a}^l, \mathbf{b}^l$ are the LoRA weights to optimize. Comparing with the ITI intervention, we see that $\mathbf{b}^l$ plays the role of the added $W^l \boldsymbol{\theta}^l$, and $\left< \mathbf{a}^l, \mathbf{o}^l\right>$ is the intervention strength, but adapted to the representation $\mathbf{o}^l$. One could replace $\left< \mathbf{a}^l, \mathbf{o}^l\right>$ with a constant intervention strength, but allowing the extra flexibility is closer to the ideal of the best possible localized intervention. +
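+A quick numerical check of this identity (random tensors in place of real weights and activations):
+
+```python
+import torch
+
+DH = 4096
+W = torch.randn(DH, DH, dtype=torch.float64)           # W^l
+o = torch.randn(DH, dtype=torch.float64)               # o^l
+a, b = torch.randn(DH, dtype=torch.float64), torch.randn(DH, dtype=torch.float64)
+
+lhs = (W + torch.outer(b, a)) @ o                      # (W^l + b^l (a^l)^T) o^l
+rhs = W @ o + torch.dot(a, o) * b                      # W^l o^l + <a^l, o^l> b^l
+assert torch.allclose(lhs, rhs)
+```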

+ +

+This formulation connects weight edits to representation edits. However, it doesn't yet allow us to localize edits to specific heads: while $\boldsymbol{\theta}^l$ can be read as a concatenation of headwise intervention vectors, the projected $W^l \boldsymbol{\theta}^l$ has no corresponding interpretation. Therefore, we can't restrict the edits to specific heads by imposing structure on the $\mathbf{b}^l$'s. +

+ +

+#### Rank-1 LoRA with Reparameterization

+ +

+We can make the connection more direct by reparameterizing $\mathbf{b}^l$ as $W^l \mathbf{b}^l$ (without changing expressiveness): +

+ +

+$ +\mathbf{r}_{\text{LoRA-reparam}}^{l+1} := \mathbf{r}_{\text{orig}}^{l+1} + \left< \mathbf{a}^l, \mathbf{o}^l\right> W^l \mathbf{b}^l = \mathbf{r}_{\text{orig}}^{l+1} + \left< \mathbf{a}^l, \mathbf{o}^l\right> \sum_{h=1}^H W_h^l \mathbf{b}_h^l +$ +

+ +

+Here $\mathbf{b}_h^l$ plays the role of the intervention vector $\boldsymbol{\theta}_h^l$, and $\mathbf{a}^l$ decides the intervention strength adaptively. +

+ +

+Now we have the algorithm to find the optimal interventions for the chosen set of heads: +

+ +

+1. Finetune the model weights using reparameterized LoRA with the IPO objective. +

+ +

+2. Restrict $\mathbf{b}^l$ to be nonzero only at the chosen set of heads (a sketch of the resulting module is given below). +
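+A sketch of what this reparameterized, head-restricted rank-1 LoRA can look like as a module wrapping the project-out `nn.Linear` of one layer. This is our own illustrative implementation, not the exact training code.
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+class HeadLocalizedLoRA(nn.Module):
+    """Computes W^l o + <a, o> W^l b, with b forced to zero outside the chosen heads."""
+
+    def __init__(self, w_proj: nn.Linear, chosen_heads, num_heads=32, head_dim=128):
+        super().__init__()
+        self.w_proj = w_proj                              # frozen W^l
+        for p in self.w_proj.parameters():
+            p.requires_grad_(False)
+        hidden = num_heads * head_dim
+        # LoRA-style init: one factor small random, the other zero, so the edit starts at zero.
+        self.a = nn.Parameter(0.01 * torch.randn(hidden))
+        self.b = nn.Parameter(torch.zeros(hidden))
+        mask = torch.zeros(hidden)
+        for h in chosen_heads:                            # localize: only these heads can be edited
+            mask[h * head_dim:(h + 1) * head_dim] = 1.0
+        self.register_buffer("mask", mask)
+
+    def forward(self, o):                                 # o: (..., hidden) concatenated head outputs
+        b = self.b * self.mask                            # zero out b_h^l for non-chosen heads
+        strength = (o * self.a).sum(dim=-1, keepdim=True) # adaptive strength <a^l, o^l>
+        return self.w_proj(o) + strength * F.linear(b, self.w_proj.weight)   # + <a,o> W^l b
+```
+
+Training only `a` and `b` with the IPO objective above then searches exactly over the family of localized representation edits described by the reparameterized equation.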

+ +

+## Optimal Interventions at Localized Heads are Nearly Optimal, but so are Random Heads

+ +
+
+ {% include figure.html path="assets/img/2025-04-28-localization/hist_ipo.png" class="img-fluid" %} +
+
+ {% include figure.html path="assets/img/2025-04-28-localization/random_vs_top.png" class="img-fluid" %} +
+
+
+Using IPO optimal localized interventions, randomly selected heads perform nearly optimally for steering model generations. In particular, random heads are as good as the conjectured localized heads. The random heads are the same as those in earlier truth-info plots. +
+ +

+### Optimal Edits at Conjectured Localization

+ +

+We can now search for the best possible interventions at the localized heads. The earlier figure shows the result. We find that the optimal interventions strongly outperform the heuristic ITI interventions. Moreover, the localized interventions are about as effective as full IPO alignment! This appears to be the strongest edit-based evidence for localization that we could hope for. +

+ +

+### Optimal Edits at Random Localization

+ +

+Now, we apply the same optimal edit procedure to 16 randomly selected heads. The figure above shows the results. In short: the optimal interventions at random heads are often just as effective as the optimal interventions at the localized heads. Accordingly, the fact that editing at the localized heads was effective at steering generations provides no evidence that the truthfulness concept is localized to those heads. +

+ +

+Further, the random heads we use here are the same random heads used earlier. Under the ITI heuristic intervention, the probing-selected heads looked very different from these random heads. But we now see that this appears to be an artifact of the suboptimal interventions and the choice of metric, rather than a meaningful difference in how the heads relate to truthfulness. +

+ +

+## Intervening on a Single Head Is Just as Effective

+ +
+
+ {% include figure.html path="assets/img/2025-04-28-localization/hist_ipo_1.png" class="img-fluid" %} +
+
+ {% include figure.html path="assets/img/2025-04-28-localization/single_vs_top.png" class="img-fluid" %} +
+
+
+Editing a single head is just as effective, and there are multiple such heads!
+ +

+It is now clear that editing does not provide strong evidence for localization in the 16-head setup. However, a possible way of saving localization would be to argue that 16 heads is too many, giving too much leeway to induce any behavior we want with editing. For example, if we edited half the heads of the model, it would not be surprising if we could make the model do anything we wanted. Accordingly, we might hope that there is still a valid syllogism of the form: "the localized edit is extremely constrained" and "edits at this location optimally control the target behavior" together imply "the target behavior is localized to this location". +

+ +

+To test this, we now focus on the single-head case. The procedure is simple: we randomly sample 24 single heads, one at a time, and search for the optimal intervention at each. The distribution of the best Info*Truth scores is shown in the figure above. We find 5 single heads that are just as effective, and none of them has high probing accuracy. Notice that, still, none of these heads can be understood as localizing the truthfulness concept, because there are multiple distinct locations that work equally well! That is, even in the extreme case of a very localized edit that replicates the target behavior essentially optimally, we still cannot conclude that there is evidence supporting localization. +

+ +

+## Are the Probing-Localized Heads Anything Special?

+ +

+So far, what we mean by localization is that we can change model generations on the target concept by an edit at this location. Our experiments show no evidence for this type of localization, and the probing-localized heads play no special role. +

+ +

+So, are the probing-localized heads anything special at all? +

+ +

+### Probing-Localized Heads Seem Special for MC Scores

+ +

+We do observe that these heads achieve slightly better Multiple-Choice (MC) scores compared to randomly selected heads (see figures below), although this advantage is not as pronounced as with the ITI interventions (see earlier figures). Thus, these heads may be special in terms of changing model probabilities on the given fixed dataset, which is what MC measures. +

+ +

+### The Gap Between What the Model "Knows" and What it Generates

+ +

+It's important to note that the model's probabilities for fixed responses do not directly correspond to what the model actually generates. Even if the model assigns a higher probability to a truthful response than to an untruthful one, it may still not generate the truthful response if the fixed dataset is off-policy (i.e., both probabilities are low). This highlights the well-known gap between what a model "knows" (which is the motivation behind probing) and what it ultimately generates. +

+ +
+{% include figure.html path="assets/img/2025-04-28-localization/random_vs_top_MC_KL.png" class="img-fluid" %} +
+Probing-localized heads seem somewhat special in MC scores. +
+
+ +

+### Implications

+ +

+It's possible that while the probing-localized heads are not special at all for controlling model generations, they are special for changing what the model "knows". However, we caution that the results here are not rigorous evidence for localization even in this sense. Even if there is knowledge localization in some sense, it is clear that this does not inform steering, and it does not give a way of monitoring model behavior (because changes at completely unrelated locations can change the behavior). This points to the need for making the goal of localization precise. +

+ +

+## Discussion

+ +

+The main idea in this paper is that to assess the localization of a behavior we should study the effect of the optimal intervention at the conjectured localization. The main obstacle is that, in general, it is not clear how to define or find the optimal intervention. To overcome this, we map the problem of finding the optimal intervention to the problem of finding the optimal weight update, which can be solved using existing LLM alignment methods. +

+ +

+The main result is an example where, naively, the evidence for localization appears strong, but when we use optimal interventions, the evidence disappears. +

+ +

+The particular example—truthfulness and ITI-based evidence—was selected simply because the data used to define the heuristic happens to also allow us to set up a contrastive alignment problem. The most limited read of the results here is that ITI interventions do not provide evidence for localization, and that truthfulness does not appear to be localizable. However, the broader point is that by giving an example where editing-based evidence doesn't support localization, we see that in general such edits—by themselves—cannot provide evidence for localization. This is true irrespective of the particular behavior or heuristic being evaluated. +

+ +

+Thus far, we've been a bit vague about what localization means. Editing does provide tautological evidence for localization in the sense of "it's possible to modify model behavior on such-and-such a behavior by an edit at this location". On the opposite end, the strongest possible standard would be to show that the location is unique, or at least necessary. This is the standard that would be required if our aim was, e.g., to establish that LLM truthfulness can be monitored by examining a small set of heads. Potentially, there are interesting and useful notions of localization in between these two extremes. However, we can see no useful sense of localization that is consistent with the location being only as good as a randomly selected alternative. As we have seen, heuristic edit-based evaluation cannot even rule out this case. +

+ +

+Our findings add to a growing body of work that assesses the validity of interpretability results. Niu et al. argue that the Knowledge Neuron thesis, which suggests that facts are stored in MLP weights, is an oversimplification and does not adequately explain the process of factual expression in language models. Makelov et al. demonstrate that subspace activation patching can lead to an illusory sense of interpretability, as the effects may be achieved through dormant parallel pathways rather than the hypothesized subspaces. Most relevant to our work, Hase et al. find that localization conclusions from causal tracing do not provide insight into which model MLP layer would be best to edit to override an existing stored fact. +

+ +

+Overall, the results here point to the need for precise statements of what the objectives of interpretability are. With clear objectives, it may be possible to develop theoretically grounded methods for evaluation. Precise, falsifiable statements and clear standards of evidence would suffice to prevent the kind of failure we observe in this paper. +

+ +

+## Experiment Details

+ +

+### Dataset and Model Architecture

+

+We use the TruthfulQA dataset and the Alpaca-7B model for our experiments. The dataset contains 817 questions, each with truthful and untruthful answers. We turn these into preference pairs, and use 60% of the questions for training (6,560 pairs) and the rest for validation and testing. The model consists of 32 layers, each with 32 attention heads and a hidden dimension of 4096. +
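+The pairing itself is straightforward; a sketch, with illustrative field names rather than the exact TruthfulQA schema:
+
+```python
+def make_preference_pairs(examples):
+    """examples: list of dicts with 'question', 'correct_answers', 'incorrect_answers'."""
+    pairs = []
+    for ex in examples:
+        for y_pos in ex["correct_answers"]:
+            for y_neg in ex["incorrect_answers"]:
+                pairs.append({"x": ex["question"], "y_pos": y_pos, "y_neg": y_neg})
+    return pairs
+```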

+ +

+### Training Details

+

+We use the IPO objective and sweep the hyperparameter $\tau \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$. We train for two epochs with a cosine learning-rate scheduler and a batch size of 4, using the "paged_adamw_32bit" optimizer. When training with different numbers of heads, we find that smaller numbers of heads benefit from higher learning rates: we use $1 \times 10^{-4}$ for all heads, $5 \times 10^{-4}$ for 16 heads, and $2 \times 10^{-3}$ for a single head. +
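+Collected in one place, an illustrative configuration for these runs (only the values stated above; this is not an exact training script):
+
+```python
+# Hyperparameters for the IPO finetuning runs described above.
+TRAIN_CONFIG = {
+    "objective": "IPO",
+    "tau_sweep": [0.1, 0.2, 0.3, 0.4, 0.5],
+    "epochs": 2,
+    "lr_scheduler": "cosine",
+    "batch_size": 4,
+    "optimizer": "paged_adamw_32bit",
+    "lora_rank": 1,
+    "learning_rate": {"all_heads": 1e-4, "16_heads": 5e-4, "single_head": 2e-3},
+}
+```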

+ +

+### Evaluation Metrics

+

+We reuse code from ITI for evaluation where possible. For the GPT-judge models, we follow the TruthfulQA protocol and finetune judge models for truthfulness and informativeness using the OpenAI API. Our finetuned judges achieve validation error similar to previously reported values. +

\ No newline at end of file diff --git a/assets/bibliography/2025-04-28-localization.bib b/assets/bibliography/2025-04-28-localization.bib new file mode 100644 index 000000000..642d85312 --- /dev/null +++ b/assets/bibliography/2025-04-28-localization.bib @@ -0,0 +1,242 @@ +@article{li2024inference, + title={Inference-time intervention: Eliciting truthful answers from a language model}, + author={Li, Kenneth and Patel, Oam and Vi{\'e}gas, Fernanda and Pfister, Hanspeter and Wattenberg, Martin}, + journal={Advances in Neural Information Processing Systems}, + volume={36}, + year={2024} +} + +@article{lin2021truthfulqa, + title={Truthfulqa: Measuring how models mimic human falsehoods}, + author={Lin, Stephanie and Hilton, Jacob and Evans, Owain}, + journal={arXiv preprint arXiv:2109.07958}, + year={2021} +} + +@article{hase2024does, + title={Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models}, + author={Hase, Peter and Bansal, Mohit and Kim, Been and Ghandeharioun, Asma}, + journal={Advances in Neural Information Processing Systems}, + volume={36}, + year={2024} +} + +@article{taori2023alpaca, + title={Alpaca: A strong, replicable instruction-following model}, + author={Taori, Rohan and Gulrajani, Ishaan and Zhang, Tianyi and Dubois, Yann and Li, Xuechen and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B}, + journal={Stanford Center for Research on Foundation Models. https://crfm. stanford. edu/2023/03/13/alpaca. html}, + volume={3}, + number={6}, + pages={7}, + year={2023} +} + + + +@inproceedings{azar2024general, + title={A general theoretical paradigm to understand learning from human preferences}, + author={Azar, Mohammad Gheshlaghi and Guo, Zhaohan Daniel and Piot, Bilal and Munos, Remi and Rowland, Mark and Valko, Michal and Calandriello, Daniele}, + booktitle={International Conference on Artificial Intelligence and Statistics}, + pages={4447--4455}, + year={2024}, + organization={PMLR} +} + +@article{hu2021lora, + title={Lora: Low-rank adaptation of large language models}, + author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu}, + journal={arXiv preprint arXiv:2106.09685}, + year={2021} +} + +## representation engineering +@article{zou2023representation, + title={Representation engineering: A top-down approach to ai transparency}, + author={Zou, Andy and Phan, Long and Chen, Sarah and Campbell, James and Guo, Phillip and Ren, Richard and Pan, Alexander and Yin, Xuwang and Mazeika, Mantas and Dombrowski, Ann-Kathrin and others}, + journal={arXiv preprint arXiv:2310.01405}, + year={2023} +} + + +@article{arditi2024refusal, + title={Refusal in Language Models Is Mediated by a Single Direction}, + author={Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Rimsky, Nina and Gurnee, Wes and Nanda, Neel}, + journal={arXiv preprint arXiv:2406.11717}, + year={2024} +} + +@article{wang2023backdoor, + title={Backdoor activation attack: Attack large language models using activation steering for safety-alignment}, + author={Wang, Haoran and Shu, Kai}, + journal={arXiv preprint arXiv:2311.09433}, + year={2023} +} + +@inproceedings{chen2024truth, + title={Truth forest: Toward multi-scale truthfulness in large language models through intervention without tuning}, + author={Chen, Zhongzhi and Sun, Xingwu and Jiao, Xianfeng and Lian, Fengzong and Kang, Zhanhui and Wang, Di and Xu, Chengzhong}, + 
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence}, + volume={38}, + number={19}, + pages={20967--20974}, + year={2024} +} + +@article{wei2024assessing, + title={Assessing the brittleness of safety alignment via pruning and low-rank modifications}, + author={Wei, Boyi and Huang, Kaixuan and Huang, Yangsibo and Xie, Tinghao and Qi, Xiangyu and Xia, Mengzhou and Mittal, Prateek and Wang, Mengdi and Henderson, Peter}, + journal={arXiv preprint arXiv:2402.05162}, + year={2024} +} + +## mechanistic intrepretability +@article{meng2022locating, + title={Locating and editing factual associations in GPT}, + author={Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan}, + journal={Advances in Neural Information Processing Systems}, + volume={35}, + pages={17359--17372}, + year={2022} +} + +@article{vig2020causal, + title={Causal mediation analysis for interpreting neural nlp: The case of gender bias}, + author={Vig, Jesse and Gehrmann, Sebastian and Belinkov, Yonatan and Qian, Sharon and Nevo, Daniel and Sakenis, Simas and Huang, Jason and Singer, Yaron and Shieber, Stuart}, + journal={arXiv preprint arXiv:2004.12265}, + year={2020} +} + +@article{geiger2021causal, + title={Causal abstractions of neural networks}, + author={Geiger, Atticus and Lu, Hanson and Icard, Thomas and Potts, Christopher}, + journal={Advances in Neural Information Processing Systems}, + volume={34}, + pages={9574--9586}, + year={2021} +} + + +@article{soulos2019discovering, + title={Discovering the compositional structure of vector representations with role learning networks}, + author={Soulos, Paul and McCoy, Tom and Linzen, Tal and Smolensky, Paul}, + journal={arXiv preprint arXiv:1910.09113}, + year={2019} +} + +@article{finlayson2021causal, + title={Causal analysis of syntactic agreement mechanisms in neural language models}, + author={Finlayson, Matthew and Mueller, Aaron and Gehrmann, Sebastian and Shieber, Stuart and Linzen, Tal and Belinkov, Yonatan}, + journal={arXiv preprint arXiv:2106.06087}, + year={2021} +} + + + +@article{wang2022interpretability, + title={Interpretability in the wild: a circuit for indirect object identification in gpt-2 small}, + author={Wang, Kevin and Variengien, Alexandre and Conmy, Arthur and Shlegeris, Buck and Steinhardt, Jacob}, + journal={arXiv preprint arXiv:2211.00593}, + year={2022} +} + +@inproceedings{chan2022causal, + title={Causal scrubbing: A method for rigorously testing interpretability hypotheses}, + author={Chan, Lawrence and Garriga-Alonso, Adria and Goldowsky-Dill, Nicholas and Greenblatt, Ryan and Nitishinskaya, Jenny and Radhakrishnan, Ansh and Shlegeris, Buck and Thomas, Nate}, + booktitle={AI Alignment Forum}, + pages={1828--1843}, + year={2022} +} + + +@article{hanna2024does, + title={How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model}, + author={Hanna, Michael and Liu, Ollie and Variengien, Alexandre}, + journal={Advances in Neural Information Processing Systems}, + volume={36}, + year={2024} +} + +@article{conmy2023towards, + title={Towards automated circuit discovery for mechanistic interpretability}, + author={Conmy, Arthur and Mavor-Parker, Augustine and Lynch, Aengus and Heimersheim, Stefan and Garriga-Alonso, Adri{\`a}}, + journal={Advances in Neural Information Processing Systems}, + volume={36}, + pages={16318--16352}, + year={2023} +} + + +@article{todd2023function, + title={Function vectors in large language models}, + author={Todd, Eric and Li, Millicent L and 
Sharma, Arnab Sen and Mueller, Aaron and Wallace, Byron C and Bau, David}, + journal={arXiv preprint arXiv:2310.15213}, + year={2023} +} + + +@article{hendel2023context, + title={In-context learning creates task vectors}, + author={Hendel, Roee and Geva, Mor and Globerson, Amir}, + journal={arXiv preprint arXiv:2310.15916}, + year={2023} +} + + +## Truthfulness +@article{joshi2023personas, + title={Personas as a way to model truthfulness in language models}, + author={Joshi, Nitish and Rando, Javier and Saparov, Abulhair and Kim, Najoung and He, He}, + journal={arXiv preprint arXiv:2310.18168}, + year={2023} +} + +@article{wang2020language, + title={Language models are open knowledge graphs}, + author={Wang, Chenguang and Liu, Xiao and Song, Dawn}, + journal={arXiv preprint arXiv:2010.11967}, + year={2020} +} + +@article{kadavath2022language, + title={Language models (mostly) know what they know}, + author={Kadavath, Saurav and Conerly, Tom and Askell, Amanda and Henighan, Tom and Drain, Dawn and Perez, Ethan and Schiefer, Nicholas and Hatfield-Dodds, Zac and DasSarma, Nova and Tran-Johnson, Eli and others}, + journal={arXiv preprint arXiv:2207.05221}, + year={2022} +} + +@article{saunders2022self, + title={Self-critiquing models for assisting human evaluators}, + author={Saunders, William and Yeh, Catherine and Wu, Jeff and Bills, Steven and Ouyang, Long and Ward, Jonathan and Leike, Jan}, + journal={arXiv preprint arXiv:2206.05802}, + year={2022} +} + +@article{burns2022discovering, + title={Discovering latent knowledge in language models without supervision}, + author={Burns, Collin and Ye, Haotian and Klein, Dan and Steinhardt, Jacob}, + journal={arXiv preprint arXiv:2212.03827}, + year={2022} +} + +@misc{openai2020api, + author = {{OpenAI}}, + title = {OpenAI API}, + year = {2020}, + url = {https://openai.com/blog/openai-api/}, + note = {Accessed: 2021-08-19} +} + + +@inproceedings{makelov2023subspace, + title={Is this the subspace you are looking for? 
An interpretability illusion for subspace activation patching}, + author={Makelov, Aleksandar and Lange, Georg and Geiger, Atticus and Nanda, Neel}, + booktitle={The Twelfth International Conference on Learning Representations}, + year={2023} +} + +@article{niu2024does, + title={What does the Knowledge Neuron Thesis Have to do with Knowledge?}, + author={Niu, Jingcheng and Liu, Andrew and Zhu, Zining and Penn, Gerald}, + journal={arXiv preprint arXiv:2405.02421}, + year={2024} +} \ No newline at end of file diff --git a/assets/img/2025-04-28-localization/hist_ipo.png b/assets/img/2025-04-28-localization/hist_ipo.png new file mode 100644 index 000000000..77ea5178b Binary files /dev/null and b/assets/img/2025-04-28-localization/hist_ipo.png differ diff --git a/assets/img/2025-04-28-localization/hist_ipo_1.png b/assets/img/2025-04-28-localization/hist_ipo_1.png new file mode 100644 index 000000000..b55329574 Binary files /dev/null and b/assets/img/2025-04-28-localization/hist_ipo_1.png differ diff --git a/assets/img/2025-04-28-localization/hist_iti.png b/assets/img/2025-04-28-localization/hist_iti.png new file mode 100644 index 000000000..f244770a8 Binary files /dev/null and b/assets/img/2025-04-28-localization/hist_iti.png differ diff --git a/assets/img/2025-04-28-localization/iti_kl_mc.png b/assets/img/2025-04-28-localization/iti_kl_mc.png new file mode 100644 index 000000000..67a8a44d1 Binary files /dev/null and b/assets/img/2025-04-28-localization/iti_kl_mc.png differ diff --git a/assets/img/2025-04-28-localization/iti_truth_info.png b/assets/img/2025-04-28-localization/iti_truth_info.png new file mode 100644 index 000000000..9eaac2274 Binary files /dev/null and b/assets/img/2025-04-28-localization/iti_truth_info.png differ diff --git a/assets/img/2025-04-28-localization/random_vs_top.png b/assets/img/2025-04-28-localization/random_vs_top.png new file mode 100644 index 000000000..05aae9a22 Binary files /dev/null and b/assets/img/2025-04-28-localization/random_vs_top.png differ diff --git a/assets/img/2025-04-28-localization/random_vs_top_MC_KL.png b/assets/img/2025-04-28-localization/random_vs_top_MC_KL.png new file mode 100644 index 000000000..57c0596e0 Binary files /dev/null and b/assets/img/2025-04-28-localization/random_vs_top_MC_KL.png differ diff --git a/assets/img/2025-04-28-localization/random_vs_top_truth_KL.png b/assets/img/2025-04-28-localization/random_vs_top_truth_KL.png new file mode 100644 index 000000000..f255d248e Binary files /dev/null and b/assets/img/2025-04-28-localization/random_vs_top_truth_KL.png differ diff --git a/assets/img/2025-04-28-localization/single_vs_top.png b/assets/img/2025-04-28-localization/single_vs_top.png new file mode 100644 index 000000000..b7970f014 Binary files /dev/null and b/assets/img/2025-04-28-localization/single_vs_top.png differ diff --git a/assets/img/2025-04-28-localization/stronger_evidence_for_loc.png b/assets/img/2025-04-28-localization/stronger_evidence_for_loc.png new file mode 100644 index 000000000..cc52323c6 Binary files /dev/null and b/assets/img/2025-04-28-localization/stronger_evidence_for_loc.png differ