diff --git a/_posts/2025-04-28-localization.md b/_posts/2025-04-28-localization.md
new file mode 100644
index 000000000..62016823a
--- /dev/null
+++ b/_posts/2025-04-28-localization.md
@@ -0,0 +1,454 @@
+---
+layout: distill
+title: Does Editing Provide Evidence for Localization?
+description: A basic aspiration for interpretability research in large language models is to localize semantically meaningful behaviors to particular components within the LLM. There are various heuristics for finding candidate locations within the LLM. Once a candidate localization is found, it can be assessed by editing the internal representations at the corresponding localization and checking whether this induces model behavior that is consistent with the semantic interpretation of the localization. The question we address here is, how strong is the evidence provided by such edits? To assess localization, we want to assess the effect of the optimal intervention at a particular location. The key new technical tool is a way of adapting LLM alignment techniques to find such optimal localized edits. With this tool in hand, we give an example where the edit-based evidence for localization appears strong, but where localization clearly fails. Indeed, we find that optimal edits at random localizations can be as effective as aligning the full model. In aggregate, our results suggest that merely observing that localized edits induce targeted changes in behavior provides little to no evidence that these locations actually encode the target behavior.
+date: 2025-04-28
+future: true
+htmlwidgets: true
+hidden: false
+
+authors:
+  - name: Anonymous
+
+bibliography: 2025-04-28-localization.bib
+
+toc:
+  - name: Introduction
+  - name: Background and Results from ITI
+  - name: Editing Localized Heads Modifies the Output as Expected
+  - name: Finding "optimal" interventions
+  - name: Optimal interventions at localized heads are nearly optimal, but so are random heads
+  - name: Intervening a single head is just as effective
+  - name: Are the Probing-Localized Heads Anything Special?
+  - name: Discussion
+  - name: Experiment Details
+
+_styles: >
+  .fake-img {
+    background: #bbb;
+    border: 1px solid rgba(0, 0, 0, 0.1);
+    box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
+    margin-bottom: 12px;
+  }
+  .fake-img p {
+    font-family: monospace;
+    color: white;
+    text-align: left;
+    margin: 12px 0;
+    text-align: center;
+    font-size: 16px;
+  }
+---
+
+
+## Introduction
+
+A basic goal of interpretability research for large language models is to map semantically meaningful behavior to particular subcomponents of the model.
+
+"Semantically meaningful" encompasses a wide range of things, e.g., "when asked for directions to the Eiffel tower, the model gives directions to Paris", "the model responds truthfully", or "the model will refuse to respond". The aim is to find, e.g., neurons, circuits, or regions of representation space that control these behaviors. If we could find such localizations, we could use them as building blocks to understand complex model behaviors.
+
+Many interpretability approaches can be understood in terms of the following idealized template:
+
+1. We use some heuristic to find a candidate location in the model that is conjectured to be responsible for a particular behavior. +
+ ++2. We then run the model with some set of inputs, and collect the model's internal representations for each input. +
+ ++3. Then, we edit each of these representations at the candidate location, and generate new outputs according to the edited representations. +
+ ++4. If the edit changes the model's behavior in the manner that would be expected from changing the target behavior, we take this as evidence in support of localization. +
+ ++For example, if editing a particular location in the network shifts the model to give truthful answers, we may take this as evidence that the location meaningfully encodes truthfulness in some sense. Or, if editing a location causes the model to act as though the Eiffel tower is in Rome, we may take this as evidence that the location encodes the concept of the Eiffel tower. The basic question in this paper is: how strong is this evidence? That is, to what extent can we conclude that a particular location in the model is responsible for a particular behavior based on the success of editing at that location? +
+ +
+
+Our core contribution is an example where editing-based evidence appears very strong, but where localization clearly fails. The example replicates the setup of Inference-Time Intervention (ITI)
+
+A possible objection is that 16 attention heads is too many, leaving significant leeway to induce any behavior we want with editing. We further strengthen the example by showing that it is possible to find a single head in the model where editing at that head is as effective as finetuning the entire model. This appears to be the strongest edit-based evidence for localization possible. However, we show that there are in fact multiple such heads. That is, there is simply no single privileged location that can be identified as responsible for the target behavior.
+ ++Our results suggest that the evidence provided by editing is weak, and that the success of editing at a particular location is not a reliable indicator of the location's importance for the target behavior. This seems to significantly constrain what can be learned from interpretability methods. It also points to the need for a more rigorous development of such techniques, including both precise statements of what the goals are, and well-grounded standards for evidence that these goals have been met. +
+ ++The technical development in this paper relies on finding the optimal intervention at a specified location. To that end, we develop a method for localizing LoRA type finetuning to specific locations. This then allows us to frame the search for optimal edits as a finetuning-type optimization problem. This method may also be of independent interest. +
+
+
+## Background and Results from ITI
+
+We replicate the setup of ITI
+We use TruthfulQA
+We use an Alpaca-7B
+
+Ignoring the MLP and layer normalization, the computation at layer $l$ can be written as:
+
+
+$$
+\mathbf{o}_h^l := \text{Attn}_h^l(\mathbf{r}^l) \in \mathbb{R}^D
+$$
+
+$$
+\mathbf{o}^l := [(\mathbf{o}_1^l)^T, \ldots, (\mathbf{o}_H^l)^T]^T \in \mathbb{R}^{DH}
+$$
+
+$$
+W^l := [W_1^l, \ldots, W_H^l] \in \mathbb{R}^{DH \times DH}
+$$
+
+$$
+\mathbf{r}^{l+1} := \mathbf{r}^l + W^l \mathbf{o}^l = \mathbf{r}^l + \sum_{h=1}^H W_h^l \mathbf{o}_h^l \in \mathbb{R}^{DH}
+$$
+
+where $\mathbf{r}^l \in \mathbb{R}^{DH}$ is the residual stream before layer $l$, $\text{Attn}_h^l$ is the $h$-th attention module at layer $l$, and $\mathbf{o}_h^l$ is its output. $\mathbf{o}^l$ is the concatenation of the head outputs. $W^l$ is the project-out matrix, which applies $H$ independent linear transformations to the corresponding head outputs. Finally, $\mathbf{r}^{l+1}$ is the residual stream after layer $l$.
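+
+To make the headwise decomposition concrete, here is a minimal NumPy sketch checking that projecting the concatenated head outputs equals summing the per-head contributions. The shapes and variable names are illustrative assumptions, not the actual Alpaca-7B implementation.
+
+```python
+import numpy as np
+
+D, H = 128, 32                       # head dimension and number of heads (illustrative)
+rng = np.random.default_rng(0)
+
+r = rng.normal(size=D * H)           # residual stream r^l before the layer
+o = rng.normal(size=(H, D))          # per-head outputs o_h^l = Attn_h^l(r^l)
+W = rng.normal(size=(D * H, D * H))  # project-out matrix W^l
+
+# Standard formulation: concatenate the heads, then apply the full projection.
+r_next_concat = r + W @ o.reshape(-1)
+
+# Headwise formulation: head h contributes W_h^l o_h^l, where W_h^l is the
+# (DH x D) block of columns of W^l belonging to head h.
+r_next_headwise = r.copy()
+for h in range(H):
+    W_h = W[:, h * D:(h + 1) * D]
+    r_next_headwise += W_h @ o[h]
+
+assert np.allclose(r_next_concat, r_next_headwise)
+```
+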
+ ++To localize, we collect representations for positive and negative examples, and use probing to find where the truthfulness concept is represented. To intervene, we find the direction best separating activations for positive and negative examples, and apply this direction to the representation. +
+
+Each example is of the form $(x, y, x_{\text{random}})$, concatenating a question $x$, a corresponding answer $y$, and another random question $x_{\text{random}}$. For positive examples, we use a truthful response $y = y_{+}$, and for negative examples, we use an untruthful response $y = y_{-}$. To collect the representations, we feed the positive and negative examples through the model, and collect the activations of the attention heads, $\{\mathbf{o}_h^l\}_{h \in [H], l \in [L]}$, at the last token.
+
+ ++For each of the $L \times H$ head locations, we train a logistic regression probe on the $D$-dimensional activations to predict whether it's a positive or negative example. Then we pick the attention heads with the highest probing accuracies as the localized heads. +
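+
+As a concrete illustration of the probing step, here is a minimal sketch using scikit-learn. The array names, shapes, and random placeholder data are assumptions for illustration, not the authors' code.
+
+```python
+import numpy as np
+from sklearn.linear_model import LogisticRegression
+from sklearn.model_selection import train_test_split
+
+# acts[l, h] holds the (N, D) last-token activations o_h^l for N labeled examples;
+# labels[i] = 1 for a positive (truthful) example and 0 for a negative one.
+L, H, N, D = 32, 32, 200, 16          # small illustrative sizes
+acts = np.random.randn(L, H, N, D)    # placeholder activations
+labels = np.random.randint(0, 2, N)   # placeholder labels
+
+val_acc = np.zeros((L, H))
+for l in range(L):
+    for h in range(H):
+        X_tr, X_val, y_tr, y_val = train_test_split(
+            acts[l, h], labels, test_size=0.2, random_state=0)
+        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
+        val_acc[l, h] = probe.score(X_val, y_val)
+
+# Keep the 16 head locations with the highest validation probing accuracy.
+ranked = sorted(((val_acc[l, h], l, h) for l in range(L) for h in range(H)), reverse=True)
+localized_heads = [(l, h) for _, l, h in ranked[:16]]
+```
+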
+ +
+For the selected head at $(l, h)$, we find the direction $\mathbf{u}_h^l$ that is "best" at separating the activations of positive and negative examples. There are several variants, but according to
+More specifically, the applied intervention is: +
+
+$$
+\mathbf{r}^{l+1}_{\text{ITI}} := \mathbf{r}^l + W^l ( \mathbf{o}^l + \alpha \boldsymbol{\theta}^l)
+$$
+
+$$
+= \mathbf{r}^{l+1}_{\text{orig}} + \alpha W^l \boldsymbol{\theta}^l = \mathbf{r}^{l+1}_{\text{orig}} + \alpha \sum_{h=1}^H W_h^l \boldsymbol{\theta}_h^l
+$$
+
+where $\boldsymbol{\theta}^l$ is the concatenation of the headwise intervention vectors $\boldsymbol{\theta}_h^l$ across all heads at layer $l$ (with $\boldsymbol{\theta}_h^l = \mathbf{0}$ for heads that are not selected), and $\alpha$ is the intervention strength. This intervention is repeated for each next-token prediction, autoregressively, until the whole answer is completed.
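+
+As a sketch, the edit applied at a single layer can be written as follows (NumPy, with illustrative shapes and names; this is just the arithmetic of the equation above, not the actual ITI implementation).
+
+```python
+import numpy as np
+
+def iti_edit(o, W, selected_heads, theta, alpha, head_dim):
+    """Apply the ITI edit at one layer and return the layer's contribution to r^{l+1}.
+
+    o:              (H*D,) concatenated head outputs o^l
+    W:              (H*D, H*D) project-out matrix W^l
+    selected_heads: indices of the heads chosen at this layer
+    theta:          dict mapping head index -> (D,) intervention direction theta_h^l
+    alpha:          intervention strength
+    """
+    o_edited = o.copy()
+    for h in selected_heads:
+        o_edited[h * head_dim:(h + 1) * head_dim] += alpha * theta[h]
+    # Equals W^l o^l + alpha * sum_h W_h^l theta_h^l, as in the equation above.
+    return W @ o_edited
+```
+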
+ +
+
+Since the goal is to assess the model's generation quality, it is natural to use the truthfulness and informativeness scores of generations as the evaluation metrics. They use GPT-judge models
+
+We also report the other metrics used in the ITI paper: the KL divergence of the model's next-token prediction distribution post- versus pre-intervention, and multiple-choice accuracy (MC), which is determined by comparing the conditional probabilities of candidate answers given the question.
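+
+For concreteness, minimal sketches of these two metrics are given below (PyTorch). The KL direction, the aggregation, and the `logprob_fn` helper are illustrative assumptions rather than the exact evaluation code.
+
+```python
+import torch.nn.functional as F
+
+def next_token_kl(logits_edited, logits_orig):
+    """Average KL(edited || original) of the next-token prediction distribution."""
+    logp_edit = F.log_softmax(logits_edited, dim=-1)
+    logp_orig = F.log_softmax(logits_orig, dim=-1)
+    return F.kl_div(logp_orig, logp_edit, log_target=True, reduction="batchmean")
+
+def mc_accuracy(logprob_fn, questions, true_answers, false_answer_sets):
+    """Count a question as correct when the truthful answer gets the highest
+    conditional log-probability among its candidate answers."""
+    correct = 0
+    for q, y_true, y_false in zip(questions, true_answers, false_answer_sets):
+        scores = [logprob_fn(q, y) for y in [y_true, *y_false]]
+        correct += int(scores[0] == max(scores))
+    return correct / len(questions)
+```
+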
+
+In ITI, the authors find that editing the 16 localized heads (out of a total of 1024 heads) successfully steers model generations to be more truthful while still being informative. They also find that intervening on all attention heads doesn't make model generations more truthful than intervening just at the localized heads. This seems to suggest that the truthfulness concept is indeed encoded in the localized heads.
+ +
+We now strengthen this evidence further. Similar to
+
+1. Randomly select 16 heads, and compute the corresponding intervention vectors $\boldsymbol{\theta}$.
+
+2. Apply varying intervention strengths $\alpha$, collect model generations, and compute truthfulness and informativeness scores using GPT-judge across all intervention strengths.
+
+3. Repeat this 16 times.
+
+We find that interventions at the localized heads are more effective than interventions at random heads. In the figure below we report the Truth*Info score (average truthfulness score times average informativeness score). Using the localized heads gives significantly higher Truth*Info scores than using random heads (p-value $1.6 \times 10^{-8}$). In fact, using random heads often has no noticeable effect on truthfulness at all, as shown in the truth-info plot (Figure) and the KL-MC trade-off plot (Figure).
+
+This appears to add further evidence that the localized heads are "special" for the truthfulness concept. However, this strong association could be because the intervention and localization are "correlated", since both use statistics of the same activations (determined by the design of the data, etc.). For example, for heads with very low probing accuracy, the estimated intervention vectors could be very noisy, and thus the interventions could be less effective.
+
+To test whether a particular behavior is localized to a specific location, we would like to assess the effect of the optimal intervention at that location. In the case of our running example, we want to find the localized edit to the representation space that does the best job of steering the model's generations to be more truthful while maintaining informativeness. Then, the questions are: what is the best we could hope to achieve? (I.e., what is "optimal"?) And, (how) can we find a localized edit that achieves it?
+ +
+The key observation is that the dataset used to construct positive and negative examples can be restructured as paired "preference" data $\{(x_i, y_i^+, y_i^-)\}_i$, where $x_i$ is the question, $y_i^+$ is the truthful answer, and $y_i^-$ is the untruthful answer. Since the goal is to make model generations more truthful, we can directly adopt contrastive alignment methods for biasing the model towards the truthful answers. In this case, we use the IPO
+$ +\text{argmax}_{\phi} \sum_i \left[\log \left( \frac{\pi_{\phi}(y_i^{+} | x_i)}{\pi_0(y_i^{+} | x_i)} / \frac{\pi_{\phi}(y_i^{-} | x_i)}{\pi_0(y_i^{-} | x_i)} \right) - \frac{\tau^{-1}}{2}\right]^2 +$ +
+
+where $\pi_\phi(\cdot \vert x)$ is the finetuned model's generation probability, $\pi_0(\cdot \vert x)$ is the original model's generation probability, and $\tau$ is a regularization parameter that sets the target gap $\frac{\tau^{-1}}{2}$ between the two log-likelihood ratios. Ideally, the optimized $\pi_{\phi^*}(\cdot \vert x)$ should generate responses that are more truthful than the original model, while minimally affecting the off-target aspects of the generation (in this case, the informativeness of the responses).
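+
+For concreteness, a minimal sketch of this objective computed from sequence log-probabilities (PyTorch). The tensor names are illustrative; `logp_*` are assumed to be the summed token log-probabilities of each response under the finetuned and reference models.
+
+```python
+import torch
+
+def ipo_loss(logp_pos, logp_neg, logp_pos_ref, logp_neg_ref, tau=0.1):
+    """IPO loss for a batch of preference pairs (x_i, y_i^+, y_i^-).
+
+    logp_pos / logp_neg:         log pi_phi(y^+ | x) and log pi_phi(y^- | x)
+    logp_pos_ref / logp_neg_ref: the same quantities under the frozen reference model pi_0
+    """
+    # Log of the ratio of likelihood ratios appearing inside the bracket above.
+    h = (logp_pos - logp_pos_ref) - (logp_neg - logp_neg_ref)
+    # Squared deviation from the target margin tau^{-1} / 2, averaged over the batch.
+    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()
+```
+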
+ +
+
+To test the effectiveness of IPO alignment, we finetune the project-out matrices $W^l$ defined above using (rank-1) LoRA
+
+Now we treat this result as the overall best performance that we can achieve with interventions. We want to see whether optimal interventions at the localized heads can achieve the same performance, and whether random heads can as well.
+ ++The connection to IPO lets us search for the best possible update to the model's weights. However, we are interested in localized edits to model representations. To continue, we need to connect the weight editing to representation editing. +
+
+Directly applying rank-1 LoRA to $W^l$, we can view the effect of adding the LoRA weight update as an edit to the representation as follows:
+
+$$
+\mathbf{r}_{\text{LoRA}}^{l+1} := \mathbf{r}^l + (W^l + \mathbf{b}^l (\mathbf{a}^l)^T)\mathbf{o}^l = \mathbf{r}_{\text{orig}}^{l+1} + \left< \mathbf{a}^l, \mathbf{o}^l\right> \mathbf{b}^l
+$$
+ +
+
+where $\mathbf{a}^l, \mathbf{b}^l$ are the LoRA weights to optimize. Comparing with the ITI intervention above, we see that $\mathbf{b}^l$ plays the role of the added $W^l \boldsymbol{\theta}^l$, and $\left< \mathbf{a}^l, \mathbf{o}^l\right>$ plays the role of the intervention strength $\alpha$, but adapted to the representation $\mathbf{o}^l$.
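+
+This identity is easy to check numerically; a minimal sketch with random matrices (shapes are illustrative):
+
+```python
+import numpy as np
+
+dim = 64                              # D*H, illustrative
+rng = np.random.default_rng(0)
+W = rng.normal(size=(dim, dim))       # project-out matrix W^l
+a = rng.normal(size=dim)              # LoRA vector a^l
+b = rng.normal(size=dim)              # LoRA vector b^l
+o = rng.normal(size=dim)              # concatenated head outputs o^l
+
+lhs = (W + np.outer(b, a)) @ o        # forward pass with the LoRA-modified weight
+rhs = W @ o + np.dot(a, o) * b        # original output plus the representation edit
+assert np.allclose(lhs, rhs)
+```
+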
+
+This formulation connects weight edits to representation edits. However, it doesn't yet allow us to localize edits to specific heads -- while $\boldsymbol{\theta}^l$ can be read as a concatenation of headwise intervention vectors, the projected vector $W^l \boldsymbol{\theta}^l$ (the role played by $\mathbf{b}^l$) has no such headwise interpretation. Therefore, we can't restrict the edits to specific heads by imposing structure on the $\mathbf{b}^l$'s.
+
+We can make the connection more direct by reparameterizing $\mathbf{b}^l$ as $W^l \mathbf{b}^l$ (without changing expressiveness):
+
+$$
+\mathbf{r}_{\text{LoRA-reparam}}^{l+1} := \mathbf{r}_{\text{orig}}^{l+1} + \left< \mathbf{a}^l, \mathbf{o}^l\right> W^l \mathbf{b}^l = \mathbf{r}_{\text{orig}}^{l+1} + \left< \mathbf{a}^l, \mathbf{o}^l\right> \sum_{h=1}^H W_h^l \mathbf{b}_h^l
+$$
+
+Here $\mathbf{b}_h^l$ plays the role of the intervention vector $\boldsymbol{\theta}_h^l$, and $\mathbf{a}^l$ sets the intervention strength adaptively.
+ ++Now we have the algorithm to find the optimal interventions for the chosen set of heads: +
+ ++1. Finetune the model weights using reparameterized LoRA with the IPO objective. +
+
+2. Restrict $\mathbf{b}^l$ to be nonzero only at the chosen set of heads (a minimal sketch of such a layer is given below).
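+
+A minimal sketch of what such a head-restricted, reparameterized rank-1 LoRA layer might look like (PyTorch). The module structure, names, and initialization are illustrative assumptions, not the actual implementation.
+
+```python
+import torch
+import torch.nn as nn
+
+class HeadLocalizedLoRA(nn.Module):
+    """Frozen project-out matrix W^l plus the reparameterized rank-1 edit
+    <a^l, o^l> * W^l b^l, with b^l masked to be nonzero only at the chosen heads."""
+
+    def __init__(self, W, head_dim, chosen_heads):
+        super().__init__()
+        dim = W.shape[0]
+        self.register_buffer("W", W)                          # (H*D, H*D), frozen
+        self.a = nn.Parameter(torch.randn(dim) / dim ** 0.5)  # a^l, small random init
+        self.b = nn.Parameter(torch.zeros(dim))               # b^l, zero init: start at W^l
+        mask = torch.zeros(dim)
+        for h in chosen_heads:
+            mask[h * head_dim:(h + 1) * head_dim] = 1.0       # unmask head h's block of b^l
+        self.register_buffer("mask", mask)
+
+    def forward(self, o):
+        # o: (..., H*D) concatenated head outputs o^l
+        strength = (o * self.a).sum(-1, keepdim=True)         # <a^l, o^l>, adaptive strength
+        edit = (self.mask * self.b) @ self.W.T                # W^l applied to the masked b^l
+        return o @ self.W.T + strength * edit                 # W^l o^l + <a^l, o^l> W^l b^l
+```
+
+Only `a` and `b` receive gradients during IPO finetuning; the mask keeps the edit confined to the chosen heads.
+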
+ ++We can now search for the best possible interventions at the localized heads. The earlier figure shows the result. We find that the optimal interventions strongly outperform the heuristic ITI interventions. Moreover, the localized interventions are about as effective as full IPO alignment! This appears to be the strongest edit-based evidence for localization that we could hope for. +
+ ++Now, we apply the same optimal edit procedure to 16 randomly selected heads. The figure above shows the results. In short: the optimal interventions at random heads are often just as effective as the optimal interventions at the localized heads. Accordingly, the fact that editing at the localized heads was effective at steering generations provides no evidence that the truthfulness concept is localized to those heads. +
+
+Further, the random heads we use here are the same random heads used earlier. Under the heuristic ITI interventions, the probing-localized heads looked very different from these random heads. But we now see that this appears to be an artifact of the suboptimal interventions and choice of metric, rather than a meaningful difference in how the heads relate to truthfulness.
+
+It is now clear that editing does not provide strong evidence for localization in the 16-head setup. However, a possible way of saving localization would be to argue that 16 heads is too many, giving too much leeway to induce any behavior we want with editing. For example, if we edited half the heads of the model, it would not be surprising if we could make the model do anything we wanted. Accordingly, we might hope that there is still a valid syllogism of the form: "the localized edit is extremely constrained" and "edits at this location optimally control the target behavior" together imply "the target behavior is localized to this location".
+
+To test this, we now focus on the single-head case. The procedure is simple: we randomly sample 24 single heads, one at a time, and search for the optimal intervention at each. The distribution of the best Truth*Info scores is shown in the figure above. We find 5 single heads that are just as effective, and none of them has high probing accuracy. Notice that, still, none of these heads can be understood as localizing the truthfulness concept. The reason is that there are multiple distinct locations that work equally well! That is, even in the extreme case of a very localized edit that replicates the target behavior essentially optimally, we still cannot conclude that there is evidence supporting localization.
+
+So far, what we mean by localization is that we can change the model's generations with respect to the target concept by an edit at this location. Our experiments show no evidence for this type of localization, and the probing-localized heads play no special role.
+ ++So, are the probing-localized heads anything special at all? +
+ ++We do observe that these heads achieve slightly better Multiple-Choice (MC) scores compared to randomly selected heads (see figures below), although this advantage is not as pronounced as with the ITI interventions (see earlier figures). Thus, these heads may be special in terms of changing model probabilities on the given fixed dataset, which is what MC measures. +
+ +
+
+It's important to note that the model's probabilities for fixed responses do not directly correspond to what the model actually generates. Even if the model assigns a higher probability to a truthful response than to an untruthful one, it may still not generate the truthful response if the fixed dataset is off-policy (i.e., both probabilities are low). This highlights the well-known gap between what a model "knows" (which is the motivation behind probing) and what it ultimately generates
+
+It's possible that while the probing-localized heads are not special at all for controlling model generations, they are special in changing what the model "knows". However, we caution that the results here are not rigorous evidence for localization even in this sense. Even if there is knowledge localization in some sense, it is clear that this does not inform steering, and does not give a way of monitoring model behavior (because changes at completely unrelated locations can change the behavior). This points to the need for making the goal of localization precise.
+ ++The main idea in this paper is that to assess the localization of a behavior we should study the effect of the optimal intervention at the conjectured localization. The main obstacle is that, in general, it is not clear how to define or find the optimal intervention. To overcome this, we map the problem of finding the optimal intervention to the problem of finding the optimal weight update, which can be solved using existing LLM alignment methods. +
+ ++The main result is an example where, naively, the evidence for localization appears strong, but when we use optimal interventions, the evidence disappears. +
+ ++The particular example—truthfulness and ITI-based evidence—was selected simply because the data used to define the heuristic happens to also allow us to set up a contrastive alignment problem. The most limited read of the results here is that ITI interventions do not provide evidence for localization, and that truthfulness does not appear to be localizable. However, the broader point is that by giving an example where editing-based evidence doesn't support localization, we see that in general such edits—by themselves—cannot provide evidence for localization. This is true irrespective of the particular behavior or heuristic being evaluated. +
+
+Thus far, we've been a bit vague about what localization means. Editing does provide tautological evidence for localization in the sense of "it's possible to modify such-and-such a behavior by an edit at this location". On the opposite end, the strongest possible standard would be to show that the location is unique, or at least necessary. This is the standard that would be required if our aim was, e.g., to establish that LLM truthfulness can be monitored by examining a small set of heads. Potentially, there are interesting and useful notions of localization in between these two extremes. However, we can see no useful sense of localization that is consistent with the location being only as good as a randomly selected alternative. As we have seen, heuristic edit-based evaluation cannot even rule out this case.
+ +
+Our findings add to a growing body of work that assesses the validity of interpretability results.
+
+Overall, the results here point to the need for precise statements of what the objectives of interpretability are. With clear objectives, it may be possible to develop theoretically grounded methods for evaluation. Precise, falsifiable statements and clear standards of evidence would suffice to prevent the kind of failure we observe in this paper.
+ +
+We use the TruthfulQA dataset
+
+We use the IPO objective
+We reuse code from ITI