
Goal sample in FB loss #15

Open
Simon-Reif opened this issue Feb 11, 2025 · 8 comments
Labels
question Further information is requested

Comments

@Simon-Reif

In the code the observation at the next timestep is used as the "goal" sample for the FB loss (fb_cpr/agent.py line 192, fb/agent.py line 160). That seems to contradict the algorithm in the paper (p. 21, line 21) and I don't really understand how this can still work.

@Miffyli
Contributor

Miffyli commented Feb 12, 2025

Hey! I am not quite sure what the confusion is (although I guess the "goal" term here is partially to blame). The goal state is the true future state of s, for which we want to maximize the successor measure (the product of the forward and backward embeddings). This ends up being the diagonal when the matmul is computed over the whole batch. All off-diagonal samples are then used as the negative pairs (i.e., s likely did not lead to s'), and for those the successor measure should be small, so it ends up being minimized.

Did that clarify/explain the situation?
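To make the diagonal/off-diagonal structure concrete, here is a minimal NumPy sketch (illustrative shapes and variable names, not the repository's actual PyTorch code in fb_cpr/agent.py): within a batch, B is evaluated on the next observations, and the matmul F(s, z) @ B(s')^T yields an (n x n) matrix whose diagonal entries are the positive pairs and whose off-diagonal entries are the negatives.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                       # batch size, embedding dimension (illustrative)
F = rng.standard_normal((n, d))   # forward embeddings F(s_i, z)
B = rng.standard_normal((n, d))   # backward embeddings B(s'_i), s'_i = next obs

M = F @ B.T                       # (n, n) successor-measure estimates
positives = np.diag(M)            # M[i, i]: s_i really led to s'_i
mask = ~np.eye(n, dtype=bool)
negatives = M[mask]               # M[i, j], i != j: s_i likely did not lead to s'_j

assert positives.shape == (n,)
assert negatives.shape == (n * n - n,)
```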

@Miffyli Miffyli added the question Further information is requested label Feb 12, 2025
@Simon-Reif
Author

Thanks a lot for the quick response! :D
I think I understand what I overlooked now and the code indeed fits the paper.

What I don't quite understand is the interpretation of the off-diagonal "goals" as negative samples. The way I understand it, the first (off-diagonal) term of the FB loss is a kind of temporal-difference loss with "goals" being arbitrary possible future states, so that the successor measure defined by F, B is moved closer to the discounted expected successor measure at the next time step.

@Miffyli
Contributor

Miffyli commented Feb 13, 2025

Ah yes, that is a good question! Notice that there are three "FB" terms in the equation in question, two using "negative" pairs and one using "positive" pairs.

[Image: the FB loss equation from the paper]

1) The first FB term is to minimize the successor measure M for negative pairs.
2) The second term is, as you pointed out, the TD-like connection between successive states. I personally do not have a good answer for why this is done with negative pairs instead of positive pairs (iirc it had something to do with how the terms work out when you derive the loss).
3) The third term is to maximize the successor measure for positive pairs.

Edit: Above is complete potato. There are only two terms (the TD-like loss of consecutive steps + increasing the value for positive pairs).
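The two surviving terms can be sketched as follows (a hedged NumPy sketch of the empirical loss as described in this thread; the function name, the stop-gradient handling via a separate target array, and the off-diagonal masking are illustrative assumptions, not the repository's exact code):

```python
import numpy as np

def fb_loss(F, F_next_sg, B, gamma=0.98):
    """Two-term FB loss sketch: TD-like term over (s_i, goal_j) pairs
    plus a term pushing up the measure for observed transitions."""
    n = F.shape[0]
    M = F @ B.T                    # successor-measure matrix at time t
    M_next = F_next_sg @ B.T       # bootstrap target (treated as stop-gradient)
    off = ~np.eye(n, dtype=bool)
    td = np.mean((M - gamma * M_next)[off] ** 2)   # TD term over negative pairs
    pos = -2.0 * np.mean(np.diag(M))               # maximize for positive pairs
    return td + pos

rng = np.random.default_rng(0)
n, d = 8, 4
loss = fb_loss(rng.standard_normal((n, d)),
               rng.standard_normal((n, d)),
               rng.standard_normal((n, d)))
assert np.isfinite(loss)
```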

@Simon-Reif
Author

I'm a bit confused.
The first two terms are being squared, so I don't see how their sign matters. :/
The way I understood this is that the first line is the TD-like term, supposed to move the successor measure at time i to the discounted, expected successor measure at time i+1.
And that should hold for any z and "goal" state.
The second line is to maximize the successor measure for the observed transition and I guess this should hold for any z since the policy doesn't matter here.

By "positive pairs" do you mean observed consecutive states in F and B? My intuition is that without arbitrary "goal states" in the TD loss (but only as negative samples), this would only learn something like transition probabilities.

Anyways, great that you guys are so active in responding to questions. :)

@Miffyli
Contributor

Miffyli commented Feb 14, 2025

Ah yes you are right (the squaring invalidates what I said completely). Teaches me to pay proper attention and check my notes on these topics before answering 😅 . Indeed it only has the TD-like loss using the negative pairs + positive loss.

I pinged the master of FBs to give you a solid answer on this instead of more confusion. Stand by!

@ahmed-touati

ahmed-touati commented Feb 14, 2025

hey @Simon-Reif, thanks for your interest, indeed your intuition is correct:

  • the squared term is the TD term, where we back-propagate "our guess" of the successor measure at state $$s_{t+1}$$ to $$s_{t}$$.
  • the second line maximizes the value for the observed transition.

Algebraically speaking, the fb loss provides the same gradient as the Bellman-like loss

$$ || F_z^\top B - P / \rho - \gamma P^{\pi_z} \text{SG}(F_z^\top B) ||_{\rho}^2 $$

where the norm $$|| \cdot ||_{\rho}^2$$ is the $$\rho$$-weighted squared norm, $$\rho$$ is the data distribution, and $$\text{SG}$$ is the stop-gradient. Developing the loss further, we obtain

$$|| F_z^\top B - P / \rho - \gamma P^{\pi_z} \text{SG}(F_z^\top B) ||_{\rho}^2 = || F_z^\top B - \gamma P^{\pi_z} \text{SG}(F_z^\top B) ||_{\rho}^2 - 2 \langle F_z^\top B, P \rangle_{\rho} + \text{constant term},$$

where $$\text{constant term}$$ doesn't depend on the optimization parameters of F, B.

You can see that the first term of the actual empirical FB loss corresponds to the bootstrap term $$|| F_z^\top B - \gamma P^{\pi_z} \text{SG}(F_z^\top B) ||_{\rho}^2$$, while the second term corresponds to the inner-product term with the transition kernel $$P$$.

When $$\gamma = 0$$, the FB loss becomes a "standard" contrastive loss, where the squared term $$|| F^\top B ||_{\rho}^2$$ is now a "repulsive" term that minimizes the alignment between the representations of negative pairs ($$s$$, $$s'$$ i.i.d.), pushing the inner product toward zero, and the second term is an "attractive" term that pushes up the alignment between the representations of positive pairs ($$s'$$ is the next state of $$s$$). This case can be useful, for example, if you are interested in learning a low-rank approximation of the transition kernel $$P$$.
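The γ = 0 contrastive case can be written down in a few lines (a NumPy sketch under the thread's assumptions; all shapes and names are illustrative): the bootstrap target vanishes, leaving a repulsive squared term on negative pairs and an attractive linear term on positive pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4
F = rng.standard_normal((n, d))   # forward embeddings
B = rng.standard_normal((n, d))   # backward embeddings of next states
M = F @ B.T

off = ~np.eye(n, dtype=bool)
repulsive = np.mean(M[off] ** 2)          # pushes F(s)^T B(s') toward 0 for iid pairs
attractive = -2.0 * np.mean(np.diag(M))   # pushes up F(s)^T B(s') for true next states
contrastive_loss = repulsive + attractive
assert np.isfinite(contrastive_loss)
```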

feel free to ask if you have further questions.

@Simon-Reif
Author

@Miffyli no worries, you cleared up my confusion about how the algorithm was implemented in the code. :)
@ahmed-touati ah, good to hear that I had roughly the right understanding there. Thanks for the detailed response! At the moment I don't have further questions, but I guess I'll ask here in the Github issues when something comes up.

@Miffyli Miffyli closed this as completed Feb 18, 2025
@skylooop

skylooop commented Feb 19, 2025

Hello @ahmed-touati, thanks for the answer regarding the FB loss. But I do not see why s+ from the paper (for B) can't be a future state on the same trajectory (i.e., if we have an offline dataset where trajectories are meaningful and s+ is reachable from s). Based on your answers, it is not clear to me whether using the next observation as the goal is crucial for FB training. If not, how does performance change when sampling the goal for B from later in the trajectory?

In most task-agnostic representation learning algorithms (e.g. ICVF, "Reinforcement Learning from Passive Data via Latent Intentions", Ghosh et al.), goals are sampled with some probability either at random from the dataset or at random from the same trajectory.
Maybe I am misunderstanding something.
Thanks!

@Miffyli Miffyli reopened this Feb 19, 2025