
Goal sample in FB loss #15

Open
Simon-Reif opened this issue Feb 11, 2025 · 8 comments
Labels
question Further information is requested

Comments

@Simon-Reif

In the code the observation at the next timestep is used as the "goal" sample for the FB loss (fb_cpr/agent.py line 192, fb/agent.py line 160). That seems to contradict the algorithm in the paper (p. 21, line 21) and I don't really understand how this can still work.

@Miffyli
Contributor

Miffyli commented Feb 12, 2025

Hey! I am not quite sure what the confusion is (although I guess the "goal" term here is partially to blame). The goal state is the true future state of s, for which we want to maximize the successor measure (the product of the forward and backward embeddings). This ends up being the diagonal when the matmul is computed over the whole batch. All off-diagonal samples are then used as the negative pairs (i.e., s likely did not lead to s'), and for those the successor measure should be small, so it ends up being minimized.

Did that clarify/explain the situation?
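To make the diagonal/off-diagonal structure concrete, here is a minimal NumPy sketch (illustrative shapes and variable names, not the repository's actual PyTorch code in fb_cpr/agent.py): within a batch, B is evaluated on the next observations, and the matmul F(s, z) @ B(s')^T yields an (n x n) matrix whose diagonal entries are the positive pairs and whose off-diagonal entries are the negatives.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                       # batch size, embedding dimension (illustrative)
F = rng.standard_normal((n, d))   # forward embeddings F(s_i, z)
B = rng.standard_normal((n, d))   # backward embeddings B(s'_i), s'_i = next obs

M = F @ B.T                       # (n, n) successor-measure estimates
positives = np.diag(M)            # M[i, i]: s_i really led to s'_i
mask = ~np.eye(n, dtype=bool)
negatives = M[mask]               # M[i, j], i != j: s_i likely did not lead to s'_j

assert positives.shape == (n,)
assert negatives.shape == (n * n - n,)
```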

@Miffyli Miffyli added the question Further information is requested label Feb 12, 2025
@Simon-Reif
Author

Thanks a lot for the quick response! :D
I think I understand what I overlooked now and the code indeed fits the paper.

What I don't quite understand is the interpretation of the off-diagonal "goals" as negative samples. The way I understand it, the first (off-diagonal) term of the FB loss is a kind of temporal-difference loss with "goals" being arbitrary possible future states, so that the successor measure defined by F, B is moved closer to the discounted expected successor measure at the next time step.

@Miffyli
Contributor

Miffyli commented Feb 13, 2025

Ah yes, that is a good question! Notice that there are three "FB" terms in the equation in question, two using "negative" pairs and one using "positive" pairs.

[Image: the FB loss equation from the paper]

1) The first FB term is to minimize the successor measure M for negative pairs.
2) The second term is, as you pointed out, the TD-like connection between successive states. I personally do not have a good answer for why this is done with negative pairs instead of positive pairs (iirc it had something to do with how the terms work out when you derive the loss).
3) The third term is to maximize the successor measure for positive pairs.

Edit: Above is complete potato. There are only two terms (the TD-like loss of consecutive steps + increasing the value for positive pairs).
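The two surviving terms can be sketched as follows (a hedged NumPy sketch of the empirical loss as described in this thread; the function name, the stop-gradient handling via a separate target array, and the off-diagonal masking are illustrative assumptions, not the repository's exact code):

```python
import numpy as np

def fb_loss(F, F_next_sg, B, gamma=0.98):
    """Two-term FB loss sketch: TD-like term over (s_i, goal_j) pairs
    plus a term pushing up the measure for observed transitions."""
    n = F.shape[0]
    M = F @ B.T                    # successor-measure matrix at time t
    M_next = F_next_sg @ B.T       # bootstrap target (treated as stop-gradient)
    off = ~np.eye(n, dtype=bool)
    td = np.mean((M - gamma * M_next)[off] ** 2)   # TD term over negative pairs
    pos = -2.0 * np.mean(np.diag(M))               # maximize for positive pairs
    return td + pos

rng = np.random.default_rng(0)
n, d = 8, 4
loss = fb_loss(rng.standard_normal((n, d)),
               rng.standard_normal((n, d)),
               rng.standard_normal((n, d)))
assert np.isfinite(loss)
```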

@Simon-Reif
Author

I'm a bit confused.
The first two terms are being squared, so I don't see how their sign matters. :/
The way I understood this is that the first line is the TD-like term, supposed to move the successor measure at time i to the discounted, expected successor measure at time i+1.
And that should hold for any z and "goal" state.
The second line is to maximize the successor measure for the observed transition and I guess this should hold for any z since the policy doesn't matter here.

By "positive pairs" do you mean observed consecutive states in F and B? My intuition is that without arbitrary "goal states" in the TD loss (but only as negative samples), this would only learn something like transition probabilities.

Anyways, great that you guys are so active in responding to questions. :)

@Miffyli
Contributor

Miffyli commented Feb 14, 2025

Ah yes you are right (the squaring invalidates what I said completely). Teaches me to pay proper attention and check my notes on these topics before answering 😅 . Indeed it only has the TD-like loss using the negative pairs + positive loss.

I pinged the master of FBs to give you a solid answer on this instead of more confusion. Stand by!

@ahmed-touati

ahmed-touati commented Feb 14, 2025

hey @Simon-Reif, thanks for your interest, indeed your intuition is correct:

  • the squared term is the TD term, where we back-propagate "our guess" of the successor measure at state $$s_{t+1}$$ to $$s_{t}$$.
  • the second line maximizes the value for the observed transition.

Algebraically speaking, the fb loss provides the same gradient as the Bellman-like loss

$$ || F_z^\top B - P / \rho - \gamma P^{\pi_z} \text{SG}(F_z^\top B) ||_{\rho}^2 $$

where the norm $$|| \cdot ||_{\rho}^2$$ is the $$\rho$$-weighted squared norm, $$\rho$$ is the data distribution, and $$\text{SG}$$ is the stop-gradient. Developing the loss further, we obtain

$$|| F_z^\top B - P / \rho - \gamma P^{\pi_z} \text{SG}(F_z^\top B) ||_{\rho}^2 = || F_z^\top B - \gamma P^{\pi_z} \text{SG}(F_z^\top B) ||_{\rho}^2 - 2 \langle F_z^\top B, P \rangle_{\rho} + \text{constant term},$$

where $$\text{constant term}$$ doesn't depend on the optimization parameters of F, B.

You can see that the first term of the actual empirical FB loss corresponds to the bootstrap term $$|| F_z^\top B - \gamma P^{\pi_z} \text{SG}(F_z^\top B) ||_{\rho}^2$$, while the second term corresponds to the inner-product term with the transition kernel $$P$$.

When $$\gamma = 0$$, the FB loss becomes a "standard" contrastive loss, where the squared term $$|| F^\top B ||_{\rho}^2$$ is now a "repulsive" term that minimizes the alignment between the representations of negative pairs ($$s$$, $$s'$$ i.i.d.), pushing the inner product toward zero, and the second term is an "attractive" term that pushes up the alignment between the representations of positive pairs ($$s'$$ is the next state of $$s$$). This case can be useful, for example, if you are interested in learning a low-rank approximation of the transition kernel $$P$$.
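The γ = 0 contrastive case can be written down in a few lines (a NumPy sketch under the thread's assumptions; all shapes and names are illustrative): the bootstrap target vanishes, leaving a repulsive squared term on negative pairs and an attractive linear term on positive pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4
F = rng.standard_normal((n, d))   # forward embeddings
B = rng.standard_normal((n, d))   # backward embeddings of next states
M = F @ B.T

off = ~np.eye(n, dtype=bool)
repulsive = np.mean(M[off] ** 2)          # pushes F(s)^T B(s') toward 0 for iid pairs
attractive = -2.0 * np.mean(np.diag(M))   # pushes up F(s)^T B(s') for true next states
contrastive_loss = repulsive + attractive
assert np.isfinite(contrastive_loss)
```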

feel free to ask if you have further questions.

@Simon-Reif
Author

@Miffyli no worries, you cleared up my confusion about how the algorithm was implemented in the code. :)
@ahmed-touati ah, good to hear that I had roughly the right understanding there. Thanks for the detailed response! At the moment I don't have further questions, but I guess I'll ask here in the Github issues when something comes up.

@Miffyli Miffyli closed this as completed Feb 18, 2025
@skylooop

skylooop commented Feb 19, 2025

Hello @ahmed-touati, thanks for the answer regarding the FB loss. But I do not see why s+ from the paper (for B) can't be a future state on the same trajectory (i.e., if we have an offline dataset where trajectories are meaningful and s+ is reachable from s). Based on your answers, it is not clear to me whether using the next observation as the goal is crucial for FB training. If not, how does performance change when sampling the goal for B from later in the trajectory?

In most task-agnostic representation learning algorithms (e.g. ICVF, "Reinforcement Learning from Passive Data via Latent Intentions", Ghosh et al.), goals are sampled with some probability either at random from the dataset or at random from the same trajectory.
Maybe I am misunderstanding something.
Thanks!

@Miffyli Miffyli reopened this Feb 19, 2025