Goal sample in FB loss #15
Comments
Hey! I am not quite sure what the confusion is (although I guess the "goal" term here is partially to blame). Did that clarify/explain the situation?
Thanks a lot for the quick response! :D What I don't quite understand is the interpretation of the off-diagonal "goals" as negative samples. The way I understand it, the first (off-diagonal) term of the FB loss is a kind of temporal-difference loss with "goals" being arbitrary possible future states, so that the successor measure defined by F and B is moved closer to the discounted expected successor measure at the next time step.
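(For readers following the thread: the measure-level Bellman identity that this TD view targets can be written as below. This is my rendering of the notation used in the FB papers, with $\rho$ the marginal state distribution of the dataset.)

$$M^{\pi_z}(s_t, a_t, X) = P(X \mid s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim P(\cdot \mid s_t, a_t)}\big[M^{\pi_z}(s_{t+1}, \pi_z(s_{t+1}), X)\big],$$

with the FB parametrization $M^{\pi_z}(s, a, \mathrm{d}s_+) \approx F(s, a, z)^\top B(s_+)\,\rho(\mathrm{d}s_+)$, so the "goals" $s_+$ in the off-diagonal term are simply the points at which this measure equality is enforced.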
I'm a bit confused. By "positive pairs" do you mean observed consecutive states fed to F and B? My intuition is that without arbitrary "goal states" in the TD loss (i.e. with them appearing only as negative samples), this would only learn something like transition probabilities. Anyways, great that you guys are so active in responding to questions. :)
Ah yes, you are right (the squaring invalidates what I said completely). Teaches me to pay proper attention and check my notes on these topics before answering 😅. Indeed it only has the TD-like loss using the negative pairs plus the positive loss. I pinged the master of FBs to give you a solid answer on this instead of more confusion. Stand by!
hey @Simon-Reif, thanks for your interest, indeed your intuition is correct:
Algebraically speaking, the FB loss provides the same gradient as the Bellman-like loss

$$\mathcal{L}(F, B) = \mathbb{E}_{(s_t, a_t) \sim \rho}\left\| F(s_t, a_t, z)^\top B(\cdot)\,\rho - P(\cdot \mid s_t, a_t) - \gamma\, \mathbb{E}_{s_{t+1} \sim P(\cdot \mid s_t, a_t)}\big[\bar F(s_{t+1}, \pi_z(s_{t+1}), z)^\top \bar B(\cdot)\,\rho\big] \right\|^2_{\rho},$$

where the norm $\|\mu\|^2_\rho = \int \big(\tfrac{\mathrm{d}\mu}{\mathrm{d}\rho}(s_+)\big)^2 \rho(\mathrm{d}s_+)$ is taken with respect to the data distribution $\rho$, and $\bar F, \bar B$ denote target networks. Expanding the square and dropping the terms that do not depend on $F$ or $B$ leaves

$$\mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho,\, s_+ \sim \rho}\Big[\big(F(s_t, a_t, z)^\top B(s_+) - \gamma\, \bar F(s_{t+1}, \pi_z(s_{t+1}), z)^\top \bar B(s_+)\big)^2\Big] - 2\,\mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\big[F(s_t, a_t, z)^\top B(s_{t+1})\big].$$

You can see that the first term of the actual empirical FB loss corresponds to the bootstrap term, evaluated at goals $s_+$ sampled from $\rho$, while the second term comes from the cross term with $P(\cdot \mid s_t, a_t)$: integrating $B$ against the transition kernel is exactly an expectation over the next observation, which is why $s_{t+1}$ shows up in the role of the "goal" there. Feel free to ask if you have further questions.
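(To make the two terms concrete, here is a minimal PyTorch-style sketch of the loss as discussed above. It is illustrative only, not the repository's actual code: `F_net`, `B_net`, the target networks, the `batch` fields, and the default `gamma` are hypothetical names, and details such as the orthonormality/regularization terms of the real agent are omitted.)

```python
import torch

def fb_loss(F_net, B_net, F_target, B_target, actor, batch, z, gamma=0.98):
    """Sketch of the two-term FB loss.

    batch.obs, batch.action, batch.next_obs hold consecutive transitions
    (s_t, a_t, s_{t+1}). The "goal" states s_+ are the other next
    observations in the batch, so every off-diagonal pair (i, j), i != j,
    acts as a sample s_+ ~ rho.
    """
    # F(s_t, a_t, z) and B(s_+): both [batch, d]
    F = F_net(batch.obs, batch.action, z)
    B = B_net(batch.next_obs)

    with torch.no_grad():
        next_action = actor(batch.next_obs, z)
        F_next = F_target(batch.next_obs, next_action, z)
        B_tgt = B_target(batch.next_obs)

    # M[i, j] ~= F(s_i, a_i, z)^T B(s_j): current and bootstrapped estimates
    M = F @ B.T                # [batch, batch]
    M_next = F_next @ B_tgt.T  # [batch, batch], no gradient

    # Off-diagonal term: TD-style residual at arbitrary goals s_+ ~ rho
    off_diag = ~torch.eye(M.shape[0], dtype=torch.bool, device=M.device)
    td_term = (M - gamma * M_next)[off_diag].pow(2).mean()

    # Diagonal term: -2 E[F(s_t, a_t, z)^T B(s_{t+1})]; the next observation
    # fills the "goal" slot here because it arises from the cross term with P.
    positive_term = -2.0 * M.diagonal().mean()

    return td_term + positive_term
```

In this sketch the only place the next observation is required is the diagonal term; the off-diagonal goals could in principle come from any distribution with full support, which is what the discussion below is about.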
@Miffyli no worries, you cleared up my confusion about how the algorithm was implemented in the code. :)
Hello @ahmed-touati, thanks for the answer regarding the FB loss. But I do not see why s_+ from the paper (the input to B) can't be a future state on the trajectory (i.e. if we have an offline dataset where trajectories are meaningful and s_+ is reachable from s). Based on your answers, it is not clear to me whether using the next observation as the goal is crucial for FB training. If it is not, how does performance change when sampling the goal for B from the trajectory? In most task-agnostic representation learning algorithms (e.g. ICVF, "Reinforcement Learning from Passive Data via Latent Intentions", Ghosh et al.), goals are sampled with some probability either at random from the dataset or randomly from the trajectory.
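(For context, the trajectory-based goal sampling mentioned above usually looks something like the sketch below. This is a generic illustration of that alternative, not code from this repository; the function name, arguments, and defaults are made up.)

```python
import numpy as np

def sample_goal_index(traj_len, t, p_random=0.3, gamma=0.98, rng=None):
    """Pick a goal index for state t of a trajectory of length traj_len.

    With probability p_random the goal is a uniformly random state of the
    trajectory; otherwise it is a future state at a geometrically distributed
    offset, so nearby future states are favoured (a common ICVF/HER-style choice).
    """
    rng = rng or np.random.default_rng()
    if rng.random() < p_random:
        return int(rng.integers(0, traj_len))
    # geometric offset >= 1, clipped to stay inside the trajectory
    offset = int(rng.geometric(1.0 - gamma))
    return min(t + offset, traj_len - 1)
```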
In the code the observation at the next timestep is used as the "goal" sample for the FB loss (fb_cpr/agent.py line 192, fb/agent.py line 160). That seems to contradict the algorithm in the paper (p. 21, line 21) and I don't really understand how this can still work.