Question about encoder choice for downstream tasks
In the video classification evaluation code, I noticed that the target encoder (y-encoder) is being used for downstream tasks instead of the context encoder (x-encoder). This seems different from other self-supervised learning approaches:
Most SSL methods, such as MoCo, SimCLR, and BYOL, use their main/query encoder for downstream tasks rather than the momentum/target encoder.
In V-JEPA, the y-encoder has stop_gradient applied during training, which intuitively suggests the x-encoder might be more suitable for downstream tasks since it learns to predict comprehensive features from partial information.
Looking at the implementation, I noticed the following checkpoint-loading logic (paraphrased here, since the exact snippet may differ between repo versions):
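```python
import torch

# Sketch of the checkpoint-loading logic in the video classification eval
# (paraphrased; exact function and variable names may differ in the repo).
def load_pretrained(encoder, pretrained_path, checkpoint_key='target_encoder'):
    checkpoint = torch.load(pretrained_path, map_location='cpu')
    try:
        # default: load the momentum/target encoder weights
        pretrained_dict = checkpoint[checkpoint_key]
    except KeyError:
        # fallback: load the context encoder weights instead
        pretrained_dict = checkpoint['encoder']
    # strip DDP 'module.' prefixes before loading into the eval model
    pretrained_dict = {k.replace('module.', ''): v for k, v in pretrained_dict.items()}
    encoder.load_state_dict(pretrained_dict, strict=False)
    return encoder
```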
While the code primarily uses the target_encoder (through checkpoint_key), it seems there's a fallback option to use 'encoder'. This suggests that using the context encoder might still be possible, though not prioritized.
I'm curious about:
Was there experimental evidence showing that the target encoder consistently performs better than the context encoder for downstream tasks?
If so, was this the reason for prioritizing the target encoder in the implementation?
Are there specific characteristics of V-JEPA that make the target encoder more suitable for downstream tasks, unlike other SSL approaches?
Would appreciate any insights into this design choice. Thanks!
I think the reason they use the target encoder is that in this paper the downstream tasks of interest have access to the full input. They evaluate on tasks like action recognition and classification, where you feed in a full video (or image), and the target encoder is better at encoding the full information. If the goal were instead a downstream task like video generation or reconstruction, you would use the context encoder together with the predictor. It is a somewhat unusual setup because the model they actually deploy (the target encoder) is never trained directly at all (it's a momentum-updated copy of the context encoder), and they essentially discard the models they did actively train (the context encoder and the predictor).
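For concreteness, the target encoder in this family of methods is maintained as an exponential moving average (EMA) of the context encoder's weights rather than being trained by backprop. A minimal sketch of that update (not the repo's exact code; a stand-in nn.Linear replaces the real ViT encoder, and the momentum value is illustrative):

```python
import copy
import torch
import torch.nn as nn

# Stand-in modules: the context encoder is trained by backprop,
# the target encoder is a frozen copy updated only via EMA.
context_encoder = nn.Linear(16, 16)
target_encoder = copy.deepcopy(context_encoder)
for p in target_encoder.parameters():
    p.requires_grad = False  # stop-gradient: never trained directly

momentum = 0.998  # illustrative EMA coefficient; real schedules vary

@torch.no_grad()
def update_target_encoder():
    # target <- m * target + (1 - m) * context, applied parameter-wise
    for p_c, p_t in zip(context_encoder.parameters(),
                        target_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)
```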
They do have a paper (posted in December 2024) where they train a much smaller version for video prediction on Moving MNIST, and there they test using the context (input) encoder and the predictor.