Question about encoder choice for downstream tasks
In the video classification evaluation code, I noticed that the target encoder (y-encoder) is being used for downstream tasks instead of the context encoder (x-encoder). This seems different from other self-supervised learning approaches:
Most SSL methods, such as MoCo, SimCLR, and BYOL, use their main/query encoder for downstream tasks rather than the momentum/target encoder.
In V-JEPA, the y-encoder has stop_gradient applied during training, which intuitively suggests the x-encoder might be more suitable for downstream tasks since it learns to predict comprehensive features from partial information.
Looking at the implementation, I noticed the following checkpoint-loading logic (paraphrased here, since the exact snippet may differ between repo versions):
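```python
import torch

# Sketch of the checkpoint-loading logic in the video classification eval
# (paraphrased; exact function and variable names may differ in the repo).
def load_pretrained(encoder, pretrained_path, checkpoint_key='target_encoder'):
    checkpoint = torch.load(pretrained_path, map_location='cpu')
    try:
        # default: load the momentum/target encoder weights
        pretrained_dict = checkpoint[checkpoint_key]
    except KeyError:
        # fallback: load the context encoder weights instead
        pretrained_dict = checkpoint['encoder']
    # strip DDP 'module.' prefixes before loading into the eval model
    pretrained_dict = {k.replace('module.', ''): v for k, v in pretrained_dict.items()}
    encoder.load_state_dict(pretrained_dict, strict=False)
    return encoder
```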
While the code primarily uses the target_encoder (through checkpoint_key), it seems there's a fallback option to use 'encoder'. This suggests that using the context encoder might still be possible, though not prioritized.
I'm curious about:
Was there experimental evidence showing that the target encoder consistently performs better than the context encoder for downstream tasks?
If so, was this the reason for prioritizing the target encoder in the implementation?
Are there specific characteristics of V-JEPA that make the target encoder more suitable for downstream tasks, unlike other SSL approaches?
Would appreciate any insights into this design choice. Thanks!
I think the reason they use the target encoder is that in this paper the downstream tasks of interest have access to the full input. They evaluate on tasks like action recognition and classification, where you feed in a full video (or image), and the target encoder is better at encoding the full information. If the goal were instead a downstream task like video generation or reconstruction, you would use the context encoder together with the predictor. It is a somewhat unusual setup because the model they actually deploy (the target encoder) is never trained directly at all (it's a momentum-updated copy of the context encoder), and they essentially discard the models they did actively train (the context encoder and the predictor).
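For concreteness, the target encoder in this family of methods is maintained as an exponential moving average (EMA) of the context encoder's weights rather than being trained by backprop. A minimal sketch of that update (not the repo's exact code; a stand-in nn.Linear replaces the real ViT encoder, and the momentum value is illustrative):

```python
import copy
import torch
import torch.nn as nn

# Stand-in modules: the context encoder is trained by backprop,
# the target encoder is a frozen copy updated only via EMA.
context_encoder = nn.Linear(16, 16)
target_encoder = copy.deepcopy(context_encoder)
for p in target_encoder.parameters():
    p.requires_grad = False  # stop-gradient: never trained directly

momentum = 0.998  # illustrative EMA coefficient; real schedules vary

@torch.no_grad()
def update_target_encoder():
    # target <- m * target + (1 - m) * context, applied parameter-wise
    for p_c, p_t in zip(context_encoder.parameters(),
                        target_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)
```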
They do have a paper (posted in December 2024) where they train a much smaller version for video prediction on Moving MNIST, and there they test using the context (input) encoder and the predictor.