
Benchmark on encoder representation as comparison #1

Open · MarvinT opened this issue Oct 11, 2018 · 8 comments
MarvinT commented Oct 11, 2018

It would be nice to run an MLP on the encoder representation, to compare the representation learned by the unsupervised encoder against the full CPC model representation.

davidtellez commented Oct 14, 2018

I don't fully understand your question: there is just one encoder, trained via CPC, that encodes the patches. What do you mean by "the unsupervised encoder" and "the full CPC model representation"? Let me summarize what I did just to clarify:

  1. Trained the CPC model to distinguish between number sequences (this step trains the encoder that lives within the CPC model).
  2. Once the CPC model was trained, I read the encoder network only (discarding the rest of the CPC model).
  3. I took the encoder, froze its weights and added an MLP on top of it (see here). Then I trained this encoder+MLP to distinguish numbers. Because it achieved 90%, I concluded that the encoder learned useful features to describe the numbers.
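
To make step 3 concrete, here is a minimal Keras sketch of freezing the encoder and adding an MLP head on top. The model path, input shape, layer sizes and variable names are my assumptions for illustration, not the repo's actual code:

```python
# Sketch only: frozen CPC encoder + trainable MLP classifier (assumed names/shapes).
from keras.layers import Dense, Input
from keras.models import Model, load_model

# Load the encoder extracted from the trained CPC model (hypothetical path).
encoder = load_model('models/encoder_model.h5')
encoder.trainable = False                     # freeze the CPC-trained weights

x = Input(shape=(28, 28, 1))                  # patch shape is an assumption
features = encoder(x)                         # fixed, unsupervised representation
h = Dense(128, activation='relu')(features)   # small trainable MLP head
y = Dense(10, activation='softmax')(h)        # 10 digit classes

classifier = Model(x, y)
classifier.compile(optimizer='adam',
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])
# classifier.fit(x_train, y_train, ...) now only updates the MLP weights.
```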

I hope this clarifies your question a bit; please reply if you meant something else. Thanks for dropping by!

MarvinT commented Oct 14, 2018

Sorry, I misremembered the paper for some reason. I thought the network_encoder, or g_enc in the paper, was pretrained as a VAE, not that the whole network was trained end to end. I guess I'm interested in comparing the encoder network against the features learned by a VAE of similar architecture.

@davidtellez

I see, no problem. In the original paper, they compare CPC with other methods, though not with a VAE. I have some code for a VAE from another project, so I might run the experiment you mention if I get some free time. I'll keep you posted.

@N-Kingsley

I want to ask two questions.

  1. Does this code not apply equations (2) and (3)?
  2. The calculation of the 'loss' does not seem to be the NCE loss described in section 2.2 of the paper?

@davidtellez

Let's focus on equation (3) in section 2.2, the score f_k(x_{t+k}, c_t) = exp(z_{t+k}^T W_k c_t). It describes how to measure the quality of the predictions, and this is what happens:

  • The context c_t is mapped to a predicted image embedding using a linear layer with parameters W_k: W_k.c_t. Let's call the resulting vector p_{t+k}. This happens in my code here. Note that this can be any function of your choice, but a linear layer is used for simplicity.
  • The prediction p_{t+k} is compared with the vector embedding of the real image z_{t+k}. This comparison is done via dot product in the formula (that's why z_{t+k} is transposed). This operation produces a high value if both vectors are "similar" and a low value if they are "not similar". Because we get a similarity score for each k, I average the score across the temporal dimension. This happens in my code here.
  • An exponential operation is applied to the previous similarity score. I use a sigmoid to limit the values of the score to the [0, 1] range here.
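
As a purely illustrative sketch of those three steps (the variable names and shapes are my assumptions, not the code from this repo):

```python
import numpy as np

def similarity_score(c_t, z_future, W):
    """Illustrative only. c_t: (code_size,) context vector; z_future: (steps, code_size)
    encoder embeddings of the real future patches; W: (steps, code_size, code_size),
    one linear map W_k per predicted step."""
    # 1) Linear prediction p_{t+k} = W_k . c_t for every step k
    preds = np.einsum('kij,j->ki', W, c_t)               # (steps, code_size)
    # 2) Dot product with the real embeddings, averaged over the steps k
    score = np.mean(np.sum(preds * z_future, axis=-1))   # scalar similarity
    # 3) Sigmoid squashes the score into the [0, 1] range
    return 1.0 / (1.0 + np.exp(-score))
```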

At this point, we can measure the semantic similarity between our predictions and the actual data. Our data contains two kinds of sequences, i.e. two labels: positive labels correspond to sorted sequences and negative labels to non-sorted sequences.

For positive labels (sorted sequences), we want CPC to produce high similarity scores, in our case a 1. For negative labels (non-sorted sequences), we want low similarity scores, in our case a 0.

As they propose in the paper in section 2.3, all we need to do to train CPC is apply binary cross-entropy loss between the similarity scores and the labels, done here.
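
In other words, the loss reduces to standard binary cross-entropy between the sigmoid similarity score and the sequence label; in Keras that is simply loss='binary_crossentropy' at compile time. A sketch with assumed names, not the exact line from the repo:

```python
import numpy as np

def binary_cross_entropy(y_true, score, eps=1e-7):
    """y_true is 1 for sorted (positive) sequences and 0 for non-sorted (negative) ones;
    `score` is the sigmoid similarity computed above."""
    score = np.clip(score, eps, 1.0 - eps)  # avoid log(0)
    return -(y_true * np.log(score) + (1.0 - y_true) * np.log(1.0 - score))
```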

I hope this helps you understand my implementation. Beware that this is my own interpretation of the paper, which may or may not be completely correct.

@N-Kingsley

Oh, you explained it so clearly that I fully understand now. Thank you very much for your help.

@N-Kingsley

By the way, is equation (4) in section 2.3 the ‘binary_crossentropy’ in your code?

@babbu3682

Why did you use binary_crossentropy?
