@dan and I had some confusion (#22) about what exactly this embedding learns and the effect of averaging all 64 embeddings into one. So here I'm trying to reason through it. Please @weiji14 and @srmsoumya, check me. TL;DR: it's most probably safe to use the average. My understanding is that:
If the above is correct, it means that the embedding of each window has learned to predict the content of the surrounding windows, which makes most window embeddings similar to each other. That similarity limits how much information we lose when we average all 64 window embeddings into a single chip embedding. The similarity can be checked on the embeddings example @weiji14 shared; the minimum cosine similarity is … For example:
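Something along these lines could be used to check it (a minimal sketch, not the project's actual code; the array name, shape, and random placeholder data are mine, not taken from the shared file):

```python
# Sketch: how similar are the 64 window embeddings of one chip to each other,
# and to their average? Assumes `embeddings` has shape (64, embed_dim),
# e.g. loaded from the embeddings example @weiji14 shared.
import numpy as np

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(64, 768))  # placeholder for real window embeddings

# Normalize each window embedding to unit length.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Pairwise cosine similarity between all 64 windows (off-diagonal entries only).
pairwise = unit @ unit.T
off_diag = pairwise[~np.eye(len(unit), dtype=bool)]
print("min/mean pairwise cosine similarity:", off_diag.min(), off_diag.mean())

# Cosine similarity of every window against the averaged (chip-level) embedding.
chip = embeddings.mean(axis=0)
chip_unit = chip / np.linalg.norm(chip)
to_mean = unit @ chip_unit
print("min/mean similarity to the averaged embedding:", to_mean.min(), to_mean.mean())
```

If the windows are indeed highly similar, the averaged chip embedding sits close to every individual window embedding, which is what makes the averaging step relatively safe.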
@brunosan your understanding of how the encoder side of MAE works is spot on.

In our modified architecture, we are making two changes for now:

Transformers have a concept called the `cls` token, which is used to capture a generic vector representation of the input space (EO imagery in our case). This idea is borrowed from the BERT paper and is commonly used in Vision Transformers. We can choose to use the embeddings from the `cls` token, which represents what the…

For the example you shared, let's say we completely remove the small red building from the image and ask the MAE to recreate the original image. In this case, there is less chance of the MAE adding the building back to the image. Although we have absolute lat/lon and time information available as embeddings, we can expect the network to learn a general understanding of that information rather than very specific features. However, I might be totally wrong here. Training a large model might actually encode such granular features and be able to recreate them. We have to test and try that out to see.
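As a rough illustration of the two pooling options for getting a single chip embedding, here is a minimal sketch (the shapes and tensor names are made up for illustration, not taken from our actual model code; it only assumes the encoder returns the `cls` token first, followed by the patch tokens):

```python
# Sketch: cls-token pooling vs. averaging the patch/window embeddings,
# assuming an encoder output of shape (batch, 1 + num_patches, embed_dim).
import torch

batch, num_patches, embed_dim = 8, 64, 768
tokens = torch.randn(batch, 1 + num_patches, embed_dim)  # placeholder encoder output

# Option 1: use the cls token as the chip embedding (BERT/ViT style).
cls_embedding = tokens[:, 0, :]                 # (batch, embed_dim)

# Option 2: average the 64 patch/window embeddings into one chip embedding.
mean_embedding = tokens[:, 1:, :].mean(dim=1)   # (batch, embed_dim)
```

Either vector can then serve as the chip-level embedding; which one works better for EO imagery is exactly the kind of thing we would need to test.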