Run TIP-adapter on text2img retrieval instead #8

Open
adrielkuek opened this issue Sep 8, 2022 · 3 comments

Comments

@adrielkuek

Hi, thanks for the amazing work on adapters for CLIP. Currently the framework computes the affinities between the test query image and the cache keys, then uses them to obtain the corresponding few-shot label. This works well. I would like your advice on how I can extend this to text2img retrieval, where I query with a text search term and use the cache key-value adapter to return the corresponding images. Would it be as naive as doing text-to-text embedding affinity matching between the query text and the cache VALUES (instead of the keys), since they contain the ground-truth labels for the few-shot learning?
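
For reference, the image-to-key affinity step described above can be sketched roughly as follows. This is a minimal NumPy sketch, not the repository's code: the function name `cache_logits`, the toy 3-dimensional features, and the value `beta=5.5` are assumptions made for illustration, and the blending with CLIP's zero-shot logits is left out.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cache_logits(query_feats, cache_keys, cache_values, beta=5.5):
    """Hypothetical helper: query_feats (Q, D), cache_keys (N, D), cache_values (N, C)."""
    affinity = query_feats @ cache_keys.T        # (Q, N) cosine similarities
    weights = np.exp(-beta * (1.0 - affinity))   # emphasise the closest cached shots
    return weights @ cache_values                # (Q, C) label mass per class

# Toy cache: 3 classes, 2 shots each (stand-ins for CLIP image features).
cache_keys = l2_normalize([
    [1.0, 0.1, 0.0], [1.0, 0.0, 0.1],   # class 0
    [0.1, 1.0, 0.0], [0.0, 1.0, 0.1],   # class 1
    [0.0, 0.1, 1.0], [0.1, 0.0, 1.0],   # class 2
])
labels = np.array([0, 0, 1, 1, 2, 2])
cache_values = np.eye(3)[labels]        # one-hot few-shot labels

query = l2_normalize([[0.05, 0.1, 1.0]])  # a test image feature close to class 2
logits = cache_logits(query, cache_keys, cache_values)
pred = int(logits.argmax(axis=1)[0])
print(pred)
```

The one-hot values only route each key's affinity to its class, which is why the question of matching a text query against the values arises.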

@ZrrSkywalker
Collaborator

Thanks for your interest!
I suppose that if the query and the values are in the same embedding space, e.g., both are text features, the affinity can be computed directly between them to produce the final output, without the keys. Tip-Adapter uses the keys as representatives because the cached values are one-hot encodings, which do not live in the same space as the query image features.
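
One way to read this suggestion is sketched below, under assumptions that are not in the thread: each class has a text embedding (for example from CLIP's text encoder applied to a class prompt), the text query is scored against those class embeddings, and the one-hot cache values then pass each class score on to the cached images of that class. The helper `retrieve_images` and the toy features are hypothetical.

```python
import numpy as np

def retrieve_images(text_query, class_text_feats, cache_values, top_k=2):
    """Hypothetical helper for text-to-image retrieval through the cache values.

    text_query:       (D,)   normalised embedding of the search term
    class_text_feats: (C, D) normalised text embedding per class
    cache_values:     (N, C) one-hot labels of the N cached images
    """
    class_scores = class_text_feats @ text_query    # (C,) text-to-text affinity
    image_scores = cache_values @ class_scores      # (N,) each image inherits its class score
    order = np.argsort(-image_scores, kind="stable")  # best first, ties keep cache order
    return order[:top_k], image_scores

# Toy setup: 3 classes, 2 cached images per class.
labels = np.array([0, 0, 1, 1, 2, 2])
cache_values = np.eye(3)[labels]
class_text_feats = np.eye(3)                        # stand-ins for class text embeddings

text_query = np.array([0.1, 0.9, 0.2])
text_query = text_query / np.linalg.norm(text_query)

top_idx, image_scores = retrieve_images(text_query, class_text_feats, cache_values)
print(top_idx.tolist())
```

Images of the same class tie under this scheme, so a finer ranking within a class would need an extra signal, such as the similarity between the text query and the cached image keys.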

@adrielkuek
Author

@ZrrSkywalker thanks for the reply. Yes, I did a workaround using the text-to-cache-value affinity and then retrieving the images associated with the matching one-hot labels in the dataset. It works pretty neatly! I have another question, though. Would it be possible for the cache model to hold an unbalanced number of K-shot images? In some real-world usage we get a varying number of exemplar images per class for training. How would we build the cache-key matrix when K differs between classes?

@ZrrSkywalker
Collaborator

That is quite an insightful question. I tried varying K across categories on some datasets. Generally, a larger K leads to higher classification accuracy for the corresponding category. This can also be used to tackle some long-tail issues, e.g., setting a larger K for the sample-insufficient categories to balance the learned network.
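
As an illustration of how a cache with a different K per class could be assembled, here is a minimal NumPy sketch. The helper `build_cache` and the toy features are assumptions, not the repository's implementation. The idea is only that keys are stacked row by row, so nothing forces every class to contribute the same number of rows.

```python
import numpy as np

def build_cache(features_by_class, num_classes):
    """Hypothetical helper: stack per-class features into keys and one-hot values.

    features_by_class maps class id -> list of feature vectors (any number of shots).
    """
    keys, labels = [], []
    for class_id, feats in sorted(features_by_class.items()):
        feats = np.asarray(feats, dtype=np.float64)
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # unit-length keys
        keys.append(feats)
        labels.extend([class_id] * len(feats))
    cache_keys = np.concatenate(keys, axis=0)               # (N, D), N = total shots
    cache_values = np.eye(num_classes)[np.array(labels)]    # (N, C) one-hot labels
    return cache_keys, cache_values

# Toy example: class 0 has 1 shot, class 1 has 3 shots, class 2 has 2 shots.
features_by_class = {
    0: [[1.0, 0.1, 0.0]],
    1: [[0.0, 1.0, 0.1], [0.1, 1.0, 0.0], [0.0, 0.9, 0.2]],
    2: [[0.1, 0.0, 1.0], [0.0, 0.2, 1.0]],
}
cache_keys, cache_values = build_cache(features_by_class, num_classes=3)
shots_per_class = cache_values.sum(axis=0)
print(cache_keys.shape, shots_per_class.tolist())
```

Because the per-class output is a sum over that class's rows, a class with more shots contributes more terms to its score, which fits the observation that a larger K tends to favour the corresponding category.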
