I have a question regarding the scaling and balance between the CLIP logits and the cache logits in the Tip-Adapter implementation. Specifically, I'm looking at the following code:
clip_logits = 100. * val_features @ clip_weights
Here, val_features and clip_weights are L2-normalized vectors, so the resulting clip_logits lies in [-100, 100] due to the 100x scaling factor.

cache_logits = ((-1) * (beta - beta * affinity)).exp() @ cache_values
affinity = val_features @ cache_keys

val_features and cache_keys are also L2-normalized, so the affinity values range over [-1, 1].
The expression -(beta - beta * affinity) therefore lies in [-2*beta, 0], which, once exponentiated, yields values in [e^(-2*beta), 1], a subset of (0, 1].
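A quick numeric check of the range claimed above (a standalone sketch using NumPy rather than PyTorch; the beta value is purely illustrative):

```python
import numpy as np

beta = 5.5  # illustrative hyperparameter value, not taken from the repo
affinity = np.linspace(-1.0, 1.0, 5)  # affinity lies in [-1, 1]

# -(beta - beta * affinity) lies in [-2*beta, 0]
exponent = -(beta - beta * affinity)
activation = np.exp(exponent)  # lower bound e^(-2*beta), upper bound exactly 1

print(exponent.min(), exponent.max())      # endpoints -2*beta and 0
print(activation.min(), activation.max())  # endpoints e^(-2*beta) and 1
```

The upper bound 1 is attained exactly when affinity == 1, i.e. when a test feature coincides with a cache key.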
The primary concern is that clip_logits and cache_logits are not on the same scale: clip_logits ranges over [-100, 100], while the adapter activation A = exp(-(beta - beta * affinity)) lies mostly in (0, 1]. This discrepancy might hinder the effective fusion of the two terms in tip_logits = clip_logits + cache_logits * alpha.
Given that alpha is typically a single-digit number, I'm wondering whether this difference in scale is intended, or whether additional scaling or normalization is needed to align these logits more effectively. Any insights or suggestions would be greatly appreciated.
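To make the mismatch concrete, here is a small self-contained NumPy sketch of the fusion. The shapes, seed, and hyperparameter values are illustrative assumptions, not taken from the Tip-Adapter repo; the formulas mirror the lines quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """L2-normalize along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Illustrative sizes: 8 test images, 512-d features, 10 classes, 160 cached shots
n_val, dim, n_cls, n_cache = 8, 512, 10, 160
val_features = l2_normalize(rng.standard_normal((n_val, dim)))
clip_weights = l2_normalize(rng.standard_normal((dim, n_cls)), axis=0)
cache_keys = l2_normalize(rng.standard_normal((dim, n_cache)), axis=0)
cache_values = np.eye(n_cls)[rng.integers(0, n_cls, n_cache)]  # one-hot labels

beta, alpha = 5.5, 1.0  # illustrative hyperparameters

clip_logits = 100. * val_features @ clip_weights  # bounded by [-100, 100]
affinity = val_features @ cache_keys              # bounded by [-1, 1]
cache_logits = np.exp(-(beta - beta * affinity)) @ cache_values

tip_logits = clip_logits + cache_logits * alpha

# For random features the CLIP term dominates by a wide margin:
print(np.abs(clip_logits).max(), np.abs(cache_logits).max())
```

With random unit vectors the cosine similarities cluster near zero, so clip_logits lands well inside [-100, 100] (consistent with the 15-20 span reported below for real features), while cache_logits stays tiny unless a test feature is close to a cache key.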
Thank you for your time and assistance!
I'm also curious how the scale gap between clip_logits and cache_logits is mitigated.
On the dataset I'm currently working with, clip_logits actually spans roughly 15-20, whereas cache_logits stays between 0 and 1, as @Aikoin stated theoretically, so the scales of the two logits really do differ in practice.