DN-DETR uses static queries, while DINO uses mixed query selection. Later, Mask DINO reverted to the pure query selection of Deformable DETR. Within this family of DETR architectures, is there any further research or explanation on which content and position (anchor) query pairs should be used during decoding?
For detection, using learnable content queries could be better. Mask DINO mainly focuses on segmentation, which is closely tied to the content queries, so we use selected content queries.
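To make the three options in this thread concrete, here is a minimal sketch in plain Python of how the decoder's initial content and anchor queries might be chosen under each scheme. It is an illustration only, not code from any of the repositories: the function names (`select_top_k`, `init_decoder_queries`), the mode strings, and the list-based data layout are all assumptions, and real implementations work on tensors and typically pass the selected encoder features through extra projection layers.

```python
def select_top_k(scores, k):
    """Return the indices of the k highest-scoring encoder tokens."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def init_decoder_queries(mode, enc_feats, enc_boxes, enc_scores,
                         learned_content, learned_anchors):
    """Pick initial (content, anchor) queries for the decoder.

    Hypothetical helper for illustration:
      "static": both come from learnable embeddings (no query selection).
      "pure":   both come from the top-K encoder tokens.
      "mixed":  anchors come from the top-K encoder tokens,
                content stays learnable.
    """
    k = len(learned_content)  # number of decoder queries
    top = select_top_k(enc_scores, k)
    if mode == "static":
        return learned_content, learned_anchors
    if mode == "pure":
        return [enc_feats[i] for i in top], [enc_boxes[i] for i in top]
    if mode == "mixed":
        return learned_content, [enc_boxes[i] for i in top]
    raise ValueError(f"unknown mode: {mode}")


# Toy data: three encoder tokens, two decoder queries.
enc_feats = ["f0", "f1", "f2"]
enc_boxes = ["b0", "b1", "b2"]
enc_scores = [0.1, 0.9, 0.5]
learned_content = ["c0", "c1"]
learned_anchors = ["a0", "a1"]

mixed = init_decoder_queries("mixed", enc_feats, enc_boxes, enc_scores,
                             learned_content, learned_anchors)
```

In this sketch the only difference between the schemes is which of the two query parts is taken from the encoder output, which is the design axis the question above is asking about.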
May I ask if you have any follow-up research on the topic? For example, content and anchor queries from the encoder, along with some learnable embeddings, can be integrated in a variety of ways.
In DAB-DETR, a complex design for anchor queries is used in both self- and cross-attention. However, in DINO, you discarded that design and only compared no, pure, and mixed query selection. Could you please explain why this change was made?