
Questions about the hw-modulated attention in DAB-DETR #193

Open · Artificial-Inability opened this issue Feb 1, 2023 · 4 comments

@Artificial-Inability

I have two questions about the hw-modulated attention equation (Eq.(6) in DAB-DETR):

  1. Why use 1/wq and 1/hq instead of wq and hq? Does that mean an anchor with a larger width will result in a narrower attention map in the x direction?
  2. DAB-DETR already updates the 4D anchor in each decoder layer using the embedding of the last layer through an MLP, so why do we still need wref and href, which are also generated from the embedding of the last layer through an MLP? Is that necessary?
@SlongLiu (Contributor) commented Feb 1, 2023

  1. 1/wq makes sure that the attention maps have a similar shape to the anchor boxes. For example, a large w results in a flattened (spread-out) attention map in the x direction under the 1/wq formulation. We provide some visualizations in our paper. (A numeric sketch of this effect follows after this list.)

  2. href and wref are designed to have the same dimension as hq and wq. This helps the final performance.
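A minimal numeric sketch of that intuition (illustrative only, not the DAB-DETR implementation; the standard sinusoidal PE, the 1-D position grid, and taking a softmax over x are assumptions made for this demo): scaling the positional similarity by wref/wq lowers all of the x-scores when wq is large, so after the softmax the attention map spreads over a wider range of x.

```python
import numpy as np

# Illustrative sketch only (not the DAB-DETR code): a 1-D positional
# attention map modulated by wref / wq, as in the x-term of Eq.(6).
# A larger wq scales the positional scores down, so after the softmax the
# attention spreads over a wider range of x positions.

def sinusoidal_pe(pos, dim=64, temperature=10000.0):
    """Standard sinusoidal positional encoding for scalar positions."""
    pos = np.asarray(pos, dtype=np.float64)
    i = np.arange(dim // 2)
    freq = temperature ** (2 * i / dim)
    angles = pos[..., None] / freq
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def x_attention(x_ref, w_q, w_ref=1.0, num_pos=100, dim=64):
    """Softmax over x of PE(x)·PE(x_ref) · w_ref / w_q (the x-term of Eq.(6))."""
    xs = np.arange(num_pos)
    scores = sinusoidal_pe(xs, dim) @ sinusoidal_pe([x_ref], dim).ravel()
    scores = scores * (w_ref / w_q) / np.sqrt(dim)   # the 1/wq modulation
    e = np.exp(scores - scores.max())
    return xs, e / e.sum()

for w_q in (0.5, 1.0, 3.0):
    xs, attn = x_attention(x_ref=50, w_q=w_q)
    mean = (attn * xs).sum()
    spread = np.sqrt((attn * (xs - mean) ** 2).sum())  # std of the attention map
    print(f"wq={w_q:3.1f}  peak={attn.max():.4f}  spread(std)={spread:5.2f}")
```

Running this for wq ∈ {0.5, 1, 3} should show the peak value dropping and the spread (std) of the map growing as wq increases, i.e. wider boxes get wider attention along x.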

@Artificial-Inability (Author)

> 1. 1/wq makes sure that the attention maps have a similar shape to the anchor boxes. For example, a large w results in a flattened (spread-out) attention map in the x direction under the 1/wq formulation. We provide some visualizations in our paper.

Could you give a more detailed explanation of how this works? My understanding of the "H=1, W=3" in Figure 6 of the DAB-DETR paper is that "href/hq = 1, wref/wq = 3", in which case a larger wq would lead to a smaller W. If I misunderstood something, what is the definition of H and W in Figure 6? Thanks.

@SlongLiu (Contributor) commented Feb 2, 2023

> > 1. 1/wq makes sure that the attention maps have a similar shape to the anchor boxes. For example, a large w results in a flattened (spread-out) attention map in the x direction under the 1/wq formulation. We provide some visualizations in our paper.
>
> Could you give a more detailed explanation of how this works? My understanding of the "H=1, W=3" in Figure 6 of the DAB-DETR paper is that "href/hq = 1, wref/wq = 3", in which case a larger wq would lead to a smaller W. If I misunderstood something, what is the definition of H and W in Figure 6? Thanks.

The results in Fig. 6 are examples. "H=1, W=3" means hq = 1, wq = 3. We assume href and wref are 1.
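Spelling that out in the thread's notation (a worked reading of Eq.(6) under the stated assumption href = wref = 1; D is the embedding dimension):

```latex
\begin{align*}
\mathrm{ModulateAttn}\big((x,y),(x_{\mathrm{ref}},y_{\mathrm{ref}})\big)
  &= \Big(\mathrm{PE}(x)\cdot\mathrm{PE}(x_{\mathrm{ref}})\,\tfrac{w_{\mathrm{ref}}}{w_q}
        + \mathrm{PE}(y)\cdot\mathrm{PE}(y_{\mathrm{ref}})\,\tfrac{h_{\mathrm{ref}}}{h_q}\Big)\Big/\sqrt{D} \\
  &= \Big(\tfrac{1}{3}\,\mathrm{PE}(x)\cdot\mathrm{PE}(x_{\mathrm{ref}})
        + \mathrm{PE}(y)\cdot\mathrm{PE}(y_{\mathrm{ref}})\Big)\Big/\sqrt{D}
     \qquad\text{for } h_q = 1,\ w_q = 3,\ h_{\mathrm{ref}} = w_{\mathrm{ref}} = 1 .
\end{align*}
```

So for this query the x-term is down-weighted by a factor of 3 relative to the y-term, and the resulting positional attention is stretched more along x than along y, matching the wide, flat "H=1, W=3" pattern in Figure 6.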

@Artificial-Inability (Author)

> > > 1. 1/wq makes sure that the attention maps have a similar shape to the anchor boxes. For example, a large w results in a flattened (spread-out) attention map in the x direction under the 1/wq formulation. We provide some visualizations in our paper.
> >
> > Could you give a more detailed explanation of how this works? My understanding of the "H=1, W=3" in Figure 6 of the DAB-DETR paper is that "href/hq = 1, wref/wq = 3", in which case a larger wq would lead to a smaller W. If I misunderstood something, what is the definition of H and W in Figure 6? Thanks.
>
> The results in Fig. 6 are examples. "H=1, W=3" means hq = 1, wq = 3. We assume href and wref are 1.

I still can't understand this phenomenon theoretically. If the original value of the attention map at a fixed point is calculated as PE(x)·PE(xref)·wref/wq + ..., then when we increase wq to wq' = 3·wq, the new value should decrease, which would result in a narrower attention map. Could you explain theoretically why a larger wq leads to a wider attention map under this formulation? Thanks.
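For what it's worth, one way to reconcile the two readings with a worked ratio (this assumes, as the discussion above does, that the modulated term enters the attention logits before the softmax; the positions x1, x2 and the shorthand s(x) are introduced here only for illustration):

```latex
\begin{align*}
\text{Let } s(x) &= \mathrm{PE}(x)\cdot\mathrm{PE}(x_{\mathrm{ref}})\,w_{\mathrm{ref}}\big/\sqrt{D}. \\
\frac{\text{attention at } x_1}{\text{attention at } x_2}
  &= \exp\!\Big(\frac{s(x_1)-s(x_2)}{w_q}\Big)
  \quad\longrightarrow\quad
  \exp\!\Big(\frac{s(x_1)-s(x_2)}{3\,w_q}\Big)
  \quad\text{when } w_q \to 3\,w_q .
\end{align*}
```

Every unnormalized score does decrease, as noted above, but after the softmax only these ratios matter: tripling wq shrinks the gap between the peak and its surroundings, so the normalized map becomes flatter and therefore wider along x rather than narrower.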
