I directly used the AdamW optimizer for backpropagation and found that the value of the learned alpha kept decreasing until it dropped below 1.
May I ask if I am using the entmax method incorrectly?
This usage is not incorrect per se, but it might not get you exactly what you're looking for. In general, it's useful to apply some constraint on what values alpha can take. In Correia et al., 2019, they parameterized it essentially like this:
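Roughly, a minimal sketch of that parameterization (the names alpha_param and attention_scores here are illustrative, not the paper's exact code):

import torch
from entmax import entmax_bisect

attention_scores = torch.randn(2, 4, 16, 16)         # dummy (batch, heads, query, key) logits
alpha_param = torch.nn.Parameter(torch.tensor(0.0))  # unconstrained raw parameter, updated by the optimizer
alpha = 1 + torch.sigmoid(alpha_param)                # sigmoid squashes onto (0, 1), so alpha stays in (1, 2)
attention_probs = entmax_bisect(attention_scores, alpha=alpha, dim=-1)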
This guarantees that the alpha value will always be on the interval (1, 2) (in other words, somewhere between softmax and sparsemax). In principle you could constrain it in other ways as well -- to my knowledge, no one has explored this in much depth.
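For instance (just a sketch of one alternative, not something from the paper), a softplus mapping would let alpha grow past 2, which tends to make the output even sparser than sparsemax:

alpha = 1 + torch.nn.functional.softplus(alpha_param)  # maps any real value onto (1, inf)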
Thank you! Your answer is very helpful for my interdisciplinary research.
Does alpha have a significant impact on sparsity? I let alpha update with the same optimizer and the same learning rate as all the other parameters, and found that the magnitude of alpha's changes stays between roughly 0.002 and 0.1. I am considering whether to set a separate learning rate for alpha so that it can take bigger steps. Here is my current setup:
self.alpha = torch.nn.Parameter(torch.tensor(1.33))  # learnable alpha, initialized between softmax (alpha=1) and sparsemax (alpha=2)
attention_probs = entmax_bisect(attention_scores, alpha=self.alpha, dim=-1)  # entmax_bisect from the entmax package
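Concretely, for the separate learning rate I was thinking of something along these lines (a sketch; SparseAttn is just a stand-in for my module and the learning rates are illustrative):

import torch

class SparseAttn(torch.nn.Module):  # hypothetical minimal module standing in for the real model
    def __init__(self):
        super().__init__()
        self.alpha = torch.nn.Parameter(torch.tensor(1.33))
        self.proj = torch.nn.Linear(16, 16)

model = SparseAttn()
# give alpha its own AdamW parameter group with a larger step size
alpha_params = [p for n, p in model.named_parameters() if n.endswith("alpha")]
other_params = [p for n, p in model.named_parameters() if not n.endswith("alpha")]
optimizer = torch.optim.AdamW([
    {"params": other_params, "lr": 1e-4},  # base learning rate
    {"params": alpha_params, "lr": 1e-2},  # larger learning rate just for alpha
])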