Currently, sample_sequence() first does rejection sampling (i.e. checking whether the token is allowed after sampling it) and computes the full mask only if that check fails. This matches what llama.cpp does.
This is equivalent to computing the full mask and then sampling from the masked logits for arg-max and temperature sampling, but not for top_p and top_k (and possibly other sampling methods).
Initially ChatGPT told me this, and after thinking about it a bit, I'm convinced it's right.
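A quick way to see the top_k discrepancy is a toy simulation like the sketch below (illustrative only; the token probabilities, the disallowed set, and the exact fallback order of mask-then-top-k are assumptions, not the actual sample_sequence() code):

```python
import random
from collections import Counter

# Toy setup (all numbers made up): four tokens, the grammar disallows "B".
probs = {"A": 0.40, "B": 0.35, "C": 0.20, "D": 0.05}
allowed = {"A", "C", "D"}
TOP_K = 2
N = 200_000

def top_k(p, k):
    """Keep the k most probable tokens and renormalize."""
    kept = dict(sorted(p.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {t: v / total for t, v in kept.items()}

def mask(p):
    """Drop disallowed tokens and renormalize (the 'full mask' path)."""
    kept = {t: v for t, v in p.items() if t in allowed}
    total = sum(kept.values())
    return {t: v / total for t, v in kept.items()}

def sample(p):
    tokens, weights = zip(*p.items())
    return random.choices(tokens, weights)[0]

def rejection_first(p):
    """Sample from the top-k distribution; only if the token is disallowed,
    fall back to masking and re-applying top-k (assumed fallback order)."""
    tok = sample(top_k(p, TOP_K))
    if tok in allowed:
        return tok
    return sample(top_k(mask(p), TOP_K))

def mask_first(p):
    """Always mask, then apply top-k."""
    return sample(top_k(mask(p), TOP_K))

random.seed(0)
a = Counter(rejection_first(probs) for _ in range(N))
b = Counter(mask_first(probs) for _ in range(N))
print("rejection-first:", {t: round(c / N, 3) for t, c in sorted(a.items())})
print("mask-first:     ", {t: round(c / N, 3) for t, c in sorted(b.items())})
# With TOP_K = 2 the two paths disagree (roughly 0.84/0.16 vs 0.67/0.33 for A/C);
# with TOP_K = len(probs) (plain temperature sampling) the two distributions coincide.
```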
Possible courses of action:
- ignore it (it's probably close enough)
- only do rejection sampling for temperature and arg-max
- always compute the full mask
Note that llguidance now has an interface for a cheaper check of whether a token is allowed than the one I used in #899; I can try to get that in at some point, unless we go with the last option above.