
rejection sampling for top_p etc #963

Open
mmoskal opened this issue Dec 2, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@mmoskal
Contributor

mmoskal commented Dec 2, 2024

Currently, sample_sequence() first does rejection sampling (i.e., it samples a token and then checks whether that token is allowed), and only if this check fails does it compute the full mask. This is the same approach as llama.cpp.
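A minimal sketch of that flow, with hypothetical callback names (`is_allowed`, `compute_mask`) standing in for the actual grammar interface:

```python
import random

def sample_from(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]

def sample_sequence_step(probs, is_allowed, compute_mask, rng):
    """Sketch of the rejection-sampling flow: sample first, check the
    token, and only compute the full mask if the check fails.
    Assumes the grammar always allows at least one token."""
    tok = sample_from(probs, rng)          # cheap path: no mask needed
    if is_allowed(tok):
        return tok
    mask = compute_mask()                  # expensive path: full mask
    masked = [p if m else 0.0 for p, m in zip(probs, mask)]
    return sample_from(masked, rng)
```

Usage: `sample_sequence_step([0.7, 0.1, 0.1, 0.1], lambda t: mask[t], lambda: mask, random.Random(0))` always returns an allowed token, while only touching the mask on the slow path.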

This is equivalent to computing the full mask first and then sampling from the masked logits in the case of arg-max and temperature sampling, but not for top_p and top_k (and possibly other sampling methods).
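One way to see this is to compute the exact output distribution of both procedures on a toy example. The sketch below (illustrative, not the actual implementation) treats the sampler transform as pluggable: the identity transform models plain temperature sampling, and `top_p_dist` models nucleus truncation.

```python
def top_p_dist(probs, p):
    """Nucleus truncation: keep top tokens until cumulative mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def rejection_dist(sampler, probs, mask):
    """Exact output distribution of: sample from sampler(probs); keep the
    token if the grammar allows it, otherwise compute the mask and
    resample from sampler(masked probs)."""
    q = sampler(probs)
    accepted = [qi if m else 0.0 for qi, m in zip(q, mask)]
    p_reject = 1.0 - sum(accepted)
    masked = normalize([pi if m else 0.0 for pi, m in zip(probs, mask)])
    return [a + p_reject * r for a, r in zip(accepted, sampler(masked))]

probs = [0.4, 0.3, 0.2, 0.1]
mask = [False, True, True, True]   # token 0 disallowed by the grammar
mask_first = normalize([pi if m else 0.0 for pi, m in zip(probs, mask)])

# temperature (identity transform): the two procedures agree
identity = lambda p: p
print(rejection_dist(identity, probs, mask))   # matches mask_first

# top_p = 0.7: they disagree, because truncation happens before masking
top_p = lambda p: top_p_dist(p, 0.7)
print(rejection_dist(top_p, probs, mask))      # differs from top_p(mask_first)
```

With top_p, the rejection path puts extra mass on tokens that survive truncation of the unmasked distribution, so the mixture no longer equals top_p applied to the masked distribution.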

Initially this is what ChatGPT told me, and after thinking about it a bit I'm convinced it's right.

Possible courses of action:

  • ignore it (the resulting distribution is probably close enough)
  • only do rejection sampling for temperature and arg-max
  • always compute the full mask

Note that llguidance now has an interface for a cheaper check of whether a single token is allowed than the one I used in #899 - I can try to wire that in at some point, unless we go with the last option above.
