Currently, sample_sequence() first does rejection sampling (i.e. checking whether the token is allowed after sampling it) and computes the full mask only if that check fails. This matches what llama.cpp does.
This is equivalent to computing the full mask and then sampling from the masked logits for arg-max and temperature sampling, but not for top_p and top_k (and possibly other sampling methods).
Initially ChatGPT told me this, and after thinking about it a bit, I'm convinced it's right.
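A quick way to see the top_k discrepancy is a toy simulation like the sketch below (illustrative only; the token probabilities, the disallowed set, and the exact fallback order of mask-then-top-k are assumptions, not the actual sample_sequence() code):

```python
import random
from collections import Counter

# Toy setup (all numbers made up): four tokens, the grammar disallows "B".
probs = {"A": 0.40, "B": 0.35, "C": 0.20, "D": 0.05}
allowed = {"A", "C", "D"}
TOP_K = 2
N = 200_000

def top_k(p, k):
    """Keep the k most probable tokens and renormalize."""
    kept = dict(sorted(p.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {t: v / total for t, v in kept.items()}

def mask(p):
    """Drop disallowed tokens and renormalize (the 'full mask' path)."""
    kept = {t: v for t, v in p.items() if t in allowed}
    total = sum(kept.values())
    return {t: v / total for t, v in kept.items()}

def sample(p):
    tokens, weights = zip(*p.items())
    return random.choices(tokens, weights)[0]

def rejection_first(p):
    """Sample from the top-k distribution; only if the token is disallowed,
    fall back to masking and re-applying top-k (assumed fallback order)."""
    tok = sample(top_k(p, TOP_K))
    if tok in allowed:
        return tok
    return sample(top_k(mask(p), TOP_K))

def mask_first(p):
    """Always mask, then apply top-k."""
    return sample(top_k(mask(p), TOP_K))

random.seed(0)
a = Counter(rejection_first(probs) for _ in range(N))
b = Counter(mask_first(probs) for _ in range(N))
print("rejection-first:", {t: round(c / N, 3) for t, c in sorted(a.items())})
print("mask-first:     ", {t: round(c / N, 3) for t, c in sorted(b.items())})
# With TOP_K = 2 the two paths disagree (roughly 0.84/0.16 vs 0.67/0.33 for A/C);
# with TOP_K = len(probs) (plain temperature sampling) the two distributions coincide.
```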
Possible courses of action:
- ignore it (it's probably close enough)
- only do rejection sampling for temperature and arg-max
- always compute the full mask
Note that llguidance now has an interface for a cheaper check of whether a token is allowed than the one I used in #899; I can try to get that in at some point, unless we go with the last option above.