Add interface to Guide object to update masks in place, and associated kernels. #183

Open · wants to merge 4 commits into main
Conversation

unaidedelf8777 (Contributor)
Towards resolving #178.

I added a write_mask_into method to the Guide object. This method takes 3 arguments:

  • data_ptr: pointer to the start of the contiguous memory for the array
  • numel: number of elements in the array
  • element_size: size in bytes of each element in the array. This must be 4, since only u32 arrays are supported; otherwise a ValueError is raised.

In a mask array, each u32 encodes the validity of 32 tokens (one per bit). Masks must also be stored in contiguous memory so that Rust can access and modify them.
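For reference, here is a minimal usage sketch from the Python side, assuming a NumPy-backed buffer (the Guide construction is elided and the vocabulary size is a placeholder):

```python
import numpy as np

# One bit per token, packed into u32 words: ceil(vocab_size / 32) elements.
vocab_size = 128_256  # placeholder; use the tokenizer's actual vocabulary size
mask = np.zeros((vocab_size + 31) // 32, dtype=np.uint32)

# `guide` is an existing Guide instance; the mask is filled in place.
guide.write_mask_into(
    mask.ctypes.data,  # data_ptr: address of the contiguous buffer
    mask.size,         # numel: number of u32 elements
    mask.itemsize,     # element_size: must be 4, otherwise a ValueError is raised
)
```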

Currently, kernels for both torch and numpy are implemented. The numpy kernels require an additional dependency on numba to bring runtime down to around 40 microseconds (1 mask, 1 logits array). Runtime for the torch kernel with 1 mask and 1 logits array is about half that of numpy, at ~23 microseconds per run, mostly thanks to torch.compile. The form of the numpy kernel is not final; it will be updated to scale better and use vectorized ops. If I can do so without hurting performance (or if you all would prefer), I will remove the dependency on numba.
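For intuition, here is a rough sketch of what a torch-side kernel of this kind can look like (not the exact implementation in this PR): each packed u32 word is expanded back into its 32 bits, and disallowed logits are set to -inf.

```python
import torch

@torch.compile
def apply_token_bitmask(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # mask: packed bitmask viewed as int32, shape (ceil(vocab / 32),)
    # logits: shape (vocab,)
    bits = torch.arange(32, device=mask.device, dtype=torch.int32)
    # Unpack each 32-bit word into individual bits, LSB first: (words, 32) -> (words * 32,)
    allowed = ((mask.unsqueeze(-1) >> bits) & 1).reshape(-1)[: logits.shape[-1]].bool()
    neg_inf = torch.tensor(float("-inf"), device=logits.device, dtype=logits.dtype)
    return torch.where(allowed, logits, neg_inf)
```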

All kernels reside in the outlines_core.kernels submodule, with each kernel's dependencies imported dynamically in a try/except block instead of being added to the package dependencies.
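Roughly, the optional-dependency pattern looks like this (module path and error message are illustrative, not the exact code):

```python
# outlines_core/kernels/numpy.py (illustrative)
try:
    import numba
    import numpy as np
except ImportError as e:
    raise ImportError(
        "The numpy kernels require numpy and numba to be installed. "
        "Install them with: pip install numpy numba"
    ) from e
```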

TODO:

  • write_mask_into method on Guide
  • numpy kernels
  • torch kernels
  • mlx kernels
  • Final pass over kernels; make sure they scale well to larger batch sizes.
  • Tests
  • Add benchmarks to ASV.

Please feel free to critique any of this.

unaidedelf8777 (Contributor Author) commented Feb 25, 2025:

Current benchmarks for write_mask_into, tested with the unsloth/Llama-3.1-8B-Instruct tokenizer:

Benchmarking regex: "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
Time to write mask for regex: 93 useconds
Num Allowed tokens: 25510

Benchmarking regex: '\\\\+?[1-9][0-9]{7,14}'
Time to write mask for regex: 11 useconds
Num Allowed tokens: 4

Benchmarking regex: '([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])(\\.|-|/)([1-9]|0[1-9]|1[0-2])(\\.|-|/)([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])|([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])(\\.|-|/)([1-9]|0[1-9]|1[0-2])(\\.|-|/)([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])'
Time to write mask for regex: 8 useconds
Num Allowed tokens: 130

Benchmarking regex: '(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
Time to write mask for regex: 10 useconds
Num Allowed tokens: 366

Benchmarking regex: '(https?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w \\.-]*)*\\/?'
Time to write mask for regex: 84 useconds
Num Allowed tokens: 23277

Benchmarking regex: '\\d{3}-\\d{2}-\\d{4}'
Time to write mask for regex: 13 useconds
Num Allowed tokens: 1222

rlouf (Member) commented Feb 25, 2025:

Awesome! Do you have some profiling results that show the time spent on each operation across the whole chain?


```python
# This takes roughly 23 microseconds per run, with a bitmask of
# 1k allowed tokens, and 128k logits tensor.
# Also compiles to one graph with no graph breaks
```
rlouf (Member) commented on these lines:

Is there any way to access the CUDA code generated by PyTorch? It might be over-engineering for now, but I'd like to get an idea of how efficient that code is and if there are gains to be had there in the future.

unaidedelf8777 (Contributor Author) replied Feb 25, 2025:

Seems possible - just have to find the temp directory where it dumps it: https://pytorch.org/tutorials/intermediate/inductor_debug_cpu.html
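For reference, the approach in that tutorial boils down to setting a debug flag before torch is imported, after which Inductor dumps the generated code (Triton on GPU, C++ on CPU) to a local directory:

```python
import os

# Must be set before torch is imported for Inductor to pick it up.
os.environ["TORCH_COMPILE_DEBUG"] = "1"

import torch
# ... compile and run the kernel once; the generated code then appears under
# ./torch_compile_debug/run_<timestamp>/.../output_code.py
```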

unaidedelf8777 (Contributor Author) replied:
> Awesome! Do you have some profiling results that show the time spent on each operation across the whole chain?

@rlouf For Rust, the kernels, or both?
