Add LogitTrackingProcessor #1408
There are a few current issues:
outlines/processors/tracking.py:22: error: Library stubs not installed for "pandas" [import-untyped]
outlines/processors/tracking.py:22: note: Hint: "python3 -m pip install pandas-stubs"
outlines/processors/tracking.py:22: note: (or run "mypy --install-types" to install all missing stub packages)
outlines/processors/tracking.py:22: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
outlines/processors/tracking.py:105: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape" [union-attr]
outlines/processors/tracking.py:108: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape" [union-attr]
outlines/processors/tracking.py:275: error: Value of type "dict[str, list[Any] | Any] | None" is not indexable [index]
outlines/processors/tracking.py:276: error: Value of type "dict[str, list[Any] | Any] | None" is not indexable [index]
outlines/processors/tracking.py:508: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:512: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:515: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:516: error: "LogitTrackingProcessor" has no attribute "tokenizer" [attr-defined]
outlines/processors/tracking.py:516: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:519: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py
    return "".join(tokenizer.decode(tokens_to_decode))


def add_tracking(generator: "SequenceGenerator") -> "SequenceGenerator":
I would name this track_logits instead.
- Token decoding requires the wrapped processor to have a tokenizer attribute
- Memory usage grows linearly with sequence length
- The tracking processor only supports single-batch processing
- Tracking logits can incur significant overhead -- do not use it in production environments
I would add an example of how we can use these logits processors directly with e.g. transformer pipes
Looks good to me; it's a great addition. I just have a few minor comments.
Great addition! I left a few comments that need to be addressed before merging.
I've addressed the comments, appreciated! I have some remaining code style issues that are beyond my knowledge at the moment; if people have tips on these, I'd love to hear them. Otherwise I'll have to come back to this in a week or two.
outlines/processors/tracking.py:242: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]" [call-overload]
outlines/processors/tracking.py:242: note: Possible overload variants:
outlines/processors/tracking.py:242: note: def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:242: note: def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:243: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]" [call-overload]
outlines/processors/tracking.py:243: note: Possible overload variants:
outlines/processors/tracking.py:243: note: def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:243: note: def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:246: error: Value of type "dict[str, Any | Any | list[Any] | Any] | None" is not indexable [index]
outlines/processors/tracking.py:246: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]" [call-overload]
outlines/processors/tracking.py:246: note: Possible overload variants:
outlines/processors/tracking.py:246: note: def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:246: note: def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:247: error: Value of type "dict[str, Any | Any | list[Any] | Any] | None" is not indexable [index]
outlines/processors/tracking.py:247: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]" [call-overload]
outlines/processors/tracking.py:247: note: Possible overload variants:
outlines/processors/tracking.py:247: note: def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:247: note: def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:348: error: Library stubs not installed for "pandas" [import-untyped]
outlines/processors/tracking.py:348: note: Hint: "python3 -m pip install pandas-stubs"
outlines/processors/tracking.py:348: note: (or run "mypy --install-types" to install all missing stub packages)
outlines/processors/tracking.py:348: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
outlines/processors/tracking.py:368: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape" [union-attr]
outlines/processors/tracking.py:369: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]" [call-overload]
outlines/processors/tracking.py:369: note: Possible overload variants:
outlines/processors/tracking.py:369: note: def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:369: note: def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:370: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]" [call-overload]
outlines/processors/tracking.py:370: note: Possible overload variants:
outlines/processors/tracking.py:370: note: def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:370: note: def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:428: error: Item "None" of "OutlinesLogitsProcessor | None" has no attribute "tokenizer" [union-attr]
outlines/processors/tracking.py:472: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:476: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:479: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:480: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:483: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
tests/test_types.py:37: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_types.py:93: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_prompts.py:222: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_prompts.py:223: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_prompts.py:233: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_prompts.py:234: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_prompts.py:243: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_prompts.py:244: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_function.py:16: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/processors/test_tracking.py:4: error: Library stubs not installed for "pandas" [import-untyped]
I can take a look.
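Most of the errors above look like type-narrowing problems rather than logic bugs. The sketch below shows the two usual patterns, assuming the flagged attributes are an Optional dict and a value that may be either a plain list or an array. The names Tracker, stats, and column are hypothetical stand-ins and are not taken from tracking.py. The pandas error is separate, and the hint printed in the mypy output (installing pandas-stubs) should cover it.

```python
from typing import Any, Dict, List, Optional, Union


class Tracker:
    """Hypothetical stand-in, not the class from this PR."""

    def __init__(self) -> None:
        # Optional attribute: mypy rejects indexing it until None is ruled out.
        self.stats: Optional[Dict[str, List[float]]] = None

    def record(self, key: str, value: float) -> None:
        if self.stats is None:
            self.stats = {}
        self.stats.setdefault(key, []).append(value)

    def last(self, key: str) -> float:
        # An explicit guard narrows Optional[...] and fixes "[index]" errors.
        if self.stats is None:
            raise ValueError("nothing recorded yet")
        return self.stats[key][-1]


def column(rows: Union[List[List[float]], Any], index: int) -> List[float]:
    # A plain list does not support rows[:, index]; branching on the runtime
    # type lets mypy treat each case separately.
    if isinstance(rows, list):
        return [row[index] for row in rows]
    # Assumed to be an array-like object that supports 2-D slicing.
    return list(rows[:, index])
```

The same guard style applies to the "union-attr" errors on .shape: check for the list case first, then use the array attributes in the other branch.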
I fixed the formatting issues. I'll do a little refactoring and then we'll be good to merge.
As an idle thought here -- is there an interface available to us where we could wrap the resulting generated object with the logits, rather than store it in the logit processor as I have here? Currently we only return strings from Might be a "kick it down the road" thing, but curious if @rlouf @torymur @RobinPicard had an idea of whether this would be simple to do. If it is simple, I could try refactoring this code to store logits in a response value.
This PR adds LogitTrackingProcessor, a logit processor that wraps around any other processor to store unstructured and structured logits through the sampled sequence. I needed this code elsewhere and figured it is popular enough to upstream into outlines.
I have included documentation on the processors, which doesn't exist currently. Tests are included as well.
LogitTrackingProcessor makes it easy to perform analysis on disagreements between structured and unstructured tokens. It will be of benefit to researchers, educators, and users who wish to debug their Outlines generators.
An example plot for a regex requiring four digits. This is the distribution of token probabilities on the first token.
Using the tracker is simple:
NOTE: Currently, the tracking processor does not support batch processing. I recommend deferring this to a later PR.
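For readers who want the wrapping idea in miniature, here is a rough, dependency-free sketch. It is not the LogitTrackingProcessor from this PR: TrackingWrapper, digits_only, and the 12-token toy vocabulary are all made up for illustration, and logits are plain Python lists rather than tensors. It only shows the pattern of copying the logits before and after the wrapped processor runs, then comparing the two distributions.

```python
import math
from typing import Callable, List, Optional

Logits = List[float]
Processor = Callable[[List[int], Logits], Logits]


class TrackingWrapper:
    """Hypothetical sketch; not the class added in this PR."""

    def __init__(self, processor: Optional[Processor] = None) -> None:
        self.processor = processor
        self.unstructured: List[Logits] = []  # logits as the model produced them
        self.structured: List[Logits] = []    # logits after the wrapped processor

    def __call__(self, input_ids: List[int], logits: Logits) -> Logits:
        # Copy before the wrapped processor gets a chance to modify anything.
        self.unstructured.append(list(logits))
        out = self.processor(input_ids, logits) if self.processor else logits
        self.structured.append(list(out))
        return out


def softmax(logits: Logits) -> List[float]:
    # Assumes at least one finite logit; exp(-inf) evaluates to 0.0.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def digits_only(input_ids: List[int], logits: Logits) -> Logits:
    # Toy constraint: pretend token ids 0-9 are digits and mask the rest.
    return [x if i < 10 else -math.inf for i, x in enumerate(logits)]


tracker = TrackingWrapper(digits_only)
raw = [0.1 * i for i in range(12)]
tracker([], raw)

# First-token distributions with and without the constraint.
p_unstructured = softmax(tracker.unstructured[0])
p_structured = softmax(tracker.structured[0])
```

Storing one copy of the logits per step is what makes memory grow linearly with sequence length, so a wrapper like this is better suited to debugging than to production use.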
Related: outlines.processors for Sampling Techniques and Debug Logging #1055, as this can be used as a debugging tool.