Add LogitTrackingProcessor #1408

@cpfiffer cpfiffer commented Feb 8, 2025

This PR adds LogitTrackingProcessor, a logit processor that wraps any other processor and stores both the unstructured and the structured logits at each step of the sampled sequence. I needed this code elsewhere and figured it would be popular enough to upstream into Outlines.

I have included documentation on the processors, which doesn't exist currently. Tests are included as well.

LogitTrackingProcessor makes it easy to perform analysis on disagreements between structured and unstructured tokens. It will be of benefit to researchers, educators, and users who wish to debug their Outlines generators.

Below is an example plot for a regex requiring four digits. It shows the distribution of token probabilities for the first token.

image

Using the tracker is simple:

from outlines import generate, models
from outlines.processors import add_tracking
from pydantic import BaseModel
import pandas as pd

model = models.transformers("HuggingFaceTB/SmolLM2-135M-Instruct")
tokenizer = model.tokenizer.tokenizer

class Person(BaseModel):
    name: str
    age: int

# Create generator with tracking
generator = generate.json(model, Person)

# Convenience wrapper to add tracking
generator = add_tracking(generator)

# Apply templating
prompt = tokenizer.apply_chat_template(
    [{"role": "system", "content": "You are a helpful assistant, responding in JSON."}, {"role": "user", "content": "Make me a person with a name, age, zip code, and state. Return the JSON only."}],
    tokenize=False,
    add_bos=True,
    add_generation_prompt=True,
)

# Generate the response
generator(prompt)

# Retrieve the top-k tokens
top_k = generator.logits_processor.get_top_tokens(k=5)

# Inspect the unstructured and structured values at each position
for position_dict in top_k:
    position_dict['position'] # 0,1,2, etc
    position_dict['text_so_far'] # Text at this point in the sequence

    for token in position_dict['tokens']:
        token['token'] # The token
        token['unstructured_prob'] # Probability of the token in the unstructured distribution
        token['structured_prob'] # Probability of the token in the structured distribution
        token['unstructured_logit'] # Logit of the token in the unstructured distribution
        token['structured_logit'] # Logit of the token in the structured distribution
        token['is_chosen'] # Whether the token was actually sampled

# Convert to dataframe
df = generator.logits_processor.to_dataframe(show="probs", min_value=0.01)
#    position token   natural  constrained  chosen
# 0         0   You  0.021324          0.0   False
# 1         0   The  0.021959          0.0   False
# 2         0  Sure  0.025492          0.0   False
# 3         0  JSON  0.031045          0.0   False
# 4         0    To  0.031047          0.0   False

# Get the token sequence up to position 5
generator.logits_processor.sequence(5)
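
The per-position structure returned by get_top_tokens lends itself to a simple disagreement check. The following is a minimal sketch and not part of this PR: find_disagreements is a hypothetical helper, and sample_top_k is made-up data that only mirrors the dictionary keys shown above.

```python
from typing import Any, Dict, List


def find_disagreements(top_k: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return positions where the unstructured favourite was not the sampled token."""
    disagreements = []
    for position_dict in top_k:
        tokens = position_dict["tokens"]
        if not tokens:
            continue
        # Token the model would have preferred without any constraint
        favourite = max(tokens, key=lambda t: t["unstructured_prob"])
        # The sampled token may fall outside the top-k, hence the None default
        chosen = next((t for t in tokens if t["is_chosen"]), None)
        if chosen is not None and chosen["token"] != favourite["token"]:
            disagreements.append(
                {
                    "position": position_dict["position"],
                    "unstructured_favourite": favourite["token"],
                    "chosen": chosen["token"],
                    "unstructured_prob_of_chosen": chosen["unstructured_prob"],
                }
            )
    return disagreements


# Hypothetical data shaped like the get_top_tokens output described above
sample_top_k = [
    {
        "position": 0,
        "text_so_far": "",
        "tokens": [
            {"token": "Sure", "unstructured_prob": 0.6, "structured_prob": 0.0, "is_chosen": False},
            {"token": "{", "unstructured_prob": 0.1, "structured_prob": 1.0, "is_chosen": True},
        ],
    },
    {
        "position": 1,
        "text_so_far": "{",
        "tokens": [
            {"token": '"', "unstructured_prob": 0.9, "structured_prob": 1.0, "is_chosen": True},
        ],
    },
]

disagreements = find_disagreements(sample_top_k)
print(disagreements)
```

Only position 0 is reported here, because that is the only step where the constraint overrode the model's unconstrained preference.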

NOTE: Currently, the tracking processor does not support batch processing. I recommend deferring this to a later PR.
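
For readers unfamiliar with the wrapping idea, here is a minimal, library-free sketch of the concept for a single sequence. It is not the implementation in this PR: TrackingWrapper, only_tokens_1_and_2, and the plain-list logits are assumptions made for illustration, and the real processor works on tensors.

```python
import math
from typing import Callable, List

Logits = List[float]
Processor = Callable[[List[int], Logits], Logits]


def softmax(logits: Logits) -> List[float]:
    """Convert logits to probabilities; assumes at least one finite logit."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


class TrackingWrapper:
    """Hypothetical wrapper that records logits before and after a processor runs."""

    def __init__(self, processor: Processor) -> None:
        self.processor = processor
        self.unstructured: List[Logits] = []
        self.structured: List[Logits] = []

    def __call__(self, input_ids: List[int], logits: Logits) -> Logits:
        # Copy the raw logits before the wrapped processor can modify them
        self.unstructured.append(list(logits))
        processed = self.processor(input_ids, list(logits))
        self.structured.append(list(processed))
        return processed


def only_tokens_1_and_2(input_ids: List[int], logits: Logits) -> Logits:
    """Toy constraint: mask every token id except 1 and 2."""
    allowed = {1, 2}
    return [x if i in allowed else float("-inf") for i, x in enumerate(logits)]


tracker = TrackingWrapper(only_tokens_1_and_2)
out = tracker([0], [2.0, 1.0, 0.0])

unstructured_probs = softmax(tracker.unstructured[0])
structured_probs = softmax(tracker.structured[0])
print(unstructured_probs, structured_probs)
```

Because both lists grow by one entry per generated token, memory use in this sketch scales with sequence length.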

Related:

@cpfiffer cpfiffer marked this pull request as ready for review February 14, 2025 18:25
@cpfiffer
Contributor Author

There are a few mypy-related checks to resolve, but I think the meat of this is ready for review.

Current issues:

outlines/processors/tracking.py:22: error: Library stubs not installed for "pandas"  [import-untyped]
outlines/processors/tracking.py:22: note: Hint: "python3 -m pip install pandas-stubs"
outlines/processors/tracking.py:22: note: (or run "mypy --install-types" to install all missing stub packages)
outlines/processors/tracking.py:22: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
outlines/processors/tracking.py:105: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape"  [union-attr]
outlines/processors/tracking.py:108: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape"  [union-attr]
outlines/processors/tracking.py:275: error: Value of type "dict[str, list[Any] | Any] | None" is not indexable  [index]
outlines/processors/tracking.py:276: error: Value of type "dict[str, list[Any] | Any] | None" is not indexable  [index]
outlines/processors/tracking.py:508: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:512: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:515: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:516: error: "LogitTrackingProcessor" has no attribute "tokenizer"  [attr-defined]
outlines/processors/tracking.py:516: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:519: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
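
Errors of the [index], [union-attr], and [attr-defined] kinds are usually resolved by narrowing the type before use. The sketch below shows the general patterns only; Tracker, HasShape, HasLogitsProcessor, and the fake classes are hypothetical stand-ins and do not reflect the actual code in tracking.py.

```python
from typing import Any, Dict, List, Optional, Protocol, Tuple, Union, runtime_checkable


class HasShape(Protocol):
    """Stand-in for a tensor-like object exposing a shape."""
    shape: Tuple[int, ...]


@runtime_checkable
class HasLogitsProcessor(Protocol):
    """Stand-in for a generator that carries a logits processor."""
    logits_processor: Any


class Tracker:
    def __init__(self) -> None:
        self.data: Optional[Dict[str, List[float]]] = None

    def record(self, key: str, value: float) -> None:
        if self.data is None:
            self.data = {}
        self.data.setdefault(key, []).append(value)

    def get(self, key: str) -> List[float]:
        # [index]: rule out None before indexing so the checker sees a dict
        if self.data is None:
            raise ValueError("No logits have been tracked yet")
        return self.data[key]


def vocab_size(logits: Union[List[float], HasShape]) -> int:
    # [union-attr]: handle the list branch first; the remaining branch has .shape
    if isinstance(logits, list):
        return len(logits)
    return logits.shape[-1]


def get_processor(generator: object) -> Any:
    # [attr-defined]: check against a protocol instead of assuming the attribute
    if not isinstance(generator, HasLogitsProcessor):
        raise TypeError("generator has no logits_processor attribute")
    return generator.logits_processor


class FakeTensor:
    shape = (1, 5)


class FakeGenerator:
    def __init__(self) -> None:
        self.logits_processor = "processor"


tracker = Tracker()
tracker.record("unstructured", 0.5)
print(tracker.get("unstructured"), vocab_size(FakeTensor()), get_processor(FakeGenerator()))
```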

return "".join(tokenizer.decode(tokens_to_decode))


def add_tracking(generator: "SequenceGenerator") -> "SequenceGenerator":
Member

I would name this track_logits instead.

- Token decoding requires the wrapped processor to have a tokenizer attribute
- Memory usage grows linearly with sequence length
- The tracking processor only supports single-batch processing
- Tracking logits can incur significant overhead -- do not use it in production environments
Member

I would add an example of how we can use these logits processors directly with e.g. transformer pipes

@rlouf
Member

rlouf commented Feb 15, 2025

Looks good to me; it's a great addition. I just have a few minor comments.

Member

@rlouf rlouf left a comment

Great addition! I left a few comments that need to be addressed before merging.

@cpfiffer
Contributor Author

I've addressed the comments, appreciated!

I have some remaining code style issues that are beyond my knowledge at the moment. If anyone has tips on these, I'd love to hear them. Otherwise I'll have to come back to this in a week or two.

outlines/processors/tracking.py:242: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:242: note: Possible overload variants:
outlines/processors/tracking.py:242: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:242: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:243: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:243: note: Possible overload variants:
outlines/processors/tracking.py:243: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:243: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:246: error: Value of type "dict[str, Any | Any | list[Any] | Any] | None" is not indexable  [index]
outlines/processors/tracking.py:246: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:246: note: Possible overload variants:
outlines/processors/tracking.py:246: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:246: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:247: error: Value of type "dict[str, Any | Any | list[Any] | Any] | None" is not indexable  [index]
outlines/processors/tracking.py:247: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:247: note: Possible overload variants:
outlines/processors/tracking.py:247: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:247: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:348: error: Library stubs not installed for "pandas"  [import-untyped]
outlines/processors/tracking.py:348: note: Hint: "python3 -m pip install pandas-stubs"
outlines/processors/tracking.py:348: note: (or run "mypy --install-types" to install all missing stub packages)
outlines/processors/tracking.py:348: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
outlines/processors/tracking.py:368: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape"  [union-attr]
outlines/processors/tracking.py:369: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:369: note: Possible overload variants:
outlines/processors/tracking.py:369: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:369: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:370: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:370: note: Possible overload variants:
outlines/processors/tracking.py:370: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:370: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:428: error: Item "None" of "OutlinesLogitsProcessor | None" has no attribute "tokenizer"  [union-attr]
outlines/processors/tracking.py:472: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:476: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:479: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:480: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:483: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
tests/test_types.py:37: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_types.py:93: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:222: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:223: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:233: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:234: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:243: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:244: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_function.py:16: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/processors/test_tracking.py:4: error: Library stubs not installed for "pandas"  [import-untyped]

@rlouf
Member

rlouf commented Feb 22, 2025

I can take a look.

@rlouf
Member

rlouf commented Feb 22, 2025

I fixed the formatting issues. I'll do a little refactoring and then we'll be good to merge.

@cpfiffer
Contributor Author

As an idle thought here: is there an interface available to us where we could wrap the generated object together with its logits, rather than storing them in the logit processor as I have here?

Currently we only return strings from generator calls, but is there an obvious + simple interface for providing a Result(value=..., logits=...) object? My sense is that this isn't likely to be simple, but the devex would probably be better.

Might be a "kick it down the road" thing, but curious if @rlouf @torymur @RobinPicard had an idea of whether this would be simple to do.

If it is simple, I could try refactoring this code to store logits in a response value.
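
One way to picture the Result(value=..., logits=...) idea is a small dataclass returned by a thin wrapper around the generator call. This is purely illustrative: Result, generate_with_logits, FakeTracker, and fake_generator are hypothetical names used for the sketch, not Outlines APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Result:
    """Hypothetical return type bundling the generated text with tracked logits."""
    value: str
    logits: Dict[str, List[List[float]]] = field(default_factory=dict)


class FakeTracker:
    """Stand-in for a tracking processor that accumulates per-step logits."""

    def __init__(self) -> None:
        self.unstructured: List[List[float]] = []
        self.structured: List[List[float]] = []


def generate_with_logits(
    generator: Callable[[str], str], tracker: FakeTracker, prompt: str
) -> Result:
    """Run the generator, then package its output with a copy of the tracked logits."""
    value = generator(prompt)
    return Result(
        value=value,
        logits={
            "unstructured": list(tracker.unstructured),
            "structured": list(tracker.structured),
        },
    )


tracker = FakeTracker()


def fake_generator(prompt: str) -> str:
    # Pretend one token was sampled and its logits were recorded
    tracker.unstructured.append([2.0, 1.0])
    tracker.structured.append([float("-inf"), 1.0])
    return '{"name": "Ada", "age": 36}'


result = generate_with_logits(fake_generator, tracker, "Make me a person.")
print(result.value, result.logits["structured"])
```

The caller then reads result.value and result.logits instead of reaching into generator.logits_processor, which is the developer-experience gain the comment has in mind.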
