Refactor the Transformers and TransformersVision models
Robin Picard committed Feb 20, 2025
1 parent cac20af commit ff6366c
Showing 9 changed files with 759 additions and 1,334 deletions.
150 changes: 64 additions & 86 deletions docs/reference/models/transformers.md
# Transformers


!!! Installation

You need to install the `transformers` library to be able to use these models in Outlines, or alternatively:

```bash
pip install "outlines[transformers]"
```

## Create a `Transformers` model

The only mandatory argument to instantiate a `Transformers` model is the name of the model to use.
```python
from outlines import models

model = models.Transformers("microsoft/Phi-3-mini-4k-instruct")
```

The model name must be a valid `transformers` model name. You can find a list of all available models on the [HuggingFace Hub](https://huggingface.co/models).

When instantiating a `Transformers` model this way, the class creates a model from the `transformers` library using the `AutoModelForCausalLM` class by default (`transformers.AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)`).

You can also provide keyword arguments in an optional `model_kwargs` parameter. Those will be passed to the `from_pretrained` method of the model class. One such argument is `device_map`, which allows you to specify the device on which the model will be loaded.

For instance:
```python
from outlines import models

model = models.Transformers("microsoft/Phi-3-mini-4k-instruct", model_kwargs={"device_map": "cuda"})
```

## Alternative model classes

If the model you want to use is not compatible with `AutoModelForCausalLM`, you must provide a value for the `model_class` parameter. This value must be a valid `transformers` model class.

For instance:
```python
from outlines import models
from transformers import AutoModelForSeq2SeqLM

model = models.Transformers("facebook/bart-large", model_class=AutoModelForSeq2SeqLM)
```

When you instantiate a `Transformers` model, the class also creates a `Tokenizer` instance from the `AutoTokenizer` class. You can provide keyword arguments in an optional `tokenizer_kwargs` parameter. Those will be passed to the `from_pretrained` method of the tokenizer class: `tokenizer_class.from_pretrained(model_name, **tokenizer_kwargs)`.
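
For example, a minimal sketch passing a tokenizer option through `tokenizer_kwargs` (the `padding_side` value mirrors the Mamba example below and is only illustrative):
```python
from outlines import models

# Keyword arguments forwarded to AutoTokenizer.from_pretrained;
# padding_side is only an illustrative option.
model = models.Transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    tokenizer_kwargs={"padding_side": "left"},
)
```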

Similarly, if your model is not compatible with `AutoTokenizer`, you must provide a value for the `tokenizer_class` parameter.

```python
from outlines import models
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_pile_t5 = models.Transformers(
    model_name="EleutherAI/pile-t5-large",
    model_class=T5ForConditionalGeneration,
    tokenizer_class=T5Tokenizer
)
```


### Mamba

[Mamba](https://github.com/state-spaces/mamba) is a transformers alternative that employs memory-efficient, linear-time decoding.

To use Mamba with Outlines you must first install the necessary requirements:

```bash
pip install causal-conv1d>=1.2.0 mamba-ssm torch transformers
```

Then you can create a `Mamba` Outlines model via:
```python
from outlines import models

model = models.Mamba("state-spaces/mamba-2.8b-hf", model_kwargs={"device_map": "cuda"}, tokenizer_kwargs={"padding_side": "left"})
```


Alternatively, you can use the `Transformers` class to create a `Mamba` model by providing the appropriate `model_class` and `tokenizer_class` arguments.
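
For instance, a minimal sketch of the equivalent setup through the generic `Transformers` class (assuming `MambaForCausalLM` is compatible with the default `AutoTokenizer` for this checkpoint):
```python
from outlines import models
from transformers import MambaForCausalLM

# Same checkpoint as above, loaded through the generic Transformers class
model = models.Transformers(
    "state-spaces/mamba-2.8b-hf",
    model_class=MambaForCausalLM,
)
```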

Read [`transformers`'s documentation](https://huggingface.co/docs/transformers/en/model_doc/mamba) for more information.

You can use encoder-decoder (seq2seq) models like T5 and BART with Outlines.

Be cautious with model selection though: some models, such as `t5-base`, don't include certain characters (`{`), and you may get an error when trying to perform structured generation.

## Use the model to generate text

Once you have created a `Transformers` model, you can use it to generate text by calling the instance of the model.
```python
model("Hello, how are you?")
```

You can also first create a `Generator` and then call it.
```python
from outlines import Generator

generator = Generator(model)
generator("Hello, how are you?")
```

`Transformers` models typically support batching and the generation of several samples at once.

For instance:
```python
model(["Hello, how are you?", "Respond with one word. Not more."], num_return_sequences=2, num_beams=2)
```

This would generate two sequences for each prompt, for a total of four sequences (returned as a list containing two lists of two elements each).

## Use the model to generate structured data

`Transformers` models can generate structured data by providing a value for the parameter `output_type` (the second positional argument of the `generate` method, right after the prompt).

Supported types include `Json`, `Choice`, `Regex` and `CFG`.

For instance:
```python
from outlines.types import Json
from pydantic import BaseModel

class Character(BaseModel):
    name: str

model("Create a character with a name.", Json(Character))
```
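
As another illustration, here is a minimal sketch using the `Regex` type, mirroring the usage shown for the vision models below (the phone-number pattern is purely illustrative):
```python
from outlines.types import Regex

# Constrain the output to a simple phone-number pattern (illustrative only),
# reusing the `model` instance created above.
model("Jenny gave me her number, it's ", Regex(r"\+?[1-9][0-9]{7,14}"))
```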
128 changes: 65 additions & 63 deletions docs/reference/models/transformers_vision.md

Outlines allows seamless use of [vision models](https://huggingface.co/learn/computer-vision-course/en/unit4/multimodal-models/tasks-models-part1).

`outlines.models.TransformersVision` shares interfaces with, and is based on, [outlines.models.Transformers](./transformers.md).

## Create a `TransformersVision` model

`TransformersVision` models inherit from `Transformers` and accept the same initialization parameters.

In addition, they also accept the optional parameters `processor_class` and `processor_kwargs`. Those are used to create a `Processor` instance that is then used to preprocess the images. By default, `AutoProcessor` is used to create the processor as such: `AutoProcessor.from_pretrained(model_name, **processor_kwargs)`.

If your model is not compatible with `AutoProcessor`, you must provide a value for the `processor_class` parameter.
For instance:
```python
from outlines import models
from transformers import CLIPModel, CLIPProcessor

model = models.TransformersVision("openai/clip-vit-base-patch32", model_class=CLIPModel, processor_class=CLIPProcessor)
```
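
Similarly, anything you put in `processor_kwargs` is forwarded to the processor's `from_pretrained` call. A hypothetical sketch (the model name and the `trust_remote_code` flag are only illustrative `from_pretrained` inputs):
```python
from outlines import models

# Keyword arguments forwarded to AutoProcessor.from_pretrained;
# trust_remote_code is only an illustrative option.
model = models.TransformersVision(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    processor_kwargs={"trust_remote_code": True},
)
```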

## Use the model to generate text from prompts and images

When calling the model, the prompt argument you provide must be a dictionary with a key `"prompts"` and a key `"images"`. The associated values must be a string or a list of strings for the prompts, and a PIL image or a list of PIL images for the images. Your prompts must include `<image>` tags to indicate where the images should be inserted. There must be as many `<image>` tags as there are images.

For easier use, we recommend creating a convenience function to load a `PIL.Image` from a URL.
```python
from io import BytesIO
from urllib.request import urlopen
from PIL import Image

def img_from_url(url):
    # Download the image and convert it to an RGB PIL image
    img_byte_stream = BytesIO(urlopen(url).read())
    return Image.open(img_byte_stream).convert("RGB")
```

You can then call the model with your prompts and images to generate text.
```python
from transformers import LlavaForConditionalGeneration
from outlines import models

model = models.TransformersVision("trl-internal-testing/tiny-LlavaForConditionalGeneration", model_class=LlavaForConditionalGeneration)
prompt = {
    "prompts": "<image> detailed description:",
    "images": img_from_url("https://upload.wikimedia.org/wikipedia/commons/2/25/Siam_lilacpoint.jpg")
}
model(prompt)
```

#### Multiple Images

You can include several images per prompt by adding more `<image>` tags to the prompt. Batching is also supported.
```python
from transformers import LlavaForConditionalGeneration
from outlines import models

model = models.TransformersVision("trl-internal-testing/tiny-LlavaForConditionalGeneration", model_class=LlavaForConditionalGeneration)
prompt = {
    "prompts": ["<image><image>detailed description:", "<image><image>. What animals are present?"],
    "images": [
        img_from_url("https://upload.wikimedia.org/wikipedia/commons/2/25/Siam_lilacpoint.jpg"),
        img_from_url("https://upload.wikimedia.org/wikipedia/commons/7/71/2010-kodiak-bear-1.jpg"),
        img_from_url("https://upload.wikimedia.org/wikipedia/commons/2/25/Siam_lilacpoint.jpg"),
        img_from_url("https://upload.wikimedia.org/wikipedia/commons/7/71/2010-kodiak-bear-1.jpg"),
    ]
}
model(prompt)
```

Here we have two prompts, each expecting two images, so we provide four images in total. This will generate two descriptions, one for each prompt.

### Use the model for structured generation

You can use the model to generate structured data by providing a value for the parameter `output_type` (the second positional argument of the `generate` method, right after the prompt).

Supported types include `Json`, `Choice`, `Regex` and `CFG`.

For instance, to do classification, you can use the `Regex` type:
```python
pattern = "Mercury|Venus|Earth|Mars|Saturn|Jupiter|Neptune|Uranus|Pluto"
planet_generator = outlines.generate.regex(model, pattern)
from outlines import models
from outlines.types import Regex
from transformers import LlavaForConditionalGeneration

planet_generator(
"What planet is this: <image>",
[img_from_url("https://upload.wikimedia.org/wikipedia/commons/e/e3/Saturn_from_Cassini_Orbiter_%282004-10-06%29.jpg")]
)
model = models.TransformersVision("trl-internal-testing/tiny-LlavaForConditionalGeneration", model_class=LlavaForConditionalGeneration)
pattern = "Mercury|Venus|Earth|Mars|Saturn|Jupiter|Neptune|Uranus|Pluto"
prompt = {
"prompts": "<image>detailed description:",
"images": img_from_url("https://upload.wikimedia.org/wikipedia/commons/e/e3/Saturn_from_Cassini_Orbiter_%282004-10-06%29.jpg"),
}
model(prompt, Regex(pattern))
```

Another example could be to generate a structured description of an image using the `Json` type:
```python
from outlines import models
from outlines.types import Json
from pydantic import BaseModel
from transformers import LlavaForConditionalGeneration
from typing import List, Optional

class ImageData(BaseModel):
    caption: str
    tags_list: List[str]
    object_list: List[str]
    is_photo: bool

model = models.TransformersVision("trl-internal-testing/tiny-LlavaForConditionalGeneration", model_class=LlavaForConditionalGeneration)
prompt = {
    "prompts": "<image> detailed JSON metadata:",
    "images": img_from_url("https://upload.wikimedia.org/wikipedia/commons/9/98/Aldrin_Apollo_11_original.jpg"),
}
model(prompt, Json(ImageData))
```


## Resources

### Choosing a model
6 changes: 3 additions & 3 deletions outlines/models/__init__.py
```python
from .mlxlm import MLXLM, mlxlm
from .ollama import Ollama
from .openai import AzureOpenAI, OpenAI
from .transformers import Mamba, Transformers, TransformerTokenizer
from .transformers_vision import TransformersVision
from .vllm import VLLM, vllm

LogitsGenerator = Union[
    Transformers, LlamaCpp, OpenAI, ExLlamaV2Model, MLXLM, VLLM, Ollama
]

LocalModel = Union[LlamaCpp, Transformers]
APIModel = Union[AzureOpenAI, OpenAI, Anthropic, Gemini, Ollama]
```
