forked from dottxt-ai/outlines
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Introduce `outlines.models.transformers_vision` (dottxt-ai#1052)
Rendered Docs: https://github.com/lapp0/outlines/blob/multimodal-models/docs/reference/models/transformers_vision.md

- Fixes dottxt-ai#787
- Fixes dottxt-ai#662

# Changes

- Introduce `models.transformers_vision`, which subclasses `models.transformers` and uses `AutoProcessor` instead of `AutoTokenizer` so that it can handle both the text and the `PIL.Image` media
- Introduce `VisionSequenceGeneratorAdapter`, which handles and validates the `media` argument
- Update `outlines.generate` to dispatch `TransformersVision` models to `VisionSequenceGeneratorAdapter`

# Tests

- `tests/generate/test_api.py`: Test `prompt` / `media` validation
- `tests/generate/test_generate.py`:
  - Add `model_transformers_vision` fixture. **Tests pass locally, but are disabled because a model small enough for CI isn't available**
  - Test all `outlines.generate` generators to ensure that dispatch to the new sequence generator is handled correctly
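
The commit message says that `outlines.generate` dispatches `TransformersVision` models to `VisionSequenceGeneratorAdapter`, but it does not show the mechanism. One common way to express type-based dispatch in Python is `functools.singledispatch`. The sketch below assumes that approach purely for illustration: the two classes are empty stand-ins, and `make_generator` is a hypothetical function that only returns the name of the adapter it would pick.

```python
from functools import singledispatch


class Transformers:
    """Hypothetical stand-in for a text-only model class."""


class TransformersVision(Transformers):
    """Hypothetical stand-in for the vision subclass."""


@singledispatch
def make_generator(model) -> str:
    # Fallback used for any model type without a more specific registration.
    return "SequenceGeneratorAdapter"


@make_generator.register(TransformersVision)
def _(model) -> str:
    # Chosen whenever the first argument is a TransformersVision instance.
    return "VisionSequenceGeneratorAdapter"


print(make_generator(Transformers()))
print(make_generator(TransformersVision()))
```

Because `singledispatch` picks the most specific registered type, the vision subclass gets its own adapter even though it inherits from the text-only class.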
Showing 10 changed files with 596 additions and 79 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
# Transformers Vision | ||
|
||
Outlines allows seamless use of [vision models](https://huggingface.co/learn/computer-vision-course/en/unit4/multimodal-models/tasks-models-part1). | ||
|
||
`outlines.models.transformers_vision` is based on [`outlines.models.transformers`](./transformers.md) and shares its interface. | ||
|
||
Tasks supported include | ||
- image + text -> text | ||
- video + text -> text | ||
|
||
|
||
|
||
## Example: Using [Llava-Next](https://huggingface.co/docs/transformers/en/model_doc/llava_next) Vision Models | ||
|
||
Install dependencies | ||
`pip install torchvision pillow flash-attn` | ||
|
||
Create the model | ||
```python | ||
import outlines | ||
|
||
model = outlines.models.transformers_vision( | ||
"llava-hf/llava-v1.6-mistral-7b-hf", | ||
device="cuda", | ||
) | ||
``` | ||
|
||
Create a convenience function to load a `PIL.Image` from a URL | ||
```python | ||
from PIL import Image | ||
from io import BytesIO | ||
from urllib.request import urlopen | ||
|
||
def img_from_url(url): | ||
    img_byte_stream = BytesIO(urlopen(url).read()) | ||
    return Image.open(img_byte_stream).convert("RGB") | ||
``` | ||
|
||
### Describing an image | ||
|
||
```python | ||
description_generator = outlines.generate.text(model) | ||
description_generator( | ||
"<image> detailed description:", | ||
[img_from_url("https://upload.wikimedia.org/wikipedia/commons/2/25/Siam_lilacpoint.jpg")] | ||
) | ||
``` | ||
|
||
> This is a color photograph featuring a Siamese cat with striking blue eyes. The cat has a creamy coat and a light eye color, which is typical for the Siamese breed. Its features include elongated ears, a long, thin tail, and a striking coat pattern. The cat is sitting in an indoor setting, possibly on a cat tower or a similar raised platform, which is covered with a beige fabric, providing a comfortable and soft surface for the cat to rest or perch. The surface of the wall behind the cat appears to be a light-colored stucco or plaster. | ||
#### Multiple Images | ||
|
||
To include multiple images, add one `<image>` token to the prompt for each image and pass the images as a list in the same order | ||
|
||
```python | ||
image_urls = [ | ||
"https://cdn1.byjus.com/wp-content/uploads/2020/08/ShapeArtboard-1-copy-3.png", # triangle | ||
"https://cdn1.byjus.com/wp-content/uploads/2020/08/ShapeArtboard-1-copy-11.png", # hexagon | ||
] | ||
description_generator = outlines.generate.text(model) | ||
description_generator( | ||
"<image><image>What shapes are present?", | ||
list(map(img_from_url, image_urls)), | ||
) | ||
``` | ||
|
||
> There are two shapes present. One shape is a hexagon and the other shape is an triangle. | ||
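
A mismatch between the number of `<image>` tokens and the number of images is easy to introduce when prompts are built by hand. The following is a minimal sketch of a pre-flight check, assuming the processor expects exactly one `<image>` placeholder per image, as in the Llava-Next examples here. `check_image_tokens` is a hypothetical helper and not part of Outlines.

```python
def check_image_tokens(prompt: str, images: list, token: str = "<image>") -> None:
    """Raise ValueError if the placeholder count differs from the image count."""
    n_tokens = prompt.count(token)
    if n_tokens != len(images):
        raise ValueError(
            f"prompt has {n_tokens} {token!r} placeholder(s) "
            f"but {len(images)} image(s) were given"
        )


# Stand-in objects; in practice these would be PIL images.
images = ["first image", "second image"]

# Two placeholders and two images: the check passes silently.
check_image_tokens("<image><image>What shapes are present?", images)
```

Calling the helper before the generator turns a silent mismatch into an explicit error that names both counts.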
|
||
### Classifying an Image | ||
|
||
```python | ||
pattern = "Mercury|Venus|Earth|Mars|Saturn|Jupiter|Neptune|Uranus|Pluto" | ||
planet_generator = outlines.generate.regex(model, pattern) | ||
|
||
planet_generator( | ||
"What planet is this: <image>", | ||
[img_from_url("https://upload.wikimedia.org/wikipedia/commons/e/e3/Saturn_from_Cassini_Orbiter_%282004-10-06%29.jpg")] | ||
) | ||
``` | ||
|
||
> Saturn | ||
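
The pattern above is a plain regex alternation, so it can also be built from a list of labels. The sketch below uses only the standard `re` module and needs no model. `is_valid_label` is a hypothetical helper for checking an output after the fact, and the `re.escape` call is an added precaution for labels that might contain regex metacharacters.

```python
import re

planets = [
    "Mercury", "Venus", "Earth", "Mars", "Saturn",
    "Jupiter", "Neptune", "Uranus", "Pluto",
]

# Escape each label, then join them into a single alternation.
pattern = "|".join(re.escape(name) for name in planets)


def is_valid_label(text: str) -> bool:
    """Return True only if the whole string is exactly one of the labels."""
    return re.fullmatch(pattern, text) is not None


print(pattern)
print(is_valid_label("Saturn"))
print(is_valid_label("Saturn V"))
```

The resulting `pattern` string can be passed to `outlines.generate.regex(model, pattern)` in the same way as the hand-written one.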
|
||
### Extracting Structured Image data | ||
|
||
```python | ||
from pydantic import BaseModel | ||
from typing import List | ||
|
||
class ImageData(BaseModel): | ||
    caption: str | ||
    tags_list: List[str] | ||
    object_list: List[str] | ||
    is_photo: bool | ||
|
||
image_data_generator = outlines.generate.json(model, ImageData) | ||
|
||
image_data_generator( | ||
"<image> detailed JSON metadata:", | ||
[img_from_url("https://upload.wikimedia.org/wikipedia/commons/9/98/Aldrin_Apollo_11_original.jpg")] | ||
) | ||
``` | ||
|
||
> `ImageData(caption='An astronaut on the moon', tags_list=['moon', 'space', 'nasa', 'americanflag'], object_list=['moon', 'moon_surface', 'space_suit', 'americanflag'], is_photo=True)` | ||
|
||
## Resources | ||
|
||
### Choosing a model | ||
- https://mmbench.opencompass.org.cn/leaderboard | ||
- https://huggingface.co/spaces/WildVision/vision-arena |