smolagents should have vision language support 💗

This can be done in two ways:

1. Vision language models as tools that parse the input into a structured output, which is then passed to LLM agents:
   - Screenshot parsing tools like ShowUI or OmniParser
   - Document parsing tools like GOT-OCR (which will soon be integrated into transformers)
2. Using VLMs as the main model in the agent and passing the input to it directly:
   - This is a more image-native approach; I am curious how it will work with smaller models, though.
   - The open question is at which step we pass in the images: for document use cases the first step should be enough, while for screenshot use cases it could be every step, or every step where a click happens. We could eventually let the model decide this, but for initial support that would be too complicated. Rough sketches of both approaches are below.
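
To make approach 1 concrete, here is a rough sketch of how a screenshot parser like OmniParser could be exposed as a regular smolagents tool, so the text-only LLM agent only ever sees structured text. The `@tool` decorator, `CodeAgent` and `HfApiModel` are existing smolagents pieces; `omniparser_parse` is just a hypothetical placeholder for whatever inference wrapper we end up with.

```python
from PIL import Image
from smolagents import CodeAgent, HfApiModel, tool


def omniparser_parse(image: Image.Image) -> list[dict]:
    """Hypothetical stand-in for an OmniParser-style screenshot parser.

    A real implementation would run the VLM and return detected UI elements.
    """
    return [{"label": "settings_button", "bbox": (10, 20, 60, 45)}]


@tool
def parse_screenshot(image_path: str) -> str:
    """Parses a UI screenshot into a structured text description of its elements.

    Args:
        image_path: Path to the screenshot image on disk.
    """
    image = Image.open(image_path)
    elements = omniparser_parse(image)
    # Return plain text so a text-only LLM agent can reason over the UI.
    return "\n".join(f"{el['label']} at {el['bbox']}" for el in elements)


# The agent itself stays text-only; the VLM lives inside the tool.
agent = CodeAgent(tools=[parse_screenshot], model=HfApiModel())
agent.run("Open the settings menu in the app shown in screenshot.png")
```

A document parsing tool (e.g. wrapping GOT-OCR) would look the same, just returning the extracted text instead of UI elements.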
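And for approach 2, here is a sketch of what the image-native variant could look like. To be clear, the `images=` argument on `run()` and the `step_callbacks=` hook below are proposed/assumed API for this feature, not something I am claiming exists today, and `take_screenshot` is a hypothetical helper for whatever environment the agent controls.

```python
from PIL import Image
from smolagents import CodeAgent, HfApiModel


def take_screenshot() -> Image.Image:
    # Hypothetical helper: in a browser/OS agent this would capture the live UI.
    return Image.new("RGB", (1280, 720))


def attach_screenshot(memory_step, agent):
    """Proposed per-step hook: refresh the screenshot so the VLM sees the
    current UI state at every step (or only at steps where a click happened)."""
    memory_step.observations_images = [take_screenshot()]


# A VLM as the agent's main model instead of a text-only LLM.
model = HfApiModel(model_id="Qwen/Qwen2-VL-7B-Instruct")

# Document use case: the image is only needed once, so pass it with the task.
doc_agent = CodeAgent(tools=[], model=model)
doc_agent.run(
    "Summarize the key figures in this scanned invoice.",
    images=[Image.open("invoice.png")],  # proposed argument
)

# Screenshot/UI use case: re-attach an image at every step via a callback.
ui_agent = CodeAgent(tools=[], model=model, step_callbacks=[attach_screenshot])
ui_agent.run("Open the settings menu and enable dark mode.")
```

The nice property of the callback is that the "when to pass images" policy stays outside the agent loop, so we can start with simple fixed policies and only later think about letting the model decide.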
This excites me a lot, so I am willing to tackle this one PR at a time.