Vision Language Support #176

Open · 4 tasks
merveenoyan opened this issue Jan 13, 2025 · 1 comment

@merveenoyan commented Jan 13, 2025

smolagents should have vision language support 💗

This can work in two ways:

  1. Vision language models as tools that parse visual input into structured output to pass to LLM agents (see the tool sketch after this list)
  • screenshot-parsing tools like ShowUI or OmniParser
  • document-parsing tools like GOT-OCR (which will soon be integrated into transformers)
  2. Using a VLM as the main model of the agent and passing visual input directly
  • this is a more image-native approach; I am curious how it will work with smaller models, though
  • the open question is at which step we pass in the images: for document use cases the first step should be enough, while for screenshot cases it could be every step, or every step where a click happens. We could eventually let the model decide this, but for initial support I think that would be too complicated. (a sketch of this interface is at the end of this comment)
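
To make the first option concrete, here is a minimal sketch of a document-parsing VLM wrapped as a smolagents Tool. The subclass pattern (name/description/inputs/output_type/forward) follows smolagents' documented Tool API, but the tool itself and the checkpoint are illustrative assumptions, using a generic transformers image-to-text pipeline as a stand-in until GOT-OCR lands in transformers:

```python
# Sketch of approach 1: a document-parsing VLM exposed as a smolagents Tool.
# The model checkpoint is a placeholder stand-in for GOT-OCR.
from smolagents import Tool
from transformers import pipeline


class DocumentParsingTool(Tool):
    name = "document_parser"
    description = "Extracts the text content of a document image and returns it as a string."
    inputs = {
        "image_path": {
            "type": "string",
            "description": "Path to the document image to parse.",
        }
    }
    output_type = "string"

    def __init__(self, model_id="microsoft/trocr-base-printed", **kwargs):
        super().__init__(**kwargs)
        # Any image-to-text checkpoint works here; swap in GOT-OCR once
        # it is integrated into transformers.
        self.ocr = pipeline("image-to-text", model=model_id)

    def forward(self, image_path: str) -> str:
        # The pipeline returns a list of dicts with a "generated_text" key.
        return self.ocr(image_path)[0]["generated_text"]
```

The structured output (here just a string) is then passed to the text-only agent like any other tool result.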

This excites me a lot, so I am willing to tackle this one PR at a time.
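
For the second option, the interface could look roughly like the snippet below, with images attached once at the first step (the document use case). The `images=` parameter on `run` and the model choice are assumptions for illustration, not a shipped smolagents API:

```python
# Sketch of approach 2: a VLM as the agent's main model, with images
# passed in directly at the first step only (document use case).
from PIL import Image
from smolagents import CodeAgent, TransformersModel

model = TransformersModel(model_id="Qwen/Qwen2-VL-7B-Instruct")  # any chat VLM
agent = CodeAgent(tools=[], model=model)

page = Image.open("invoice_page_1.png")
result = agent.run(
    "Extract the total amount due from the attached invoice.",
    images=[page],  # assumed parameter for initial vision support
)
print(result)
```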

@aymeric-roucher (Collaborator) commented

Let's gooo!
FYI, I've already made a branch that can be useful for ideas: https://github.com/huggingface/smolagents/tree/vlm-based-browser. The logging of images in memory is rudimentary, but I think the demo with a web browser is quite cool!
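
For the screenshot-every-step case from that branch, the pattern could be a step callback that grabs a fresh browser screenshot after each agent step and attaches it to that step's memory. This is a rough sketch: the `step_callbacks` hook and the `observations_images` attribute are assumptions based on this pattern, not a confirmed API:

```python
# Sketch of per-step screenshot logging for a VLM-driven web browser agent.
from io import BytesIO

from PIL import Image
from selenium import webdriver
from smolagents import CodeAgent, TransformersModel

driver = webdriver.Chrome()

def save_screenshot(memory_step, agent):
    # Capture the current page and attach it to this step's memory so the
    # VLM sees a fresh screenshot when planning the next step.
    png = driver.get_screenshot_as_png()
    memory_step.observations_images = [Image.open(BytesIO(png))]

agent = CodeAgent(
    tools=[],
    model=TransformersModel(model_id="Qwen/Qwen2-VL-7B-Instruct"),
    step_callbacks=[save_screenshot],  # assumed hook for per-step image logging
)
```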
