smolagents should have vision language support 💗

This can be done in two ways:

1. Vision language models as tools that parse the input into a structured output, which is then passed to LLM agents:
   - Screenshot parsing tools like ShowUI or OmniParser
   - Document parsing tools like GOT-OCR (which will soon be integrated into transformers)
2. Using VLMs as the main model in the agent and passing the input to it directly:
   - This is a more image-native approach; I am curious how it will work with smaller models, though.
   - The open question is at which step we pass in the images: for document use cases the first step should be enough, while for screenshot use cases it could be every step, or every step where a click happens. We could eventually let the model decide this, but for initial support that would be too complicated. Rough sketches of both approaches are below.
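
To make approach 1 concrete, here is a rough sketch of how a screenshot parser like OmniParser could be exposed as a regular smolagents tool, so the text-only LLM agent only ever sees structured text. The `@tool` decorator, `CodeAgent` and `HfApiModel` are existing smolagents pieces; `omniparser_parse` is just a hypothetical placeholder for whatever inference wrapper we end up with.

```python
from PIL import Image
from smolagents import CodeAgent, HfApiModel, tool


def omniparser_parse(image: Image.Image) -> list[dict]:
    """Hypothetical stand-in for an OmniParser-style screenshot parser.

    A real implementation would run the VLM and return detected UI elements.
    """
    return [{"label": "settings_button", "bbox": (10, 20, 60, 45)}]


@tool
def parse_screenshot(image_path: str) -> str:
    """Parses a UI screenshot into a structured text description of its elements.

    Args:
        image_path: Path to the screenshot image on disk.
    """
    image = Image.open(image_path)
    elements = omniparser_parse(image)
    # Return plain text so a text-only LLM agent can reason over the UI.
    return "\n".join(f"{el['label']} at {el['bbox']}" for el in elements)


# The agent itself stays text-only; the VLM lives inside the tool.
agent = CodeAgent(tools=[parse_screenshot], model=HfApiModel())
agent.run("Open the settings menu in the app shown in screenshot.png")
```

A document parsing tool (e.g. wrapping GOT-OCR) would look the same, just returning the extracted text instead of UI elements.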
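And for approach 2, here is a sketch of what the image-native variant could look like. To be clear, the `images=` argument on `run()` and the `step_callbacks=` hook below are proposed/assumed API for this feature, not something I am claiming exists today, and `take_screenshot` is a hypothetical helper for whatever environment the agent controls.

```python
from PIL import Image
from smolagents import CodeAgent, HfApiModel


def take_screenshot() -> Image.Image:
    # Hypothetical helper: in a browser/OS agent this would capture the live UI.
    return Image.new("RGB", (1280, 720))


def attach_screenshot(memory_step, agent):
    """Proposed per-step hook: refresh the screenshot so the VLM sees the
    current UI state at every step (or only at steps where a click happened)."""
    memory_step.observations_images = [take_screenshot()]


# A VLM as the agent's main model instead of a text-only LLM.
model = HfApiModel(model_id="Qwen/Qwen2-VL-7B-Instruct")

# Document use case: the image is only needed once, so pass it with the task.
doc_agent = CodeAgent(tools=[], model=model)
doc_agent.run(
    "Summarize the key figures in this scanned invoice.",
    images=[Image.open("invoice.png")],  # proposed argument
)

# Screenshot/UI use case: re-attach an image at every step via a callback.
ui_agent = CodeAgent(tools=[], model=model, step_callbacks=[attach_screenshot])
ui_agent.run("Open the settings menu and enable dark mode.")
```

The nice property of the callback is that the "when to pass images" policy stays outside the agent loop, so we can start with simple fixed policies and only later think about letting the model decide.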
This excites me a lot, so I am willing to tackle this one PR at a time.