Add support for Transformers audio models #1452
Conversation
""" | ||
prompt = { | ||
"text": text, | ||
"audios": audio_from_url(URL) |
"audios": audio_from_url(URL) | |
"audio_files": audio_from_url(URL) |
Just a minor naming issue
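For readers of the docstring example above: `audio_from_url` is referenced but not shown here. The sketch below is a guess at the shape of such a helper, not the PR's actual code; the `soundfile` dependency is an assumption.

```python
import io
import urllib.request

import numpy as np
import soundfile as sf


def audio_from_url(url: str) -> np.ndarray:
    """Hypothetical helper: download an audio file and decode it
    to a mono float32 waveform, as the docstring example assumes."""
    with urllib.request.urlopen(url) as response:
        data, _sample_rate = sf.read(io.BytesIO(response.read()), dtype="float32")
    # Collapse to mono if the file is multi-channel.
    if data.ndim > 1:
        data = data.mean(axis=1)
    return data
```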
```python
{
    "text": Union[str, List[str]],
    "images": Optional[Union[Any, List[Any]]],
    "audios": Optional[Union[Any, List[Any]]],
```
"audios": Optional[Union[Any, List[Any]]], | |
"audio_files": Optional[Union[Any, List[Any]]], |
I used `audios` because it's the keyword used by the Qwen2Audio processor (for the same reason I changed `prompts` into `text`). I think letting users choose whatever keyword their processor uses (apart from `text`, which is shared by all) may be the best solution. We would still use `inputs = self.processor(**model_input, padding=True, return_tensors="pt")`. Another advantage is that it lets users provide optional kwargs for the processor call.
If those are the names of the arguments in `transformers`, so we can do `inputs = self.processor(**model_input, padding=True, return_tensors="pt")`, let's keep them this way.
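To make the agreed-upon behavior concrete, here is a minimal sketch of forwarding the prompt dict to the processor. Only the `processor(**model_input, padding=True, return_tensors="pt")` call reflects the PR; the dummy waveform and prompt string are illustrative stand-ins.

```python
import numpy as np
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")

# The prompt dict's keys match the processor's keyword arguments
# (`text` everywhere, `audios` for Qwen2Audio), so it can be
# unpacked directly; any extra keys act as optional processor kwargs.
model_input = {
    "text": "<|audio_bos|><|AUDIO|><|audio_eos|>What can you hear?",
    "audios": [np.zeros(16000, dtype=np.float32)],  # 1 s of silence as a stand-in waveform
}
inputs = processor(**model_input, padding=True, return_tensors="pt")
```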
Addresses issue #1270

This PR turns the class `TransformersVision` into `TransformersMultiModal`, which accommodates both vision and audio models. We should be able to use this same class for video models with minimal changes, but I did not find a good model to test it out.

Something I'm unsure about concerns the keys of the dict provided as a prompt when calling a `TransformersMultiModal` model: I have the impression that they are standardized among `transformers` processors (`text` in all cases, `images` for vision models and `audios` for audio models), but I'm not sure of it. If it turns out they are not, we could remove the check on the keys of the dict and give responsibility to the user for providing what their processor needs (a sketch of such a check follows below).

I did not add model tests for audio models, as the model I found (Qwen/Qwen2-Audio-7B-Instruct) is too heavy to run in the CI (and on some people's devices).
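For illustration, a hypothetical version of the key check discussed above; the function name, allowed-key set, and error messages are assumptions, not the PR's code.

```python
# Hypothetical sketch of the prompt-key check discussed above.
ALLOWED_KEYS = {"text", "images", "audios"}


def validate_model_input(model_input: dict) -> None:
    if "text" not in model_input:
        raise ValueError("A multimodal prompt must contain a 'text' key.")
    unknown = set(model_input) - ALLOWED_KEYS
    if unknown:
        raise ValueError(
            f"Unexpected prompt keys {sorted(unknown)}; "
            f"expected a subset of {sorted(ALLOWED_KEYS)}."
        )
```

Dropping the check would amount to deleting the `unknown` branch and trusting the processor to reject keyword arguments it does not understand.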