Add support for Transformers audio models #1452

Merged: 1 commit merged into dottxt-ai:v1.0 on Feb 26, 2025

Conversation

RobinPicard (Contributor)

Addresses issue #1270

This PR turns the class TransformersVision into TransformersMultiModal, which accommodates both vision and audio models. We should be able to use this same class for video models with minimal changes, but I did not find a good model to test it with.

Something I'm unsure about is the set of keys of the dict provided as a prompt when calling a TransformersMultiModal model: I have the impression that they are standardized across transformers processors (text in all cases, images for vision models, and audios for audio models), but I'm not sure of it. If it turns out they are not, we could remove the check on the keys of the dict and make the user responsible for providing what their processor needs.
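
For illustration, the key check mentioned above could look roughly like the sketch below. This is only a sketch assuming standardized keys; the function name and error messages are mine, not the PR's actual code.

ALLOWED_KEYS = {"text", "images", "audios"}

def check_prompt_keys(model_input: dict) -> None:
    # "text" is shared by all processors and therefore required; any other
    # key must be one of the standardized multimodal keys.
    if "text" not in model_input:
        raise ValueError("A multimodal prompt must contain a 'text' key.")
    unexpected = set(model_input) - ALLOWED_KEYS
    if unexpected:
        raise ValueError(f"Unexpected prompt keys: {sorted(unexpected)}")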

I did not add model tests for audio models because the model I found (Qwen/Qwen2-Audio-7B-Instruct) is too heavy to run in the CI (and on some people's devices).

@RobinPicard RobinPicard requested a review from rlouf February 26, 2025 17:07
@RobinPicard RobinPicard linked an issue Feb 26, 2025 that may be closed by this pull request
@RobinPicard RobinPicard added the enhancement and transformers (Linked to the `transformers` integration) labels Feb 26, 2025
@RobinPicard RobinPicard self-assigned this Feb 26, 2025
"""
prompt = {
"text": text,
"audios": audio_from_url(URL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"audios": audio_from_url(URL)
"audio_files": audio_from_url(URL)

@rlouf rlouf (Member) left a comment:

Just a minor naming issue.

rlouf (Member) commented on the expected prompt-dict type:

{
    "text": Union[str, List[str]],
    "images": Optional[Union[Any, List[Any]]],
    "audios": Optional[Union[Any, List[Any]]],
}

Suggested change:
-    "audios": Optional[Union[Any, List[Any]]],
+    "audio_files": Optional[Union[Any, List[Any]]],

RobinPicard (Contributor, Author) commented on Feb 26, 2025:

I used audios because it's the keyword used by the Qwen2Audio processor (for the same reason, I changed prompts into text). I think letting users choose whatever keyword their processor uses (apart from text, which is shared by all) may be the best solution. We would still use inputs = self.processor(**model_input, padding=True, return_tensors="pt"). Another advantage is that it lets users provide optional kwargs for the processor call.
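
For context, the pass-through call described here can be exercised directly with transformers; the prompt text and placeholder audio below are illustrative only, not taken from the PR:

import numpy as np
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")

# Each key of the prompt dict maps onto a keyword argument of the processor;
# Qwen2Audio's processor accepts `text` and `audios`.
model_input = {
    "text": "<|audio_bos|><|AUDIO|><|audio_eos|>Describe this clip.",
    "audios": [np.zeros(16000, dtype=np.float32)],  # placeholder 1 s of silence
}
inputs = processor(**model_input, padding=True, return_tensors="pt")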

rlouf (Member) replied:

If those are the names of the arguments in transformers, so that we can do inputs = self.processor(**model_input, padding=True, return_tensors="pt"), let's keep them this way.

@RobinPicard RobinPicard merged commit f4497dd into dottxt-ai:v1.0 Feb 26, 2025
5 of 6 checks passed
Labels: enhancement, transformers (Linked to the `transformers` integration)

Linked issue that may be closed by this pull request: Support Audio models

2 participants