Add support for Transformers audio models #1452
Conversation
""" | ||
prompt = { | ||
"text": text, | ||
"audios": audio_from_url(URL) |
"audios": audio_from_url(URL) | |
"audio_files": audio_from_url(URL) |
Just a minor naming issue
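For readers of the docstring example above: `audio_from_url` is referenced but not shown here. The sketch below is a guess at the shape of such a helper, not the PR's actual code; the `soundfile` dependency is an assumption.

```python
import io
import urllib.request

import numpy as np
import soundfile as sf


def audio_from_url(url: str) -> np.ndarray:
    """Hypothetical helper: download an audio file and decode it
    to a mono float32 waveform, as the docstring example assumes."""
    with urllib.request.urlopen(url) as response:
        data, _sample_rate = sf.read(io.BytesIO(response.read()), dtype="float32")
    # Collapse to mono if the file is multi-channel.
    if data.ndim > 1:
        data = data.mean(axis=1)
    return data
```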
```python
{
    "text": Union[str, List[str]],
    "images": Optional[Union[Any, List[Any]]],
    "audios": Optional[Union[Any, List[Any]]],
```
"audios": Optional[Union[Any, List[Any]]], | |
"audio_files": Optional[Union[Any, List[Any]]], |
I used `audios` because it's the keyword used by the Qwen2Audio processor (for the same reason I changed `prompts` into `text`). I think letting users choose whatever keyword their processor uses (apart from `text`, which is shared by all) may be the best solution. We would still use `inputs = self.processor(**model_input, padding=True, return_tensors="pt")`. Another advantage is that it lets users provide optional kwargs for the processor call.
If those are the names of the arguments in `transformers`, so we can do `inputs = self.processor(**model_input, padding=True, return_tensors="pt")`, let's keep them this way.
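To make the agreed-upon behavior concrete, here is a minimal sketch of forwarding the prompt dict to the processor. Only the `processor(**model_input, padding=True, return_tensors="pt")` call reflects the PR; the dummy waveform and prompt string are illustrative stand-ins.

```python
import numpy as np
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")

# The prompt dict's keys match the processor's keyword arguments
# (`text` everywhere, `audios` for Qwen2Audio), so it can be
# unpacked directly; any extra keys act as optional processor kwargs.
model_input = {
    "text": "<|audio_bos|><|AUDIO|><|audio_eos|>What can you hear?",
    "audios": [np.zeros(16000, dtype=np.float32)],  # 1 s of silence as a stand-in waveform
}
inputs = processor(**model_input, padding=True, return_tensors="pt")
```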
Addresses issue #1270

This PR turns the class `TransformersVision` into `TransformersMultiModal`, which accommodates both vision and audio models. We should be able to use this same class for video models with minimal changes, but I did not find a good model to test it out.

Something I'm unsure about concerns the keys of the dict provided as a prompt when calling a `TransformersMultiModal` model: I have the impression that they are standardized among `transformers` processors (`text` in all cases, `images` for vision models and `audios` for audio models), but I'm not sure of it. If it turns out they are not, we could remove the check on the keys of the dict and give responsibility to the user for providing what their processor needs (a sketch of such a check follows below).

I did not add model tests for audio models, as the model I found (Qwen/Qwen2-Audio-7B-Instruct) is too heavy to run in the CI (and on some people's devices).
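For illustration, a hypothetical version of the key check discussed above; the function name, allowed-key set, and error messages are assumptions, not the PR's code.

```python
# Hypothetical sketch of the prompt-key check discussed above.
ALLOWED_KEYS = {"text", "images", "audios"}


def validate_model_input(model_input: dict) -> None:
    if "text" not in model_input:
        raise ValueError("A multimodal prompt must contain a 'text' key.")
    unknown = set(model_input) - ALLOWED_KEYS
    if unknown:
        raise ValueError(
            f"Unexpected prompt keys {sorted(unknown)}; "
            f"expected a subset of {sorted(ALLOWED_KEYS)}."
        )
```

Dropping the check would amount to deleting the `unknown` branch and trusting the processor to reject keyword arguments it does not understand.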