(fish-speech v1.5) bigger real time factor on short texts #744
I tested without a reference audio and the issue remains.
Yep, same exact thing: ~500 ms latency with very short text on a T4 GPU via the API, and slower than real time (4 s to generate 3 s of audio). Only on longer text is it much faster than real time.
T4 is a very old GPU; it may not work well when doing the prefilling.
Thanks. What made me wonder is that v1.4 was running quite well on the T4.
This is quite interesting. As you can see, we didn't change the model architecture much in this update. One thing I'm considering is that the embedding table is much larger now; it may be causing a memory bottleneck.
Interesting. Thanks for pointing out the difference in embedding table size. I'll find time to validate.
Hi @leng-yue, can you let us know the embedding table sizes for both v1.4 and v1.5? Would a g5.xlarge (with 24 GiB of GPU memory) be sufficient?
Even 6 GB is enough; the vocab size went from 32k to 100k.
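To put the 32k → 100k vocab growth in perspective, here is a rough back-of-the-envelope estimate of the embedding table's memory footprint. The hidden size and dtype below are assumptions for illustration (the thread does not state them), not the actual fish-speech model dimensions:

```python
# Rough estimate of embedding-table growth between v1.4 (32k vocab)
# and v1.5 (100k vocab). hidden_size=1024 and fp16 (2 bytes/param)
# are assumed values, not the real model config.
def embedding_table_bytes(vocab_size: int,
                          hidden_size: int = 1024,
                          bytes_per_param: int = 2) -> int:
    # The embedding weight is a (vocab_size x hidden_size) matrix.
    return vocab_size * hidden_size * bytes_per_param

old = embedding_table_bytes(32_000)    # v1.4 vocab
new = embedding_table_bytes(100_000)   # v1.5 vocab
print(f"v1.4: {old / 1e6:.1f} MB, v1.5: {new / 1e6:.1f} MB "
      f"(+{(new - old) / 1e6:.1f} MB)")
```

Even under these assumptions the table itself only grows by a couple hundred megabytes, which supports the comment above that 6 GB of VRAM is plenty; the latency question would be about bandwidth during prefill rather than capacity.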
Self Checks
Cloud or Self Hosted
Self Hosted (Docker)
Environment Details
Tesla T4
Steps to Reproduce
Upload reference audios.
Client makes a request specifying `reference_id`.
✔️ Expected Behavior
I expect TTS latency similar to fish-speech v1.4, around 500 ms, for non-referenced audio generation from a short text with only a few characters.
❌ Actual Behavior
Short text chunks have a worse real-time factor than longer texts.
In my application log above, `audio_duration_ms` is the length of the audio and `latency_ms` is the TTS duration. The shortest text here had a real-time factor < 1.
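For reference, the real-time factor implied by those two log fields can be computed as below. The convention assumed here (RTF = audio duration / latency, so RTF < 1 means slower than real time) matches how this thread uses the term; the helper name is illustrative, not part of the fish-speech codebase:

```python
# Real-time factor from the two log fields mentioned above.
# Convention assumed in this thread: rtf = audio_duration / latency,
# so rtf < 1 means generation is slower than real time.
def real_time_factor(audio_duration_ms: float, latency_ms: float) -> float:
    return audio_duration_ms / latency_ms

# Matches the numbers reported earlier: 3 s of audio in 4 s of compute.
rtf = real_time_factor(3000, 4000)
print(f"RTF = {rtf:.2f}")  # 0.75 -> slower than real time
```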