Reproduce on Voicebank dataset #2

Open
trfnhle opened this issue Apr 20, 2022 · 5 comments

Comments

@trfnhle

trfnhle commented Apr 20, 2022

First of all, thank you for your great work.
I tried to reproduce the results on the Voicebank dataset with your code but ran into some problems. I ran inference with the 100k checkpoint, but the output is not comparable to your sample files and background noise still remains.

The steps I followed:

  • Preprocessed the Voicebank dataset with the flag se
  • Trained without any modification

Here is my loss figure:
[image: training loss figure]

Could you give me some insight into what I might be doing wrong?

@neillu23
Owner

Dear @l4zyf9x,

We found that there are some issues with different PyTorch and torchaudio versions, and we will try to fix them soon.
Could you try PyTorch 1.8.0 with torchaudio 0.8.0, or PyTorch 1.8.1 with torchaudio 0.8.1?

Here is my training loss figure:
[image: training loss figure]

Thank you!
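
A quick way to confirm that an environment matches one of the version pairs suggested above is sketched below. The helper names `base_version` and `is_suggested_pair` are hypothetical and not part of the repository; the sketch only compares version strings and assumes that local build suffixes such as `+cu111` can be ignored.

```python
# Version pairs suggested in this thread: (PyTorch, torchaudio).
SUGGESTED_PAIRS = {("1.8.0", "0.8.0"), ("1.8.1", "0.8.1")}


def base_version(version: str) -> str:
    """Drop a local build suffix such as '+cu111' from a version string."""
    return version.split("+", 1)[0]


def is_suggested_pair(torch_version: str, torchaudio_version: str) -> bool:
    """Return True if the two versions match one of the suggested pairs."""
    pair = (base_version(torch_version), base_version(torchaudio_version))
    return pair in SUGGESTED_PAIRS


# Typical use, assuming both packages are installed:
#   import torch, torchaudio
#   print(is_suggested_pair(torch.__version__, torchaudio.__version__))
```

If the check returns False, the mismatch may be worth ruling out before looking for other causes of the remaining background noise.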

@trfnhle
Author

trfnhle commented Apr 20, 2022

@neillu23 Thanks for your quick response.
I will try the torch and torchaudio versions you suggested.
By the way, I have a few more questions. I noticed that the sample audio includes both *raw_enhanced.wav and *enhanced.wav. What is the difference between them?
One more thing: with the se_pre flag, the clean audio seems to be used as the condition in the diffusion step. I don't understand the motivation for using clean audio in the diffusion step.

@neillu23
Owner

Hi @l4zyf9x, sorry I missed your last message.
I've replaced torchaudio.load_wav() with the torchaudio.load() function in the new commit. You can try it with the new torch and torchaudio versions.
The *enhanced.wav files are further combined with a noise signal at a ratio of 0.2 to recover high-frequency speech, as described at the end of Sec. 4.1, while the *raw_enhanced.wav files are the results without this combination.
The "se_pre" step was designed for our previous work, DiffuSE. We tried the same initialization for CDiffuSE while writing the paper. Afterwards, we found that the pre-training step was no longer needed in CDiffuSE, since a randomly initialized CDiffuSE performed as well as one initialized from pre-trained parameters.
Please try the new code and let me know if you have any further questions!
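
The combination step that distinguishes *enhanced.wav from *raw_enhanced.wav can be sketched as a weighted sum of two waveforms. This is only an illustration under stated assumptions: the function name `blend_with_noisy` is hypothetical, the second signal is assumed to be the noisy input recording, and the convex form `(1 - ratio) * enhanced + ratio * noisy` is an assumption rather than the exact formula used in the repository. Sec. 4.1 of the paper and the inference code are the authoritative reference.

```python
from typing import List, Sequence


def blend_with_noisy(
    enhanced: Sequence[float],
    noisy: Sequence[float],
    ratio: float = 0.2,
) -> List[float]:
    """Mix the raw enhanced waveform with a second signal.

    Assumed form (not confirmed by the thread):
        output = (1 - ratio) * enhanced + ratio * noisy
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    # Truncate to the shorter signal so samples stay aligned.
    n = min(len(enhanced), len(noisy))
    return [
        (1.0 - ratio) * e + ratio * x
        for e, x in zip(enhanced[:n], noisy[:n])
    ]
```

With `ratio=0.0` the function returns the raw enhanced samples unchanged, which corresponds to the *raw_enhanced.wav case under this assumption.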

@KarsonYu

Hello, have you tried the version the author mentioned? How is the performance?

@Charizard-007

Hello, have you tried the version the author mentioned? How is the performance?
