Feature extractor #1
Hi,
The model is not ready yet. I'm currently training it with the VQ-VAE bottleneck, and it's only at 5k steps so far. I've noticed various collapses and other anomalies during training, so I'm not sure when it will be ready. However, I am working on it every day, so hopefully I will figure out how to train it completely. I'll make the fully trained model available.
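For readers unfamiliar with the failure mode: "collapse" here usually means the encoder ends up selecting only a handful of the VQ codebook entries. A minimal sketch of one way to monitor this (the function, codebook size, and batch shapes are illustrative, not from this repo):

```python
import numpy as np

def codebook_usage(code_indices, num_codes):
    """Fraction of the VQ codebook actually selected in a batch.

    code_indices: integer array of nearest-codebook indices chosen
    by the encoder; num_codes: codebook size K.
    """
    return np.unique(code_indices).size / num_codes

# A healthy batch spreads over many codes; a collapsed one uses only a few.
healthy = np.random.randint(0, 512, size=10_000)
collapsed = np.zeros(10_000, dtype=int)
print(codebook_usage(healthy, 512))    # close to 1.0
print(codebook_usage(collapsed, 512))  # ~0.002
```

Tracking this ratio over training steps makes a collapse visible long before the reconstructions degrade audibly.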
That's a good question. As I understand the paper, each embedding vector represents the phoneme content of a window with a fixed number of timesteps: the receptive field of the encoder. That window could cover part of one phoneme plus the full following phoneme, or just part of a single phoneme, or three and a half phonemes, and so on. No part of the model has any mechanism that would make it possible to recognize the same phoneme at different timescales. According to one of the coauthors of a related paper (Yutian Chen), that is called the "forced alignment problem", and it wasn't clear to me whether anyone knows how to solve it yet.
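The fixed window described above is just the encoder's receptive field, which can be computed directly from the layer shapes. A quick sketch; the six-layer configuration below is a made-up example, not the one from the paper:

```python
def receptive_field(layers):
    """Receptive field and hop size of a stack of 1-D convolutions.

    layers: list of (kernel_size, stride) pairs, input to output.
    Returns (receptive_field_in_samples, samples_per_output_step).
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf, jump

# Hypothetical encoder: six conv layers, kernel 4, stride 2 each.
rf, hop = receptive_field([(4, 2)] * 6)
print(rf, hop)  # 190 64 -- each embedding summarizes 190 input samples,
                # and consecutive embeddings are 64 samples apart
```

Whatever phoneme boundaries fall inside those 190 samples, the model has no way to know about them; it only sees the fixed window.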
Henry
On Tue, May 21, 2019 at 11:08 PM Rasipuram wrote:
> Hi,
> Can you please point me to how to use this repository for audio feature extraction?
> Does this give a fixed representation for a variable-length audio signal?
Thank you for getting back. I work in the area of Human Behavior Analysis. I am exploring options to extract features from ML models. It would be great if you could point me to any such repositories.
Hi Rasipuram,
I am pretty new to this sub-field myself, so unfortunately I don't know many repos other than the WaveNet ones, which aren't designed for feature extraction. I'll keep it in mind if I come across some.
Best,
Henry
On Wed, May 22, 2019 at 1:17 AM Rasipuram wrote:
> Thank you for getting back.
> I am eager to use this repository for my work. It sounds very interesting!
> I work in the area of Human Behavior Analysis. I am exploring options to extract features from ML models. It would be great if you could point me to any such repositories.
Hi here, I am also looking for an auto-encoder for wav files, and I found this project and this issue. As discussed about three years ago, the model was not yet ready. Have you finished it since? ;)

For my part, I have a couple of wav files of a simple word (such as the word "three"), and I want to see how well an auto-encoder can encode the waveform into an embedding space and then reconstruct the waveform. Further, I might do some interpolation in the embedding space to make something interesting happen.

I have also found https://magenta.tensorflow.org/nsynth, which does this job perfectly, but their model is mainly designed for music instead of human voice, and it's hard to fine-tune their models. I think the main idea is very similar (correct me if I am wrong), so I would be very glad to also try out this project to see the reconstruction quality, if I can make it work at all.

Best,
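The interpolation idea mentioned above is just a linear blend in the embedding space; assuming an encoder/decoder pair exists, the blending step itself is trivial. A minimal sketch, where the embeddings and their dimensionality are purely hypothetical:

```python
import numpy as np

def interpolate_embeddings(z_a, z_b, alpha):
    """Linear interpolation between two embeddings; alpha in [0, 1]."""
    return (1.0 - alpha) * z_a + alpha * z_b

# Stand-in embeddings for two utterances of "three" (hypothetical shapes);
# a real workflow would obtain these from the trained encoder.
z_a = np.zeros(64)
z_b = np.ones(64)
z_mid = interpolate_embeddings(z_a, z_b, 0.5)
print(z_mid[:3])  # [0.5 0.5 0.5] -- halfway between the two embeddings
```

The interesting part is then handing `z_mid` to the decoder and listening to what comes out; whether the result sounds like speech at all depends on how smooth the learned embedding space is.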
Hi Zifan, I'm sorry I cannot be more help here, but I never did succeed in training this model. I trained it for 10 days on a TPU (full 8x cores) on Google Colab, and it didn't converge. I then tried training a simpler model without the vector quantization, which just tried to invert the MFCC encoding. That did work a bit, but it was so slow to train that I could only train it to completion on 10% of the LibriSpeech dataset.

I believe the reason this model is so slow to train is that the decoder is autoregressive, and thus sequential, and has to run for so many timesteps. I haven't done much work on this subject since then. I remember seeing this repo, which might be useful as a component of a much faster decoder for MFCC -> wave generation. That could then be used in an autoencoder like the Jan Chorowski one in this repo.
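The point about the autoregressive decoder can be made concrete with back-of-envelope arithmetic: a sample-level decoder needs one sequential forward pass per output sample, and those passes cannot be parallelized across time. The throughput figure below is an assumption for illustration, not a measurement:

```python
SAMPLE_RATE = 16_000   # LibriSpeech audio is recorded at 16 kHz
CLIP_SECONDS = 10

# One decoder forward pass per output sample, strictly in sequence.
sequential_steps = SAMPLE_RATE * CLIP_SECONDS
passes_per_second = 1_000  # assumed, fairly optimistic throughput

print(sequential_steps)                      # 160000 sequential passes
print(sequential_steps / passes_per_second)  # 160.0 s to decode 10 s of audio
```

This is why the suggestion of a faster (non-autoregressive or heavily cached) MFCC-to-wave decoder matters so much for making the autoencoder practical to train.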
Thank you very much for the information ;)
Hi,
Can you please point me to how to use this repository for audio feature extraction?
Does this give a fixed representation for a variable-length audio signal?