Feature extractor #1
Hi,
The model is not ready yet. I'm currently training it with the VQ-VAE bottleneck, and it's only at 5k steps so far. I've noticed various collapses and other anomalies during training, so I'm not sure when it will be ready. However, I am working on it every day, so hopefully I will figure out how to train it completely. I'll make the fully trained model available.
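For readers unfamiliar with the failure mode: "collapse" here usually means the encoder ends up selecting only a handful of the VQ codebook entries. A minimal sketch of one way to monitor this (the function, codebook size, and batch shapes are illustrative, not from this repo):

```python
import numpy as np

def codebook_usage(code_indices, num_codes):
    """Fraction of the VQ codebook actually selected in a batch.

    code_indices: integer array of nearest-codebook indices chosen
    by the encoder; num_codes: codebook size K.
    """
    return np.unique(code_indices).size / num_codes

# A healthy batch spreads over many codes; a collapsed one uses only a few.
healthy = np.random.randint(0, 512, size=10_000)
collapsed = np.zeros(10_000, dtype=int)
print(codebook_usage(healthy, 512))    # close to 1.0
print(codebook_usage(collapsed, 512))  # ~0.002
```

Tracking this ratio over training steps makes a collapse visible long before the reconstructions degrade audibly.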
That's a good question. As I understand the paper, each embedding vector represents the phoneme content of a window with a fixed number of timesteps: the receptive field of the encoder. That window could cover part of one phoneme plus the full following phoneme, or just part of a single phoneme, or three and a half phonemes, and so on. No part of the model has any mechanism that would make it possible to recognize the same phoneme at different timescales. According to one of the coauthors of a related paper (Yutian Chen), that is called the "forced alignment problem", and it wasn't clear to me whether anyone knows how to solve it yet.
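The fixed window described above is just the encoder's receptive field, which can be computed directly from the layer shapes. A quick sketch; the six-layer configuration below is a made-up example, not the one from the paper:

```python
def receptive_field(layers):
    """Receptive field and hop size of a stack of 1-D convolutions.

    layers: list of (kernel_size, stride) pairs, input to output.
    Returns (receptive_field_in_samples, samples_per_output_step).
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf, jump

# Hypothetical encoder: six conv layers, kernel 4, stride 2 each.
rf, hop = receptive_field([(4, 2)] * 6)
print(rf, hop)  # 190 64 -- each embedding summarizes 190 input samples,
                # and consecutive embeddings are 64 samples apart
```

Whatever phoneme boundaries fall inside those 190 samples, the model has no way to know about them; it only sees the fixed window.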
Henry
On Tue, May 21, 2019 at 11:08 PM Rasipuram wrote:
> Hi,
> Can you please point me to how to use this repository for audio feature extraction?
> Does this give a fixed representation for a variable-length audio signal?
Thank you for getting back. I work in the area of Human Behavior Analysis. I am exploring options to extract features from ML models. It would be great if you could point me to any such repositories.
Hi Rasipuram,
I am pretty new to this sub-field myself, so unfortunately I don't know many repos other than the WaveNet ones, which aren't designed for feature extraction. I'll keep it in mind if I come across some.
Best,
Henry
On Wed, May 22, 2019 at 1:17 AM Rasipuram wrote:
> Thank you for getting back.
> I am eager to use this repository for my work. It sounds very interesting!
> I work in the area of Human Behavior Analysis. I am exploring options to extract features from ML models. It would be great if you could point me to any such repositories.
Hi here, I am also looking for an auto-encoder for wav files, and I found this project and this issue. As discussed about three years ago, the model was not yet ready. Have you finished it since? ;)

For my part, I have a couple of wav files of a simple word (such as the word "three"), and I want to see how well an auto-encoder can encode the waveform into an embedding space and then reconstruct the waveform. Further, I might do some interpolation in the embedding space to make something interesting happen.

I have also found https://magenta.tensorflow.org/nsynth, which does this job perfectly, but their model is mainly designed for music instead of human voice, and it's hard to fine-tune their models. I think the main idea is very similar (correct me if I am wrong), so I would be very glad to also try out this project to see the reconstruction quality, if I can make it work at all.

Best,
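The interpolation idea mentioned above is just a linear blend in the embedding space; assuming an encoder/decoder pair exists, the blending step itself is trivial. A minimal sketch, where the embeddings and their dimensionality are purely hypothetical:

```python
import numpy as np

def interpolate_embeddings(z_a, z_b, alpha):
    """Linear interpolation between two embeddings; alpha in [0, 1]."""
    return (1.0 - alpha) * z_a + alpha * z_b

# Stand-in embeddings for two utterances of "three" (hypothetical shapes);
# a real workflow would obtain these from the trained encoder.
z_a = np.zeros(64)
z_b = np.ones(64)
z_mid = interpolate_embeddings(z_a, z_b, 0.5)
print(z_mid[:3])  # [0.5 0.5 0.5] -- halfway between the two embeddings
```

The interesting part is then handing `z_mid` to the decoder and listening to what comes out; whether the result sounds like speech at all depends on how smooth the learned embedding space is.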
Hi Zifan, I'm sorry I cannot be more help here, but I never did succeed in training this model. I trained it for 10 days on a TPU (full 8x cores) on Google Colab, and it didn't converge. I then tried training a simpler model without the vector quantization, which just tried to invert the MFCC encoding. That did work a bit, but it was so slow to train that I could only train it to completion on 10% of the LibriSpeech dataset.

I believe the reason this model is so slow to train is that the decoder is autoregressive, and thus sequential, and has to run for so many timesteps. I haven't done much work on this subject since then. I remember seeing this repo, which might be useful as a component of a much faster decoder for MFCC -> wave generation. That could then be used in an autoencoder like the Jan Chorowski one in this repo.
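The point about the autoregressive decoder can be made concrete with back-of-envelope arithmetic: a sample-level decoder needs one sequential forward pass per output sample, and those passes cannot be parallelized across time. The throughput figure below is an assumption for illustration, not a measurement:

```python
SAMPLE_RATE = 16_000   # LibriSpeech audio is recorded at 16 kHz
CLIP_SECONDS = 10

# One decoder forward pass per output sample, strictly in sequence.
sequential_steps = SAMPLE_RATE * CLIP_SECONDS
passes_per_second = 1_000  # assumed, fairly optimistic throughput

print(sequential_steps)                      # 160000 sequential passes
print(sequential_steps / passes_per_second)  # 160.0 s to decode 10 s of audio
```

This is why the suggestion of a faster (non-autoregressive or heavily cached) MFCC-to-wave decoder matters so much for making the autoencoder practical to train.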
Thank you very much for the information ;)
Hi,
Can you please point me to how to use this repository for audio feature extraction?
Does this give a fixed representation for a variable-length audio signal?