This repository contains sandbox experiments in audio processing with deep learning.
Right now, it implements the following techniques:
- Classifying recordings of spoken digits with a convolutional neural network applied to their spectrograms
- Modeling an inverse short-time Fourier transform (STFT) by training a 1D transposed convolution layer (see the sketch after this list)
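A minimal sketch of the second idea is given below. It assumes an FFT size of 512, a hop length of 128, and 8 kHz audio, none of which necessarily match this repository's actual settings: the real and imaginary STFT frames are stacked as channels and fed through a single learnable `nn.ConvTranspose1d`, which is trained so that its overlap-add output reconstructs the original waveform.

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (placeholders, not the repository's configuration)
N_FFT = 512
HOP = 128

class LearnedISTFT(nn.Module):
    """Approximate the inverse STFT with one learnable 1D transposed convolution."""

    def __init__(self, n_fft=N_FFT, hop=HOP):
        super().__init__()
        # 2 * n_bins input channels: real and imaginary spectrogram components.
        # The transposed convolution performs the windowed overlap-add.
        self.inverse = nn.ConvTranspose1d(
            in_channels=2 * (n_fft // 2 + 1),
            out_channels=1,
            kernel_size=n_fft,
            stride=hop,
            bias=False,
        )

    def forward(self, spec_real, spec_imag):
        # spec_*: (batch, n_bins, n_frames)
        x = torch.cat([spec_real, spec_imag], dim=1)
        return self.inverse(x)  # (batch, 1, n_samples)

# Training sketch: fit the layer so that ISTFT(STFT(x)) is close to x.
model = LearnedISTFT()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

waveform = torch.randn(8, 1, 8000)  # dummy batch of 1-second, 8 kHz clips
window = torch.hann_window(N_FFT)
spec = torch.stft(waveform.squeeze(1), N_FFT, HOP, window=window, return_complex=True)

for _ in range(100):
    optimizer.zero_grad()
    recon = model(spec.real, spec.imag)
    # Crop to the common length before computing the reconstruction loss.
    n = min(recon.shape[-1], waveform.shape[-1])
    loss = loss_fn(recon[..., :n], waveform[..., :n])
    loss.backward()
    optimizer.step()
```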
Representing audio signals as spectrograms, such as the spoken digit six visualized below, turns them into matrices not unlike the digital pixel images that convolutional neural networks were originally designed for, as the sketch below illustrates.
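The following is a minimal sketch of that classification pipeline, using `torchaudio.transforms.Spectrogram` and a small illustrative CNN; the FFT size, hop length, and layer sizes are placeholders rather than this repository's actual configuration.

```python
import torch
import torch.nn as nn
import torchaudio

# Turn a waveform into a 2D time-frequency "image" (placeholder settings).
spectrogram = torchaudio.transforms.Spectrogram(n_fft=512, hop_length=128)

# Small illustrative CNN classifier for the ten spoken digits (0-9).
classifier = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 10),
)

waveform = torch.randn(4, 1, 8000)   # dummy batch: 4 clips, 1 channel, 8000 samples
spec = spectrogram(waveform)         # (4, 1, freq_bins, time_frames), image-like
spec = torch.log1p(spec)             # compress the dynamic range before the CNN
logits = classifier(spec)            # (4, 10) digit scores
```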
The data used here is based on the AudioMNIST dataset of spoken digits by Sören Becker, as found here:
https://github.com/soerenab/AudioMNIST
Some of the techniques used here are inspired by the work of Peter Bermant and his colleagues at the Earth Species Project, in particular their repository on source separation:
https://github.com/earthspecies/cocktail-party-problem
And by the torch-stft implementation by pseeth at:
https://github.com/pseeth/torch-stft