🎸 Lumen Data Science 2023 – Audio Classification (2nd place)

🏆 Fast and Fourier team

Vinko Dragušica

Filip Mirković

Ivan Rep

Matej Ciglenečki


Python Virtual Environment

Create and populate the virtual environment. Simply put, the virtual environment allows you to install Python packages for this project only (which you can easily delete later). This way, we won't clutter your global Python packages.

Step 1: Execute the following commands:

python3 -m venv venv
source venv/bin/activate
sleep 1
pip install -r requirements.txt
pip install -r requirements-dev.txt

Step 2: Install the current directory as an editable Python module:

pip install -e .

(optional) Step 3: Activate pre-commit hook

pre-commit install

Pre-commit, configured in .pre-commit-config.yaml, will fix your imports and make sure the code follows Python standards.

To remove pre-commit run: rm -rf .git/hooks

📁 Directory structure

Directory    Description
data         datasets
docs         documentation
figures      figures
models       model checkpoints, model metadata, training reports
references   research papers and competition guidelines
src          python source code


📋 Notes

General links:

IRMAS dataset issues

Use cleanlab to find bad labels:

Train and validation dataset: move some validation examples to train

Do this without introducing data leakage, but make sure that we still have enough validation data.

Resizing and chunking

Chunking should happen only in inference in the following way:

  • preprocess 20sec audio normally, send the spectrogram to the model and chunk the spectrogram inside of the predict_step.
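The inference-time chunking described above can be sketched as follows. This is a minimal illustration, not the project's actual predict_step: the spectrogram is assumed to be a list of time frames, `model` is a hypothetical callable that returns one score per instrument, the last chunk is assumed to be zero-padded, and taking the maximum over chunks is an assumed aggregation rule.

```python
from typing import Callable, List

Spectrogram = List[List[float]]  # assumed layout: (time_frames, mel_bins)


def chunk_spectrogram(spec: Spectrogram, chunk_frames: int) -> List[Spectrogram]:
    """Split a spectrogram along the time axis; zero-pad the last chunk."""
    n_bins = len(spec[0])
    chunks = []
    for start in range(0, len(spec), chunk_frames):
        chunk = [row[:] for row in spec[start:start + chunk_frames]]
        while len(chunk) < chunk_frames:
            chunk.append([0.0] * n_bins)  # padding frame (assumption)
        chunks.append(chunk)
    return chunks


def predict_step(
    spec: Spectrogram,
    model: Callable[[Spectrogram], List[float]],
    chunk_frames: int,
) -> List[float]:
    """Score every chunk, then keep the highest score per instrument."""
    per_chunk = [model(chunk) for chunk in chunk_spectrogram(spec, chunk_frames)]
    # An instrument is treated as present if any chunk detects it.
    return [max(scores) for scores in zip(*per_chunk)]
```

Max aggregation matches the reasoning in this section: an instrument labelled for the whole clip may be audible in only some of the chunks.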

We don't do chunking in the train step because we can't chunk the label.

The time window of the spectrogram is defined by the maximum audio length of some train sample. If we chunk that sample, we don't know if the label will appear in every one of those chunks.


Add a low-dimensional (t-SNE) plot of features to check clusters. How to do that:

  • forward pass every example
  • now you have an embedding for each example
  • apply t-SNE to the embeddings


Masked Autoencoders (MAE)

There is a pretraining script, but does it work? It is written as a plain nn.Module.

Pretraining on CNNs:

Adapter transformer training

Instead of training the transformer backbone, add layers in between the backbone and train those layers. Those layers are called adapters.


Normalization of the audio in the time domain (amplitude). Librosa already does this?

Spectrogram normalization, same as normalization in any image problem: precalculate the mean and std and use them in the preprocessing step.
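A minimal sketch of that normalization, assuming spectrograms are plain nested lists and that a single global mean and std are computed over the whole training set (per-frequency-bin statistics would be an equally valid choice). The function names are illustrative, not taken from the project code.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple

Spectrogram = List[List[float]]


def dataset_mean_std(spectrograms: Sequence[Spectrogram]) -> Tuple[float, float]:
    """Compute one global mean and standard deviation over all values."""
    values = [v for spec in spectrograms for row in spec for v in row]
    return fmean(values), pstdev(values)


def normalize(spec: Spectrogram, mean: float, std: float, eps: float = 1e-8) -> Spectrogram:
    """Standardize a spectrogram with precalculated statistics."""
    return [[(v - mean) / (std + eps) for v in row] for row in spec]
```

The statistics should come from the training split only and then be reused unchanged for validation and inference.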

🎵 Datasets

IRMAS dataset

  • IRMAS Test dataset only contains the information about presence of the instruments. Drums and music genre information is not present.
  • examples: 6705
  • instruments: 11
  • duration: 3sec

NSynth: Neural Audio Synthesis

  • examples: 305 979
  • instruments: 1006
  • A novel WaveNet-style autoencoder model that learns codes that meaningfully represent the space of instrument sounds.


  • examples: 330
  • instruments: 11
  • duration: song


  • examples: 122
  • instruments: 80


Distance between classes

How to construct triplets: Softmax loss and center loss:

Some instruments are similar and their class should be (somehow) close together.

Standard classification loss + (alpha * distance between two classes)

  1. the distance is probably computed from embeddings of some pretrained audio model (audio transformer)

Triplet loss: how do we form triplets?

  1. real: guitar
  2. positive: guitar
  3. negative: not guitar?
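One possible answer to the question above, sketched under assumptions: the anchor ("real") and the positive share a label, the negative is any example with a different label, and the standard triplet margin loss is used. Embeddings are plain lists of floats here; in practice they would come from some pretrained audio model. The function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def sample_triplet(labels: List[str], rng: random.Random) -> Tuple[int, int, int]:
    """Return indices (anchor, positive, negative) chosen by label."""
    anchor = rng.randrange(len(labels))
    positives = [i for i, l in enumerate(labels) if l == labels[anchor] and i != anchor]
    negatives = [i for i, l in enumerate(labels) if l != labels[anchor]]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative for the anchor")
    return anchor, rng.choice(positives), rng.choice(negatives)
```

Sampling negatives uniformly is the simplest option. Preferring negatives from similar instruments would tie in with the class-distance idea, but that is not shown here.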

Audio that is not an instrument

Research audio files which are NOT instruments. Both background noises and sounds SIMILAR to instruments! Download the datasets and write a dataset loader for them (@matej). Label everything [0, ..., 0]

💡⚙️ Models and training

Problem: how to encode additional features (drums/no drums, music genre)? We can't create a spectrogram out of those arrays. Maybe simply append one-hot encoded values after the spectrogram becomes a 1D linear vector?
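The "append one-hot values" idea could look roughly like this. It is a sketch only: `features` stands for the flattened backbone output, and the genre names are illustrative placeholders rather than a confirmed list.

```python
from typing import List, Sequence

# Illustrative genre vocabulary (assumption, not taken from the project code).
GENRES = ["cou_fol", "cla", "pop_roc", "lat_sou", "jaz_blu"]


def one_hot(index: int, num_classes: int) -> List[float]:
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


def append_metadata(
    features: Sequence[float],
    has_drums: bool,
    genre: str,
    genres: Sequence[str] = GENRES,
) -> List[float]:
    """Concatenate flattened features, a drums flag and a one-hot genre."""
    drums_flag = [1.0 if has_drums else 0.0]
    return list(features) + drums_flag + one_hot(list(genres).index(genre), len(genres))
```

The classifier head would then take `len(features) + 1 + len(genres)` inputs.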


Current state-of-the-art model for audio classification on multiple datasets and multiple metrics.

paper: github:


AST max duration is 10.23 sec for 16 kHz audio


  • They used 16kHz audio for the pretrained model, so if you want to use the pretrained model, please prepare your data in 16kHz

Idea: introduce multiple MLP (fully connected layer) heads. Each head will detect a single instrument instead of trying to detect all instruments at once.

Idea: train on single wav, then later introduce irmas_combinatorics dataset which contains multiple wav

LSTM and Melspectrograms (Mirko)

Trained an LSTM (with and without Bahdanau attention) on melspectrogram and MFCC features, for single and multiple instrument classification. Adding instruments according to genre and randomly was also explored. This approach retains high accuracy due to the class imbalance of the train and validation set; however, the F1 metric, with macro averaging in the multi-instrument case, remains low in the 0.26 - 0.35 interval. All instruments with higher F1 metrics use Bahdanau attention.

LSTM and Wavelet (Mirko)

Aside from sliding wavelet filters, the output of the wavelet transform needs to be log-scaled or, preferably, transformed with amplitude_to_db. This does not seem to improve or degrade the performance of the LSTM model with attention, and the F1 score remains in similar margins. Still doing some research on wavelets, April 3rd...

Adding instruments (Mirko :( )

Adding instrument waveforms to imitate the examples with multiple instruments needs to be handled with greater care; otherwise it only improves the F1 metric slightly (LSTM) or even lowers it (Wav2Vec2 backbone). A bug was present that I did not catch before. I'm redoing the experiments.


The idea was to implement a pretrained feature extractor with multiple FCNN (but not necessarily FCNN) heads that serve as disconnected binary instrument classifiers. E.g. we want to classify 5 instruments, hence we use a backbone with 5 FCNNs, and each FCNN searches for its "own" instrument among the 5.

Fluffy with Wav2Vec2 feature extractor backbone

As already mentioned, we used only the feature extractor of the pretrained Wav2Vec2 model and completely discarded the transformer component for efficiency. Up to this point, training has run for ~35 epochs, and while the average validation F1 metric remains in the 0.5-0.6 region, it varies significantly across instruments. For most instruments the F1 score remains in the 0.6-0.7 range, with numerous outliers. On the high end, we have the acoustic guitar and the human voice with F1 above 0.8. This is to be expected, considering the backbone was trained on many instances of human voices. On the low end, we have the organ with F1 of ~0.2 and, most likely due to bugs in the code, the electric guitar with F1 of 0. This could also be attributed to its similarity to other instruments such as the violin or acoustic guitar. This leaves us with a "death rattle" of sorts for this whole "let's use only IRMAS" idea. The idea is to pretrain a feature extractor based on contrastive loss; margins within genres and instrument families should also be applied. If this doesn't produce better results, the only solution I propose is getting more data, e.g. OpenMIC.

Fluffy with entire Wav2Vec2

This model has been trained for far fewer epochs (~7), and so far it exhibits the same issues as Fluffy with just the feature extractor. Perhaps more training would be needed; however, such large models require considerable memory, and their use at inference time might be limited.

Parallel Mobilenets

  • create 4 Mobilenets which cover 11 instruments
  • forward pass to get features
  • create 4 FC (each FC has 3 instruments)
  • concat predictions
  • create 4 Mobilenets which cover 11 instruments
  • forward pass to get features
  • concat all features


Introduce an SVM and train it additionally on high-level features of the spectrogram (MFCC). For example, one can calculate the entropy of an audio signal or spectrogram for a given timeframe (@vinko)

If you have audio of 3 sec, calculate ~30 entropies, one every 0.1 sec, and use those entropies as SVM features. Also try using a lot more librosa features.
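A minimal sketch of the per-frame entropy features, assuming "entropy" means the Shannon entropy of an amplitude histogram within each 0.1 sec frame (spectral entropy would be another reasonable reading). The histogram bin count and the function names are assumptions.

```python
import math
from typing import List, Sequence


def shannon_entropy(frame: Sequence[float], n_bins: int = 16) -> float:
    """Shannon entropy (in bits) of the amplitude histogram of one frame."""
    lo, hi = min(frame), max(frame)
    if hi == lo:
        return 0.0  # a constant frame carries no information
    counts = [0] * n_bins
    for x in frame:
        idx = min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[idx] += 1
    total = len(frame)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def entropy_features(signal: Sequence[float], sample_rate: int, frame_sec: float = 0.1) -> List[float]:
    """One entropy value per non-overlapping frame of `frame_sec` seconds."""
    frame_len = round(sample_rate * frame_sec)
    n_frames = len(signal) // frame_len
    return [
        shannon_entropy(signal[i * frame_len:(i + 1) * frame_len])
        for i in range(n_frames)
    ]
```

For a 3 sec clip this yields 30 values, which can be concatenated with other hand-crafted features before being passed to the SVM.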

➕ Ensemble

The ensemble should combine the features of some backbone with Vinko's SVM.

Audio knowledge

Harmonic and Percussive Sounds

Loosely speaking, a harmonic sound is what we perceive as pitched sound, what makes us hear melodies and chords. The prototype of a harmonic sound is the acoustic realization of a sinusoid, which corresponds to a horizontal line in a spectrogram representation. The sound of a violin is another typical example of what we consider a harmonic sound. Again, most of the observed structures in the spectrogram are of horizontal nature (even though they are intermingled with noise-like components). On the other hand, a percussive sound is what we perceive as a clash, a knock, a clap, or a click. The sound of a drum stroke or a transient that occurs in the attack phase of a musical tone are further typical examples. The prototype of a percussive sound is the acoustic realization of an impulse, which corresponds to a vertical line in a spectrogram representation.

🔊 Feature extraction


note: in practice, Mel spectrograms are used instead of classical spectrograms. We have to normalize spectrogram images just like any other image dataset (mean/std).

Take an audio sequence and perform an STFT (short-time Fourier transform) to get spectra over multiple time intervals. The result describes three quantities (time, frequency, amplitude) and is stored as a 2D matrix of amplitudes indexed by time frame and frequency bin. The STFT window size in samples is the window duration multiplied by the sampling frequency. The STFT is also defined by a window type.
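To make the STFT description concrete, here is a deliberately naive pure-Python sketch: a Hann window slides over the signal and a direct DFT is computed for every frame. It is far too slow for real audio and is not how librosa or torchaudio implement it; it only illustrates how the window size (`n_fft`) and hop length shape the output.

```python
import cmath
import math
from typing import List, Sequence


def stft_magnitude(signal: Sequence[float], n_fft: int, hop_length: int) -> List[List[float]]:
    """Return magnitudes with shape (n_frames, n_fft // 2 + 1)."""
    # Periodic Hann window; the window type is a free choice.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop_length):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        spectrum = []
        for k in range(n_fft // 2 + 1):  # keep non-negative frequencies only
            coeff = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft)
            )
            spectrum.append(abs(coeff))
        frames.append(spectrum)
    return frames
```

Each row is the spectrum of one time frame, and bin `k` corresponds to the frequency `k * sample_rate / n_fft`.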

Mel-Frequency Cepstral Coefficients (MFCC)

Spectrogram of Mel Spectrogram:

🥴 Augmentations

Audio augmentations

  • white noise
  • time shift
  • amplitude change / normalization
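The three waveform augmentations listed above can be sketched in a few lines of plain Python. The details are assumptions: the time shift is circular, the noise is Gaussian, and the amplitude change is peak normalization. The project itself may rely on library implementations instead.

```python
import random
from typing import List, Sequence


def add_white_noise(signal: Sequence[float], noise_std: float, rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian noise to every sample."""
    return [x + rng.gauss(0.0, noise_std) for x in signal]


def time_shift(signal: Sequence[float], shift: int) -> List[float]:
    """Circularly shift the signal to the right by `shift` samples."""
    samples = list(signal)
    if not samples:
        return samples
    shift %= len(samples)
    return samples[-shift:] + samples[:-shift] if shift else samples


def peak_normalize(signal: Sequence[float], peak: float = 1.0) -> List[float]:
    """Rescale so that the largest absolute amplitude equals `peak`."""
    max_abs = max(abs(x) for x in signal)
    if max_abs == 0:
        return list(signal)
    return [x * peak / max_abs for x in signal]
```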
PyTorch Sox effects

allpass, band, bandpass, bandreject, bass, bend, biquad, chorus, channels, compand, contrast, dcshift, deemph, delay, dither, divide, downsample, earwax, echo, echos, equalizer, fade, fir, firfit, flanger, gain, highpass, hilbert, loudness, lowpass, mcompand, norm, oops, overdrive, pad, phaser, pitch, rate, remix, repeat, reverb, reverse, riaa, silence, sinc, speed, stat, stats, stretch, swap, synth, tempo, treble, tremolo, trim, upsample, vad, vol

Spectrum augmentations

SpecAugment: SpecAugment PyTorch: SpecAugment torchaudio:

🔀 Data generation

Naive: concat multiple audio sequences into one and merge their labels. Introduce some overlapping, but not too much!

Use the same genre for data generation: combine sounds which come from the same genre instead of different genres

How to sample?

  • sample between 3 and 5 audio files, but don't use more than 4 instruments
  • sample different starting positions at which the audio will start playing
    • START-----x---x----------x--------x----------END
  • cutoff the audio sequence at max length?
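The sampling scheme above can be sketched as follows. Everything here is illustrative: examples are (waveform, instrument set) pairs, overlapping audio is mixed by summing samples, files that would push the mix past the instrument limit are skipped, and anything beyond `max_len` samples is cut off.

```python
import random
from typing import List, Sequence, Set, Tuple

Example = Tuple[Sequence[float], Set[str]]


def mix_examples(
    examples: Sequence[Example],
    max_len: int,
    rng: random.Random,
    min_files: int = 3,
    max_files: int = 5,
    max_instruments: int = 4,
) -> Tuple[List[float], Set[str]]:
    """Mix several clips at random start offsets and merge their labels."""
    n_files = rng.randint(min_files, max_files)
    mix = [0.0] * max_len
    labels: Set[str] = set()
    used = 0
    for waveform, instruments in rng.sample(list(examples), len(examples)):
        if used == n_files:
            break
        if len(labels | instruments) > max_instruments:
            continue  # this clip would exceed the instrument limit
        start = rng.randrange(max_len)  # random starting position
        for i, x in enumerate(waveform[: max_len - start]):
            mix[start + i] += x  # cut off at max_len
        labels |= instruments
        used += 1
    return mix, labels
```

Restricting `examples` to clips from a single genre before calling the function would implement the same-genre variant mentioned above.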

Torch, Librosa and Kaldi

Librosa and Torch give the same array (sound) if both are normalized and converted to mono.

Librosa gives the same array whether you load with sr=None and then resample, or resample on load.

For best results with AST feature extraction, use torchaudio.load with resampling.


window_shift = int(sample_frequency * frame_shift * 0.001)
window_size = int(sample_frequency * frame_length * 0.001)

Librosa hop_length in samples for a 20 ms hop at 44100 Hz: 1/(1 / 44100 * 1000) * 20 = 882

with a 25ms Hamming window every 10ms (hop)

nfft = 1/(1 / 44100 * 1000) * 25 = 1102 (1102.5 truncated to an integer)
hop = 1/(1 / 44100 * 1000) * 10 = 441
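The millisecond-to-sample conversions above can be checked with a small helper. The function name is hypothetical; it uses integer arithmetic so that the result is truncated in the same way as the `int(...)` expressions for window_shift and window_size.

```python
def ms_to_samples(sample_rate: int, duration_ms: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return sample_rate * duration_ms // 1000


# 25 ms window and 10 ms hop at 44.1 kHz, as in the numbers above.
n_fft = ms_to_samples(44100, 25)       # 1102
hop_length = ms_to_samples(44100, 10)  # 441
```

At 16 kHz, the rate used by the pretrained AST model, the same 25 ms window and 10 ms hop correspond to 400 and 160 samples.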