## Introduction
Recent advances in generative deep learning models provide exciting new tools for music generation. In particular, the conditioning capabilities of diffusion models offer interesting possibilities for adding expressive control to the generation process, which helps make it a more accessible tool.

In this project, we apply the novel diffusion method Iterative $\alpha$-(de)Blending [HBC23](https://arxiv.org/abs/2305.03486), which simplifies the usual stochastic formalism of diffusion models to a deterministic one, in order to generate audio loops from pure noise. We use the EnCodec [DCSA20](https://arxiv.org/abs/2210.13438) high-fidelity neural autoencoder to obtain latent codes of a compressed audio representation. Conditioning is applied using either beats-per-minute information or high-level audio concepts, and reinforced using classifier-free guidance [HS22](https://arxiv.org/abs/2207.12598). The latent codes are then inverted back to a waveform with the decoder.
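
For illustration, a minimal sketch of the deterministic IADB sampling loop is shown below. The `model` callable, its signature, and the number of steps are assumptions made for this example, not the exact code found in `/src`.

```python
import torch

@torch.no_grad()
def iadb_sample(model, x0: torch.Tensor, num_steps: int = 128) -> torch.Tensor:
    """Deterministic Iterative alpha-(de)Blending sampling: start from pure
    noise x0 (alpha = 0) and move step by step toward a data sample (alpha = 1).

    `model(x_alpha, alpha)` is assumed to predict the de-blended direction
    (x1 - x0), following [HBC23]; the real signature in `/src` may differ.
    """
    x = x0
    alphas = torch.linspace(0.0, 1.0, num_steps + 1, device=x0.device)
    for t in range(num_steps):
        a = torch.full((x.shape[0],), alphas[t].item(), device=x.device)
        d = model(x, a)                          # predicted (x1 - x0)
        x = x + (alphas[t + 1] - alphas[t]) * d  # deterministic update toward the data
    return x
```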

Finally, we assess the quality of our method on a large dataset of minimal house and techno music.

## Architecture
The project is currently composed of three independent branches, which are intended to be merged into a single main branch in the future. Each branch corresponds to a different conditioning approach, but they all share a common base:
- We rely on the EnCodec [DCSA20](https://arxiv.org/abs/2210.13438) encoder/decoder, trained on our dataset, so that our model can be trained on a lighter latent representation of our preferred audio dataset.
- We use a UNet architecture for the Iterative $\alpha$-(de)Blending process [HBC23](https://arxiv.org/abs/2305.03486).

![UNet architecture png](resources/figures/UNet.png)

(See the code in `/src` for further details on each module.)
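
To make the training setup concrete, one IADB training step on a batch of EnCodec latents might look like the sketch below. The `unet` signature, the latent shapes, and the optimizer handling are illustrative assumptions, not the project's actual implementation.

```python
import torch
import torch.nn.functional as F

def iadb_training_step(unet, latents: torch.Tensor, optimizer) -> float:
    """One IADB training step on a batch of EnCodec latents.

    The network is trained to predict the de-blended direction (x1 - x0)
    from the blended point x_alpha = (1 - alpha) * x0 + alpha * x1 [HBC23].
    """
    x1 = latents                                 # data: encoded audio loops
    x0 = torch.randn_like(x1)                    # pure Gaussian noise
    alpha = torch.rand(x1.shape[0], device=x1.device)
    a = alpha.view(-1, *([1] * (x1.dim() - 1)))  # broadcast alpha over latent dims
    x_alpha = (1.0 - a) * x0 + a * x1            # blended sample
    pred = unet(x_alpha, alpha)                  # predicted (x1 - x0)
    loss = F.mse_loss(pred, x1 - x0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```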

## Conditioning
Conditioning is applied separately in the corresponding branches, using CLAP [WCZ+23](https://arxiv.org/abs/2211.06687) ([branch link](https://github.com/AlfredPichard/LADMG/tree/clap)) and a beats-per-minute audio descriptor ([branch link](https://github.com/AlfredPichard/LADMG/tree/bpm_conditioning)).
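
As a rough sketch of how the conditioning is reinforced with classifier-free guidance [HS22](https://arxiv.org/abs/2207.12598) at inference time, the guided prediction can be written as below. The conditional model signature, the `null_cond` embedding, and the guidance scale are assumptions for illustration only.

```python
import torch

def guided_direction(model, x_alpha: torch.Tensor, alpha: torch.Tensor,
                     cond: torch.Tensor, null_cond: torch.Tensor,
                     guidance_scale: float = 2.0) -> torch.Tensor:
    """Classifier-free guidance: mix conditional and unconditional predictions.

    `null_cond` stands for a learned "no conditioning" embedding; during
    training, `cond` is replaced by `null_cond` for a random fraction of the
    batch so a single network learns both modes.
    """
    d_cond = model(x_alpha, alpha, cond)         # conditional estimate
    d_uncond = model(x_alpha, alpha, null_cond)  # unconditional estimate
    return d_uncond + guidance_scale * (d_cond - d_uncond)
```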

## Results
### BPM
To get a learnable representation of rhythm, we use beat-tracking information extracted from the metadata of our audio samples during pre-processing. Inference follows a similar process: to obtain a more musically coherent output, we condition the model's input on a constant BPM value, which we transform into time signatures and then into a sawtooth signal. The results are unequivocal, as the influence of the conditioning is clearly audible in the generated audio.
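
As an illustration of this transformation, a constant BPM value can be turned into a sawtooth conditioning signal as sketched below. The frame rate and the [0, 1) range are assumptions, not the exact values used in the project.

```python
import numpy as np

def bpm_to_sawtooth(bpm: float, num_frames: int, frame_rate: float = 75.0) -> np.ndarray:
    """Turn a constant BPM value into a per-frame sawtooth conditioning signal
    that ramps from 0 to 1 over every beat.

    `frame_rate` is the temporal resolution of the conditioning sequence in
    frames per second; 75 Hz is only a placeholder, not the project's value.
    """
    t = np.arange(num_frames) / frame_rate  # time of each frame in seconds
    beat_period = 60.0 / bpm                # seconds per beat
    phase = (t / beat_period) % 1.0         # fractional position within the current beat
    return phase.astype(np.float32)         # sawtooth in [0, 1)

# Example: a 4-second conditioning signal at 122 BPM
saw = bpm_to_sawtooth(bpm=122.0, num_frames=4 * 75)
```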

#### Generated loops without BPM conditioning

<audio src="resources/audios/generated_1_no_bpm.wav" controls title="N0_BPM"></audio>
<audio src="resources/audios/generated_1_no_bpm.wav" title="N0_BPM"></audio>

#### Generated loops with BPM conditioning

- 122 BPM:
<audio src="resources/audios/generated_audio_1_122bpm.wav" controls title="122_BPM_1"></audio>
<audio src="resources/audios/generated_audio_9_122bpm.wav" controls title="122_BPM_2"></audio>
<audio src="resources/audios/generated_audio_1_122bpm.wav" title="122_BPM_1"></audio>
<audio src="resources/audios/generated_audio_9_122bpm.wav" title="122_BPM_2"></audio>

- 125 BPM:
<audio src="resources/audios/generated_audio_3_125bpm.wav" controls title="125_BPM_1"></audio>
<audio src="resources/audios/generated_audio_6_125bpm.wav" controls title="125_BPM_2"></audio>
<audio src="resources/audios/generated_audio_3_125bpm.wav" title="125_BPM_1"></audio>
<audio src="resources/audios/generated_audio_6_125bpm.wav" title="125_BPM_2"></audio>

- 128 BPM:
<audio src="resources/audios/generated_audio_4_128bpm.wav" controls title="128_BPM_1"></audio>
<audio src="resources/audios/generated_audio_5_128bpm.wav" controls title="128_BPM_2"></audio>
<audio src="resources/audios/generated_audio_4_128bpm.wav" title="128_BPM_1"></audio>
<audio src="resources/audios/generated_audio_5_128bpm.wav" title="128_BPM_2"></audio>


### CLAP
When training the model on conditioning with CLAP latent codes, the model learns
## References

- [HBC23](https://arxiv.org/abs/2305.03486) Eric Heitz, Laurent Belcour, and Thomas Chambon. Iterative alpha-(de)blending: a minimalist deterministic diffusion model. arXiv preprint arXiv:2305.03486, 2023.
- [DCSA20](https://arxiv.org/abs/2210.13438) Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
- [HS22](https://arxiv.org/abs/2207.12598) Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- [WCZ+23](https://arxiv.org/abs/2211.06687) Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.
