Lossless BS-RoFormer

Install

You can download the pre-trained model from ZFTurbo/Music-Source-Separation-Training and put it in the bs_roformer folder. It will be downloaded automatically if it is not found and no model is specified.

$ python3 -m pip install Lossless-BS-RoFormer

Run

On a single file with the default arguments:

$ python3 -m bs_roformer input.wav

Or on a folder with multiple files with custom arguments:

$ python3 -m bs_roformer --model_type bs_roformer --start_check_point bs_roformer/model_bs_roformer_ep_17_sdr_9.6568.ckpt --config_path bs_roformer/config_bs_roformer_384_8_2_485100.yaml --input_folder input --output_folder output

Lossless mode

The --lossless flag enables perfect reconstruction of the original mix by distributing any residual content back into the stems. This ensures that when all stems are summed together, they exactly match the input mix.

By default, music source separation can be slightly lossy: the sum of the separated stems may not perfectly equal the original mix due to model limitations. The lossless mode addresses this by:

  1. Calculating the residual (difference between original mix and sum of stems)
  2. Running the model again on this residual to classify its content
  3. Using a hybrid approach to distribute the residual:
    • Clear drum/percussion content goes to the drums stem
    • Clear musical/harmonic content goes to the other stem
    • Ambiguous content is distributed proportionally based on the energy ratio

This results in:

  • Perfect reconstruction when stems are summed
  • Musically appropriate distribution of residual content
  • Minimal impact on separation quality of primary elements
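The residual redistribution described above can be sketched in a few lines of NumPy. This is a simplified illustration, not this package's implementation: it covers only the proportional, energy-based branch, because the criteria for "clear" drum or harmonic content are not specified here. The function name `distribute_residual` and the optional `residual_stems` argument (standing in for the model's second pass over the residual) are hypothetical.

```python
import numpy as np

def distribute_residual(mix, stems, residual_stems=None, eps=1e-12):
    """Add the residual back into the stems so that they sum to the mix.

    mix:            array of shape (channels, samples)
    stems:          dict of stem name -> array with the same shape as mix
    residual_stems: optional dict with the same keys, e.g. the output of a
                    second separation pass over the residual (hypothetical)
    """
    # Step 1: the residual is whatever the stems fail to account for.
    residual = mix - sum(stems.values())

    # Steps 2-3 (simplified): weight each stem by its share of the energy,
    # measured on the second-pass estimates when they are available.
    ref = residual_stems if residual_stems is not None else stems
    energy = {name: float(np.sum(ref[name] ** 2)) for name in stems}
    total = sum(energy.values())
    if total < eps:
        weights = {name: 1.0 / len(stems) for name in stems}
    else:
        weights = {name: e / total for name, e in energy.items()}

    # The weights sum to 1, so the corrected stems sum back to the mix
    # (up to floating-point rounding).
    return {name: stems[name] + weights[name] * residual for name in stems}

# Toy example: two imperfect stems that do not add up to the mix.
rng = np.random.default_rng(0)
mix = rng.standard_normal((2, 1000))
stems = {"drums": 0.4 * mix, "other": 0.5 * mix}
fixed = distribute_residual(mix, stems)
```

Because the weights always sum to one, the reconstruction property holds regardless of how the weights are chosen. A hard assignment of the residual to a single stem is the special case where one weight is 1 and the rest are 0.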

Usage

To use lossless mode, simply add the --lossless flag to your command:

$ python3 -m bs_roformer --lossless input.wav

Note that lossless mode runs the model twice, so it takes roughly twice as long as the default mode. You can disable lossless mode by using the --disable-lossless flag.


BS-RoFormer

Implementation of Band Split Roformer, a state-of-the-art attention network for music source separation from ByteDance AI Labs. They beat the previous first place by a large margin. The technique uses axial attention across frequency (hence multi-band) and time. Their experiments also show that rotary positional encoding leads to a large improvement over learned absolute positions.
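To illustrate why rotary positional encoding differs from learned absolute positions, here is a minimal NumPy sketch of the rotation itself. It is not the code used in this repository: the function name `apply_rotary` is hypothetical, the base of 10000 follows the RoFormer paper, and the interleaved pairing of feature dimensions is one of several equivalent layouts.

```python
import numpy as np

def apply_rotary(x, base=10000.0):
    """Rotate each consecutive pair of features by a position-dependent angle.

    x: array of shape (..., positions, dim) with an even dim.
    """
    x = np.asarray(x, dtype=float)
    n, d = x.shape[-2:]
    # One rotation frequency per feature pair.
    inv_freq = base ** (-np.arange(0, d, 2) / d)          # (d/2,)
    angles = np.arange(n)[:, None] * inv_freq[None, :]    # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Use the same query and key vector at every position, so that any
# variation in the scores comes from position alone.
rng = np.random.default_rng(0)
q = np.tile(rng.standard_normal(8), (6, 1))   # (positions, dim)
k = np.tile(rng.standard_normal(8), (6, 1))
scores = apply_rotary(q) @ apply_rotary(k).T  # (positions, positions)
```

In this toy setup every entry of `scores` with the same offset between query and key position is equal, which shows that the attention score depends on relative rather than absolute position.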

It also includes support for stereo training and outputting multiple stems.
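The axial attention pattern mentioned above can be sketched as follows. This is a rough NumPy illustration, not the model in this repository: it omits learned projections, multiple heads, normalization and feedforward layers, and the names `self_attention` and `axial_attention` are hypothetical. It only shows how a `(batch, time, bands, dim)` tensor is attended along one axis at a time.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    x: array of shape (..., n, dim). Queries, keys and values are all x.
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def axial_attention(x):
    """x: array of shape (batch, time, bands, dim)."""
    # Attend across time: move bands forward so each band is handled
    # independently, like an extra batch axis.
    x = np.swapaxes(x, 1, 2)          # (batch, bands, time, dim)
    x = x + self_attention(x)         # residual connection
    x = np.swapaxes(x, 1, 2)          # (batch, time, bands, dim)

    # Attend across bands: each time step is handled independently.
    x = x + self_attention(x)
    return x

rng = np.random.default_rng(0)
features = rng.standard_normal((2, 5, 3, 4))  # batch, time, bands, dim
mixed = axial_attention(features)
```

Attending along one axis at a time keeps each attention matrix at size time x time or bands x bands, instead of (time * bands) x (time * bands) for full attention over the flattened grid.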

Please join us on Discord if you are interested in replicating a SOTA music source separator out in the open

Update: This paper has been replicated by Roman and the weights open sourced here

Update 2: Used for this Katy Perry remix!

Update 3: Kimberley Jensen has open sourced a MelBand Roformer trained on vocals here!

Appreciation

  • StabilityAI and 🤗 Huggingface for the generous sponsorship, as well as my other sponsors, for affording me the independence to open source artificial intelligence.

  • Roee and Fabian-Robert for sharing their audio expertise and fixing audio hyperparameters

  • @chenht2010 and Roman for working out the default band splitting hyperparameter!

  • Max Prod for reporting a big bug with Mel-Band Roformer with stereo training!

  • Roman for successfully training the model and open sourcing his training code and weights at this repository!

  • Christopher for fixing an issue with multiple stems in Mel-Band Roformer

  • Iver Jordal for identifying that the default stft window function is not correct

Install

$ pip install BS-RoFormer

Usage

import torch
from bs_roformer import BSRoformer

model = BSRoformer(
    dim = 512,
    depth = 12,
    time_transformer_depth = 1,
    freq_transformer_depth = 1
)

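# a batch of two mono waveforms, 352800 samples each (8 seconds if sampled at 44.1 kHz)
x = torch.randn(2, 352800)
target = torch.randn(2, 352800)

# passing a target returns the training loss
loss = model(x, target = target)
loss.backward()

# after much training, calling the model without a target returns the separated audio

out = model(x)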

To use the Mel-Band Roformer proposed in a recent follow-up paper, simply import MelBandRoformer instead:

import torch
from bs_roformer import MelBandRoformer

model = MelBandRoformer(
    dim = 32,
    depth = 1,
    time_transformer_depth = 1,
    freq_transformer_depth = 1
)

x = torch.randn(2, 352800)
target = torch.randn(2, 352800)

loss = model(x, target = target)
loss.backward()

# after much training

out = model(x)

Todo

  • get the multiscale stft loss in there
  • figure out what n_fft should be
  • review band split + mask estimation modules

Citations

@inproceedings{Lu2023MusicSS,
    title   = {Music Source Separation with Band-Split RoPE Transformer},
    author  = {Wei-Tsung Lu and Ju-Chiang Wang and Qiuqiang Kong and Yun-Ning Hung},
    year    = {2023},
    url     = {https://api.semanticscholar.org/CorpusID:261556702}
}
@inproceedings{Wang2023MelBandRF,
    title   = {Mel-Band RoFormer for Music Source Separation},
    author  = {Ju-Chiang Wang and Wei-Tsung Lu and Minz Won},
    year    = {2023},
    url     = {https://api.semanticscholar.org/CorpusID:263608675}
}
@misc{ho2019axial,
    title  = {Axial Attention in Multidimensional Transformers},
    author = {Jonathan Ho and Nal Kalchbrenner and Dirk Weissenborn and Tim Salimans},
    year   = {2019},
    archivePrefix = {arXiv}
}
@misc{su2021roformer,
    title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
    author  = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},
    year    = {2021},
    eprint  = {2104.09864},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
@inproceedings{dao2022flashattention,
    title   = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
    author  = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
    booktitle = {Advances in Neural Information Processing Systems},
    year    = {2022}
}
@article{Bondarenko2023QuantizableTR,
    title   = {Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing},
    author  = {Yelysei Bondarenko and Markus Nagel and Tijmen Blankevoort},
    journal = {ArXiv},
    year    = {2023},
    volume  = {abs/2306.12929},
    url     = {https://api.semanticscholar.org/CorpusID:259224568}
}
@inproceedings{ElNouby2021XCiTCI,
    title   = {XCiT: Cross-Covariance Image Transformers},
    author  = {Alaaeldin El-Nouby and Hugo Touvron and Mathilde Caron and Piotr Bojanowski and Matthijs Douze and Armand Joulin and Ivan Laptev and Natalia Neverova and Gabriel Synnaeve and Jakob Verbeek and Herv{\'e} J{\'e}gou},
    booktitle = {Neural Information Processing Systems},
    year    = {2021},
    url     = {https://api.semanticscholar.org/CorpusID:235458262}
}
@inproceedings{Zhou2024ValueRL,
    title   = {Value Residual Learning For Alleviating Attention Concentration In Transformers},
    author  = {Zhanchao Zhou and Tianyi Wu and Zhiyun Jiang and Zhenzhong Lan},
    year    = {2024},
    url     = {https://api.semanticscholar.org/CorpusID:273532030}
}