This is the homepage for the work PHONEix.
"PHONEix: Acoustic Feature Processing Strategy for Enhanced Singing Pronunciation with Phoneme Distribution Predictor" (submitted to ICASSP 2023). The paper will be released soon.
We have merged Muskits with ESPnet, and the details of the implementation can be found at https://github.com/A-Quarter-Mile/espnet/tree/alf.
- The installation of ESPnet can follow the official doc here.
- Then, some extra packages for SVS need to be installed. The installation script is in `tools/installers/install_muskit.sh`.
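  A rough sketch of these two steps, assuming a standard ESPnet checkout where `tools/activate_python.sh` is created during installation (adjust paths to your setup):

  ```bash
  # Clone the PHONEix branch of ESPnet.
  git clone -b alf https://github.com/A-Quarter-Mile/espnet.git
  cd espnet/tools
  # Set up ESPnet itself per the official documentation (omitted here), then:
  . ./activate_python.sh              # Python environment installed under tools/
  bash installers/install_muskit.sh   # extra packages for SVS
  ```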
- Download datasets
  - The Japanese singing dataset Ofuton should be downloaded here.
  - The Chinese singing dataset Opencpop should be downloaded here.
- Change directory to the base directory:

  ```bash
  # e.g.
  cd egs2/ofuton/svs1/
  ```

  You can run any other recipe in the same way, e.g., `opencpop`. Keep in mind that all scripts should be run at the level of `egs2/*/svs1`.
- Change the configuration. The directory structure is as follows:

  ```
  egs2/ofuton/svs1/
   - conf/       # Configuration files for training, inference, etc.
   - scripts/    # Bash utilities of muskit
   - pyscripts/  # Python utilities of muskit
   - utils/      # From Kaldi utilities
   - db.sh       # The directory path of each corpus
   - path.sh     # Setup script for environment variables
   - cmd.sh      # Configuration for the backend of the job scheduler
   - run.sh      # Entry point
   - svs.sh      # Invoked by run.sh
  ```
- You need to modify `db.sh` to specify your corpus before executing `run.sh`. For example, when you work on the recipe of `egs2/ofuton`, you need to change the path of `OFUTON` in `db.sh`.
- `path.sh` is used to set up the environment for `run.sh`. Note that the Python interpreter used for ESPnet is not the current Python of your terminal, but the Python installed under `tools/`. Thus you need to source `path.sh` to use this Python:

  ```bash
  . path.sh
  python
  ```

- `cmd.sh` is used for specifying the backend of the job scheduler. If you don't have such a system in your local machine environment, you don't need to change anything about this file. See "Using Job scheduling system".
- `run.sh` is an example script to run all stages. To run the model with PHONEix, you need to set `--train_config train_naive_rnn_dp_alf.yaml` for the LSTM-based model and `--train_config train_xiaoice_alf.yaml` for the Transformer-based model.
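  For instance, a minimal sketch of launching training with the PHONEix configurations named above (the exact location of these YAML files under `conf/` depends on the recipe, so check your `conf/` directory first):

  ```bash
  # LSTM-based model with the PHONEix strategy
  ./run.sh --train_config train_naive_rnn_dp_alf.yaml

  # Transformer-based model with the PHONEix strategy
  ./run.sh --train_config train_xiaoice_alf.yaml
  ```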
- Run `run.sh`:

  ```bash
  ./run.sh
  ```

  The procedures in `run.sh` can be divided into several stages, e.g., data preparation, training, and evaluation. You can specify the starting stage and the stopping stage:

  ```bash
  ./run.sh --stage 2 --stop-stage 6
  ```
  There are also some alternative options to skip specified stages:

  ```bash
  ./run.sh --skip_data_prep true   # Skip data preparation stages.
  ./run.sh --skip_train true       # Skip training stages.
  ./run.sh --skip_eval true        # Skip decoding and evaluation stages.
  ./run.sh --skip_upload false     # Enable packing and uploading stages.
  ```
We provide three objective evaluation metrics:
- Mel-cepstral distortion (MCD)
- Voiced / unvoiced error rate (VUV_E)
- Logarithmic root mean square error of the fundamental frequency (FRMSE)
We apply dynamic time-warping (DTW) to match the length difference between ground-truth singing and generated singing.
Here we show the example command to calculate objective metrics:
```bash
cd egs2/<recipe_name>/svs1
. ./path.sh
# Evaluate MCD
./pyscripts/utils/evaluate_mcd.py \
    exp/<model_dir_name>/<decode_dir_name>/eval/wav/wav.scp \
    dump/raw/eval/wav.scp
```
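The F0-related metrics can be computed similarly. As a sketch, assuming the utilities include a companion `evaluate_f0.py` script as in the ESPnet TTS recipes (check `pyscripts/utils/` for the exact script name and options):

```bash
# Evaluate log-F0 RMSE and voiced/unvoiced error rate
# (assumption: a single script reports both; verify against pyscripts/utils/).
./pyscripts/utils/evaluate_f0.py \
    exp/<model_dir_name>/<decode_dir_name>/eval/wav/wav.scp \
    dump/raw/eval/wav.scp
```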
While these objective metrics can estimate the quality of synthesized singing, it is still difficult to fully determine human perceptual quality from these values, especially with high-fidelity generated singing. Therefore, we recommend performing a subjective evaluation (e.g., MOS) if you want to check perceptual quality in detail.
You can refer to this page to launch a web-based subjective evaluation system with webMUSHRA.
We follow the processing pipeline in Xiaoice. It accepts score features at the note level and expands frame lengths according to the phoneme durations produced by HMM-based forced alignment.
The code is in the same repository as PHONEix, under https://github.com/A-Quarter-Mile/espnet/tree/alf. To run the model with the Type1 strategy, you need to set `--train_config train_naive_rnn_dp.yaml` for the LSTM-based model and `--train_config train_xiaoice.yaml` for the Transformer-based model. The other operations are the same as in Codes for PHONEix.
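For example (a sketch mirroring the PHONEix commands above, with the Type1 configurations):

```bash
./run.sh --train_config train_naive_rnn_dp.yaml   # Type1, LSTM-based model
./run.sh --train_config train_xiaoice.yaml        # Type1, Transformer-based model
```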
Type2 uses the annotated phoneme time sequence as the acoustic encoder input and as the length-expansion ground truth for the length regulator. We analyze the acoustic features following this work.
The code can be found in Muskit. Muskit follows ESPnet- and Kaldi-style data processing; its main structure and base code are adapted from ESPnet. Installation and running instructions are provided in Muskit.
To run the model with the Type2 strategy in Muskit, you need to set `--train_config train_naive_rnn_dp.yaml` for the LSTM-based model and `--train_config train_xiaoice.yaml` for the Transformer-based model. The other operations are the same as in Codes for PHONEix.
The examples below compare the proposed acoustic processing strategy, PHONEix, with the baselines (i.e., Type1 and Type2). They are evaluated with Bi-LSTM- or Transformer-based encoder-decoder structures on Ofuton and Opencpop.
The audio samples are organized as `Examples/<dataset>/<model>_<strategy>/*.wav`; e.g., `Examples/Ofuton/LSTM_PHONEix/1.wav` is the result for the LSTM-based model with the PHONEix strategy on the Ofuton dataset.