MS-sg: A baseline system, where `sg` stands for stop gradient. The acoustic model was trained using only the objective in the acoustic feature domain. The prenets were not used. The waveform was synthesized by feeding the predicted acoustic features to the mel-cepstral synthesis filter.
MS: Same as MS-sg except that the acoustic model was trained using two objectives: the Gaussian loss in the acoustic feature domain and the multi-resolution STFT loss in the waveform domain.
PMS-sg: The acoustic model and the prenets were trained simultaneously using the two objectives. However, the STFT loss was propagated to the acoustic model only through the hidden variables conditioned on the prenets, i.e., the mel-cepstral coefficients were not explicitly affected by the STFT loss.
PMS: The acoustic model and the prenets were simultaneously trained using the two objectives.
PN: The acoustic model was trained using only the objective in the acoustic feature domain. The waveform was synthesized by feeding the predicted acoustic features to PeriodNet, a neural vocoder suitable for singing speech synthesis. The PeriodNet model was trained separately on the same training data.
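
The effect of the `sg` suffix can be illustrated with a toy example. The sketch below is not the implementation used in these systems: it is a minimal forward-mode autodiff in plain Python with a single scalar weight, and every name in it (`Dual`, `stop_gradient`, `total_loss`, the constants) is hypothetical. It only shows that applying a stop gradient at the input of a differentiable synthesis stage leaves the loss value unchanged while removing the waveform-domain term from the gradient.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Forward-mode autodiff value: `val` plus its derivative `grad` w.r.t. one weight."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.grad + other.grad)

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.val - other.val, self.grad - other.grad)

    def __mul__(self, other):
        other = _lift(other)
        # Product rule.
        return Dual(self.val * other.val,
                    self.grad * other.val + self.val * other.grad)


def _lift(x):
    """Treat plain numbers as constants (zero derivative)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def stop_gradient(x):
    """Keep the forward value but discard the derivative."""
    return Dual(x.val, 0.0)


def total_loss(w, use_stop_gradient):
    """Toy two-objective loss; all constants are arbitrary assumptions."""
    w = Dual(w, 1.0)                 # differentiate w.r.t. this single toy weight
    feature = w * 2.0                # stand-in for a predicted acoustic feature
    diff_f = feature - 1.0
    feature_loss = diff_f * diff_f   # stand-in for the feature-domain objective

    filter_input = stop_gradient(feature) if use_stop_gradient else feature
    waveform = filter_input * 3.0    # stand-in for a differentiable synthesis filter
    diff_w = waveform - 0.5
    waveform_loss = diff_w * diff_w  # stand-in for the waveform-domain objective
    return feature_loss + waveform_loss


with_sg = total_loss(1.0, use_stop_gradient=True)
without_sg = total_loss(1.0, use_stop_gradient=False)
print(with_sg.val, with_sg.grad)        # 31.25 4.0
print(without_sg.val, without_sg.grad)  # 31.25 70.0
```

In this toy setting, the variant with the stop gradient receives a gradient from the feature-domain term only, which mirrors the distinction between the `-sg` systems and their counterparts at the level of the mel-cepstral coefficients.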