Singing voice synthesis based on frame-level sequence-to-sequence models considering vocal timing deviation

Miku Nishihara, Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda

Department of Computer Science, Nagoya Institute of Technology, Nagoya, Japan

Abstract

This paper proposes singing voice synthesis (SVS) based on frame-level sequence-to-sequence models considering vocal timing deviation. In SVS, it is essential to synchronize the timing of singing with temporal structures represented by scores, taking into account that there are differences between actual vocal timing and note start timing. In many SVS systems including our previous work, phoneme-level score features are converted into frame-level ones on the basis of phoneme boundaries obtained by external aligners to take into account note durations of the musical scores and vocal timing deviations. Therefore, the naturalness of the synthesized singing voices is affected by the aligner accuracy in conventional systems. To alleviate this problem, we introduce an attention mechanism with frame-level features. In the proposed system, the attention mechanism absorbs alignment errors in phoneme boundaries. Additionally, we evaluate the system with pseudo-phoneme-boundaries defined by heuristic rules based on musical scores when there is no aligner. The experimental results show the effectiveness of the proposed system.
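The conversion from phoneme-level to frame-level score features described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `expand_to_frames`, the 5 ms frame shift, and the choice of per-frame position features are all assumptions made for illustration. It only shows why frame-level features inherit any error in the phoneme boundaries they are built from.

```python
from typing import List, Sequence, Tuple


def expand_to_frames(
    phoneme_features: Sequence[str],
    boundaries_sec: Sequence[Tuple[float, float]],
    frame_shift_sec: float = 0.005,  # assumed frame shift, not taken from the paper
) -> List[Tuple[str, int, int]]:
    """Repeat each phoneme-level feature once per frame the phoneme spans.

    Returns one (feature, position_in_phoneme, phoneme_length_in_frames)
    tuple per frame. The boundaries may come from an aligner, a duration
    model, or heuristic rules; the expansion itself does not care.
    """
    if len(phoneme_features) != len(boundaries_sec):
        raise ValueError("need one (start, end) pair per phoneme")

    frames: List[Tuple[str, int, int]] = []
    for feature, (start, end) in zip(phoneme_features, boundaries_sec):
        # Convert boundary times to frame indices; rounding avoids
        # off-by-one errors caused by floating-point division.
        first = round(start / frame_shift_sec)
        last = round(end / frame_shift_sec)
        n_frames = last - first
        for i in range(n_frames):
            frames.append((feature, i, n_frames))
    return frames


# Hypothetical two-phoneme example: "s" spans 0-10 ms, "a" spans 10-30 ms.
frames = expand_to_frames(["s", "a"], [(0.0, 0.01), (0.01, 0.03)])
```

If a boundary is misplaced, every frame between the wrong and the correct position carries the wrong phoneme's features, which is the kind of error the attention mechanism is intended to absorb.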

Audio samples

A click sound generated from the tempo of the score was mixed into the synthesized singing voices.

System	Attention mechanism	Phoneme boundaries (training)	Phoneme boundaries (synthesis)	MOS (quality)	MOS (timing)	Sample 1	Sample 2
fal w/o att		forced alignment	time-lag model and phoneme duration model	3.63	4.28
fal w/ att	✓	forced alignment	time-lag model and phoneme duration model	3.91	4.31
model w/ att	✓	time-lag model and phoneme duration model	time-lag model and phoneme duration model	3.61	3.93
pseudo w/ att	✓	pseudo-phoneme-boundaries	pseudo-phoneme-boundaries	3.68	3.99