Singing Voice Synthesis Based on a Musical Note Position-Aware Attention Mechanism
Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda
Nagoya Institute of Technology, Nagoya, Japan
Accepted for ICASSP 2023 (Preprint: arXiv:2212.13703)
Abstract
This paper proposes a novel sequence-to-sequence (seq2seq) model with a musical note position-aware attention mechanism for singing voice synthesis (SVS). A seq2seq modeling approach that can simultaneously perform acoustic and temporal modeling is attractive. However, due to the difficulty of the temporal modeling of singing voices, many recent SVS systems with an encoder-decoder-based model still rely explicitly on duration information generated by additional modules. Although some studies perform simultaneous modeling using seq2seq models with an attention mechanism, their temporal modeling is not sufficiently robust. The proposed attention mechanism is designed to estimate the attention weights by considering the rhythm given by the musical score. Furthermore, several techniques are introduced to improve the modeling performance of the singing voice. Experimental results indicate that the proposed model is effective in terms of both naturalness and robustness of timing.
Audio samples (Japanese)
The samples at the bottom of each row have a click sound superimposed on them; the click was generated from the tempo of the score.
In the mean opinion score (MOS) tests, the superimposed samples were used to evaluate overall naturalness, taking the vocal timing into account.
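As a rough illustration of how a click track can be derived from a score tempo and mixed into a sample, the following minimal sketch may help. It is not the procedure used to prepare the samples on this page: the constant tempo, the click on every beat starting at time 0, the 1 kHz decaying sine burst, and the function names `beat_times` and `superimpose_clicks` are all assumptions made for illustration.

```python
import math

def beat_times(tempo_bpm, duration_sec):
    """Beat onset times in seconds for a constant tempo, starting at 0.

    Assumption: one click per beat; the page does not state the click spacing.
    """
    interval = 60.0 / tempo_bpm  # seconds per beat
    times = []
    k = 0
    while k * interval < duration_sec:
        times.append(k * interval)
        k += 1
    return times

def superimpose_clicks(wave, sample_rate, tempo_bpm,
                       click_freq=1000.0, click_len=0.02, click_gain=0.5):
    """Return a copy of `wave` (floats in [-1, 1]) with a click at every beat.

    The click shape (short decaying sine) is a hypothetical choice.
    """
    out = list(wave)
    n = int(round(click_len * sample_rate))  # click length in samples
    click = [
        click_gain
        * math.sin(2.0 * math.pi * click_freq * i / sample_rate)
        * math.exp(-4.0 * i / n)  # exponential decay over the click
        for i in range(n)
    ]
    for onset in beat_times(tempo_bpm, len(out) / sample_rate):
        start = int(round(onset * sample_rate))
        for i, c in enumerate(click):
            if start + i >= len(out):
                break  # click runs past the end of the sample
            # Mix and clip so the result stays in [-1, 1].
            out[start + i] = max(-1.0, min(1.0, out[start + i] + c))
    return out

# Usage: two seconds of silence at 16 kHz with a 120 BPM click track.
silence = [0.0] * 32000
mixed = superimpose_clicks(silence, sample_rate=16000, tempo_bpm=120)
```

With the click audible on every beat, a listener can judge whether the sung notes start on time relative to the score, which is the aspect of naturalness the superimposed samples are meant to expose.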
Reference sample
System | Sample 1 | Sample 2 |
---|---|---|
Natural | | |
Experiment 1
System | Note position feature $\pmb{p}_{t,n}$ | Auxiliary note feature | Guided attention loss $\mathcal{L}_{att}$ | Sample 1 | Sample 2 |
---|---|---|---|---|---|
Base | | | | | |
NF | | ✓ | | | |
NP | ✓ | | | | |
NP+NF | ✓ | ✓ | | | |
Prop | ✓ | ✓ | ✓ | | |
Experiment 2
System | Attention mechanism | Transition probability: phone-dependent | Transition probability: time-variant | Sample 1 | Sample 2 |
---|---|---|---|---|---|
NoAtt | | | | | |
NoTrans | ✓ | | | | |
P-Trans | ✓ | ✓ | | | |
T-Trans | ✓ | | ✓ | | |
Prop | ✓ | ✓ | ✓ | | |