PeriodCodec: A Pitch-Controllable Neural Audio Codec Using Periodic Signals for Singing Voice Synthesis

*Masato Takagi, Miku Nishihara, Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda

Nagoya Institute of Technology, Japan

PeriodCodec Architecture

Abstract

Neural audio codecs (NACs) have attracted considerable attention in the field of text-to-speech. However, previous methods offer no mechanism for explicitly controlling the fundamental frequency (F0), which makes them unsuitable for singing voice synthesis. To overcome this limitation, we propose a NAC that can control F0 by introducing explicit periodic signals into the decoder. This architecture enables direct manipulation of F0 during synthesis. Experimental results show that the proposed method achieves F0 control and improves synthesis quality compared to previous methods. Furthermore, we show that including singing voices in the training dataset improves both F0 controllability and singing voice quality, enabling the construction of a NAC suitable for singing voice synthesis tasks.
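
To make the idea of an explicit periodic signal concrete, the following is a minimal sketch of one common way to derive such a signal from a frame-level F0 contour: hold each frame's F0 over its samples and accumulate phase into a sine wave. This is an illustration only, not the authors' generator or downsampler. The function name `sine_from_f0`, the sine waveform, the zero output for unvoiced frames, and the `sample_rate` and `hop_length` defaults are all assumptions made for this example.

```python
import math


def sine_from_f0(f0_frames, sample_rate=24000, hop_length=240):
    """Turn a frame-level F0 contour (Hz) into a sample-level sine signal.

    Hypothetical helper for illustration: each frame's F0 is held for
    `hop_length` samples, and the phase is accumulated so the waveform
    stays continuous across frame boundaries. Frames with f0 <= 0 are
    treated as unvoiced and produce zeros (an assumption of this sketch).
    """
    two_pi = 2.0 * math.pi
    signal = []
    phase = 0.0
    for f0 in f0_frames:
        for _ in range(hop_length):
            if f0 > 0:
                # Advance the phase by one sample's worth of rotation.
                phase = (phase + two_pi * f0 / sample_rate) % two_pi
                signal.append(math.sin(phase))
            else:
                signal.append(0.0)
    return signal
```

Because the signal depends only on the F0 contour, editing the contour before calling the function changes the period of the generated waveform, which is the kind of direct F0 manipulation the abstract describes.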

Methods

Columns: Name; Explicit Periodic Signal Generator & Downsampler (Proposed); Pitch Predictor with Gradient Reversal Layer (Proposed); Training with singing voice data (GTSinger) (Further Experiments)
Base
Base+GT
Period
Period+GT
Period-GRL
Period-GRL+GT

Audio Samples

Select a speaker ID
Select a pitch shift value

Natural (Ground Truth)

Base

Base+GT

Period

Period+GT

Period-GRL

Period-GRL+GT

Manual Pitch Editing Demo

Natural

Add Trill

Period+GT

Period-GRL+GT

Auto Tuned (Pitch Quantized)

Period+GT

Period-GRL+GT

Pitch Jump (+6)

Period+GT

Period-GRL+GT

Time Vibrato

Period+GT

Period-GRL+GT
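
The pitch edits demonstrated above all amount to transforming the F0 contour before synthesis. The sketch below shows how edits of this kind could be expressed on a frame-level contour. It is a hypothetical illustration, not the procedure used to produce the demo samples: the function names, the reading of "+6" as semitones, the 440 Hz reference for quantization, and the vibrato rate, depth, and frame rate are all assumptions.

```python
import math


def shift_semitones(f0_hz, semitones):
    """Scale voiced F0 values by a fixed number of semitones (assumed unit)."""
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor if f > 0 else 0.0 for f in f0_hz]


def quantize_to_semitones(f0_hz, ref_hz=440.0):
    """Snap voiced F0 values to the nearest equal-tempered semitone."""
    out = []
    for f in f0_hz:
        if f > 0:
            steps = round(12.0 * math.log2(f / ref_hz))
            out.append(ref_hz * 2.0 ** (steps / 12.0))
        else:
            out.append(0.0)  # keep unvoiced frames unvoiced
    return out


def add_vibrato(f0_hz, rate_hz=5.5, depth_semitones=0.5, frame_rate=100.0):
    """Add a sinusoidal pitch modulation to voiced frames."""
    out = []
    for i, f in enumerate(f0_hz):
        offset = depth_semitones * math.sin(2.0 * math.pi * rate_hz * i / frame_rate)
        out.append(f * 2.0 ** (offset / 12.0) if f > 0 else 0.0)
    return out
```

For example, `shift_semitones(contour, 6)` raises every voiced frame by six semitones, and `quantize_to_semitones(contour)` flattens small pitch deviations onto the semitone grid, which gives the characteristic pitch-quantized sound.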