PeriodNet: A non-autoregressive raw waveform generative model with a structure separating periodic and aperiodic components

Yukiya Hono, Shinji Takaki, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda

Department of Computer Science, Nagoya Institute of Technology, Nagoya, Japan

Abstract

This paper presents PeriodNet, a non-autoregressive (non-AR) waveform generative model with a new model structure for modeling periodic and aperiodic components in speech waveforms. Non-AR raw waveform generative models have enabled the fast generation of high-quality waveforms. However, the variations of waveforms that these models can reconstruct are limited by training data. In addition, typical non-AR models reconstruct a speech waveform from a single Gaussian input despite the mixture of periodic and aperiodic signals in speech. These may significantly affect the waveform generation process in some applications such as singing voice synthesis systems, which require reproducing accurate pitch and natural sounds with less periodicity, including husky and breath sounds. PeriodNet uses a parallel or series model structure to model a speech waveform to tackle these problems. Two sub-generators connected in parallel or in series take an explicit periodic and aperiodic signal (sine wave and Gaussian noise) as an input. Since PeriodNet models periodic and aperiodic components by focusing on whether these input signals are autocorrelated or not, it does not require external periodic/aperiodic decomposition during training. Experimental results show that our proposed structure improves the naturalness of generated waveforms. We also show that speech waveforms with a pitch outside of the training data range can be generated with more naturalness.

Audio samples (Japanese)

We used a Japanese singing voice corpus with two female and two male singers to evaluate the performance of PeriodNet.
Note that all samples on this page were selected from the test set.

Experiment 1 (Section IV)
Experiment 2 (Section V)
Experiment 3 (Section VI)

Experiment 1

This section investigates the effectiveness of PeriodNet using a single female singer dataset.
Seventy Japanese children’s songs (total: 70 min) performed by one female singer (F01) were used.
Singing voice signals were sampled at 48 kHz, and each sample was quantized with 16 bits.
We used one AR baseline model (AR), three non-AR baseline models (BM1, BM2, BM3), and three non-AR proposed models (PM1, PM2, SM).
AR denotes the AR WaveNet that models 48-kHz waveform samples with 8-bit $\mu$-law quantization.
All non-AR systems (BM1, BM2, BM3, PM1, PM2, SM) model 48-kHz waveform samples directly.
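
For readers unfamiliar with the 8-bit $\mu$-law quantization used by the AR baseline, the following is a minimal sketch of standard $\mu$-law companding. The function names are hypothetical and the code assumes waveform samples normalized to [-1, 1]; it is not taken from the authors' implementation.

```python
import numpy as np

def mu_law_encode(x, bits=8):
    """Compress samples in [-1, 1] and quantize them to 2**bits integer classes."""
    mu = 2**bits - 1
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0)
    # Logarithmic companding: finer resolution near zero amplitude.
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # Map [-1, 1] to integer classes 0 .. mu.
    return np.round((y + 1.0) / 2.0 * mu).astype(np.int64)

def mu_law_decode(q, bits=8):
    """Invert mu_law_encode up to quantization error."""
    mu = 2**bits - 1
    y = 2.0 * np.asarray(q, dtype=np.float64) / mu - 1.0
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu
```

With 8 bits the AR model predicts one of 256 classes per sample, whereas the non-AR systems output continuous sample values directly.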

1) Comparison of AR/non-AR models and the input signals of non-AR models

This experiment shows the effectiveness of using both explicit periodic and aperiodic signals as inputs for non-AR waveform generative models.
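
As a rough illustration of the two explicit input signals (sine wave and Gaussian noise), the sketch below builds both from a per-sample $F_0$ contour and checks that only the sine wave is autocorrelated at a lag of one pitch period. The function names, the zeroing of unvoiced samples, and the constant 440 Hz contour are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def make_excitations(f0_hz, sample_rate=48000, seed=0):
    """Return (sine, noise) input signals for a per-sample F0 contour.

    f0_hz: 1-D array with one F0 value in Hz per waveform sample;
           0 marks unvoiced samples (an assumption of this sketch).
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    # Integrate the instantaneous frequency to obtain a continuous phase.
    phase = 2.0 * np.pi * np.cumsum(f0_hz) / sample_rate
    sine = np.sin(phase)
    sine[f0_hz <= 0.0] = 0.0  # no periodic input where unvoiced
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(f0_hz.shape)
    return sine, noise

def normalized_autocorr(x, lag):
    """Normalized autocorrelation of x at a positive integer lag."""
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# One second of a constant 440 Hz contour at 48 kHz (hypothetical input).
f0 = np.full(48000, 440.0)
sine, noise = make_excitations(f0)
period = round(48000 / 440.0)  # about one pitch period, in samples

sine_corr = normalized_autocorr(sine, period)    # close to 1
noise_corr = normalized_autocorr(noise, period)  # close to 0
```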

	AR (with $\mu$-law quantization)	BM1	BM2	BM3	NAT

Sample 1
Sample 2

2) Comparison of model structures of non-AR waveform generative models

We performed the subjective evaluation with original $F_0$ and upward-shifted $F_0$.
These results indicated that the proposed model structures were effective in generating waveforms with pitches outside the range of training data.
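
The upward shift is expressed in cents, where 1200 cents correspond to one octave. A minimal sketch of the conversion is given below; the function name is hypothetical, and treating 0 Hz as the unvoiced marker is an assumption of this example.

```python
import numpy as np

def shift_f0_cents(f0_hz, cents):
    """Scale an F0 contour in Hz by a pitch offset given in cents."""
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    # 1200 cents = 1 octave = a factor of 2 in frequency.
    # Unvoiced samples marked with 0 Hz stay at 0 Hz.
    return f0_hz * 2.0 ** (cents / 1200.0)

shifted = shift_f0_cents([220.0, 0.0, 440.0], 1200)  # one octave up
```

A shift of +1200 cents therefore doubles every voiced $F_0$ value, which moves much of the contour above the pitch range seen in training.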



Original $F_0$ scale

	BM3	PM1	PM2	SM	NAT

Sample 1
Sample 2

Upward-shifted $F_0$ (+1200 cents)

	BM3	PM1	PM2	SM

Sample 1
Sample 2

Examples of generated periodic and aperiodic components

Model	Original scale	+1200 cents

PM1

PM2

SM

Experiment 2

This section compares PeriodNet with systems trained on pre-decomposed periodic and aperiodic waveforms.
This comparison aims to evaluate the performance of PeriodNet in modeling speech waveforms while appropriately separating periodic and aperiodic components during the training process.
We used the harmonic plus residual model (HPR) for explicit periodic/aperiodic decomposition.
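
To give an intuition for what an explicit periodic/aperiodic decomposition does, the following is a heavily simplified single-frame sketch: it fits sinusoids at integer multiples of a known $F_0$ by least squares and treats the remainder as the aperiodic part. This is only an illustration under stated assumptions (known constant $F_0$ within the frame, a fixed number of harmonics, hypothetical function name); it is not the HPR implementation used in the experiments.

```python
import numpy as np

def harmonic_plus_residual(frame, f0_hz, sample_rate=48000, n_harmonics=20):
    """Split one frame into a harmonic part and a residual part.

    Assumes f0_hz is known and constant within the frame (sketch only).
    """
    frame = np.asarray(frame, dtype=np.float64)
    t = np.arange(frame.size) / sample_rate
    k = np.arange(1, n_harmonics + 1)
    k = k[k * f0_hz < sample_rate / 2]  # keep harmonics below Nyquist
    angles = 2.0 * np.pi * f0_hz * np.outer(t, k)
    # Cosine and sine columns let the fit choose amplitude and phase.
    basis = np.hstack([np.cos(angles), np.sin(angles)])
    coef, *_ = np.linalg.lstsq(basis, frame, rcond=None)
    harmonic = basis @ coef
    residual = frame - harmonic
    return harmonic, residual
```

By construction the two parts sum back to the input frame. PeriodNet, by contrast, needs no such decomposition step during training.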

Subjective evaluation using original $F_0$

We compared the PeriodNet parallel model (PM1 in Experiment 1) and several variants of systems incorporating HPR (see paper for details).
The experimental results indicate that PeriodNet with the proposed structures can model periodic and aperiodic components appropriately without any explicit decomposition process for these components.

	ORG/ORG (=PM1)	HPR/ORG	ORG/HPR	HPR/HPR	NAT
Sample 1
Sample 2

Examples of pre-decomposed and generated periodic and aperiodic components

Target waveform

	Natural waveform as a target waveform of generator	Decomposed waveform as a target waveform of generator
Use standard auxiliary feature	ORG/ORG	HPR/ORG
Use HPR-based auxiliary feature	ORG/HPR	HPR/HPR

Experiment 3

This section examines the effectiveness of the proposed method on datasets from different singers.
We used one other female singer (F02) and two male singers (M01 and M02).
The dataset of each singer consisted of the same 70 Japanese children's songs as for F01, although the key and tempo of some songs differed between singers.
Sample 1 of F02, M01, and M02 and Sample 2 of F02 contain the same phrases as the corresponding F01 samples, differing only in octave and tempo.
Sample 2 of M01 and M02 is an example of a phrase that contains higher tones than Sample 1.

Subjective evaluation using original $F_0$

In the experiments using the original $F_0$, the MOS results for F02, M01, and M02 showed trends similar to those for F01.



	BM1	BM3	PM1	PM2	SM	NAT

Female singer: F02
Sample 1
Sample 2
Male singer: M01
Sample 1
Sample 2
Male singer: M02
Sample 1
Sample 2

Subjective evaluation using upward-shifted $F_0$

The MOS results for F02 showed the same trend as those for F01.
For the male singers, on the other hand, the proposed systems outperformed BM3, but there was no significant difference between PM1, PM2, and SM for either M01 or M02.
We attribute this difference in trend to the lower pitch of the male singing voices compared with the female singing voices and to the characteristics of the individual singers.



	BM3	PM1	PM2	SM

Female singer: F02 (+1200 cents)
Sample 1
Sample 2
Male singer: M01 (+1600 cents)
Sample 1
Sample 2
Male singer: M02 (+1600 cents)
Sample 1
Sample 2

Reference

@article{hono2021periodnet,
  title={PeriodNet: A Non-Autoregressive Raw Waveform Generative Model With a Structure Separating Periodic and Aperiodic Components},
  author={Hono, Yukiya and Takaki, Shinji and Hashimoto, Kei and Oura, Keiichiro and Nankaku, Yoshihiko and Tokuda, Keiichi},
  journal={IEEE Access},
  year={2021},
  volume={9},
  pages={137599-137612},
  doi={10.1109/ACCESS.2021.3118033},
}