Recently, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs) have been applied to voice conversion with non-parallel training data. In particular, Conditional Cycle-Consistent Generative Adversarial Networks (CC-GANs) and Cycle-Consistent Variational AutoEncoders (CycleVAEs) have shown promising results in many-to-many voice conversion among multiple speakers. However, conventional voice conversion studies using CC-GANs and CycleVAEs have involved a relatively small number of speakers. In this paper, we extend the number of speakers to 100 and experimentally analyze the performance of these many-to-many voice conversion methods. The experiments show that the CC-GAN achieves 4.5 % lower Mel-Cepstral Distortion (MCD) for a small number of speakers, whereas the CycleVAE achieves 12.7 % lower MCD under a limited training time for a large number of speakers.
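For readers unfamiliar with the metric, the following is a minimal sketch of how MCD and a relative MCD reduction are commonly computed. It assumes the widely used definition MCD = (10 / ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²), averaged over time-aligned frames; the abstract does not state which MCD variant or alignment procedure the paper uses, so this is not necessarily the authors' exact procedure. The function names and the example values are hypothetical and are not taken from the paper.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Average MCD in dB over frames that are already time-aligned.

    Each frame is a sequence of mel-cepstral coefficients. The caller is
    assumed to have dropped the 0th (energy) coefficient beforehand.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("need the same non-zero number of frames")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        if len(ref) != len(conv):
            raise ValueError("frames must have the same number of coefficients")
        squared_diff = sum((r - c) ** 2 for r, c in zip(ref, conv))
        total += scale * math.sqrt(squared_diff)
    return total / len(ref_frames)


def relative_reduction_percent(baseline_mcd, other_mcd):
    """Percentage by which other_mcd is lower than baseline_mcd."""
    return 100.0 * (baseline_mcd - other_mcd) / baseline_mcd


# Illustrative values only (not results from the paper).
reference = [[1.0, 0.5], [0.2, -0.1]]
converted = [[0.9, 0.4], [0.1, 0.0]]
print(mel_cepstral_distortion(reference, converted))
print(relative_reduction_percent(8.0, 7.0))
```

Under this reading, a statement such as "4.5 % lower MCD" would correspond to `relative_reduction_percent` returning 4.5 when one method's MCD is compared against the other's. Treating the reported percentages as relative reductions is itself an assumption.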
Bibliographical note
Publisher Copyright:
Copyright © 2022 The Acoustical Society of Korea.
Keywords
- Conditional Cycle-Consistent Generative Adversarial Network (CC-GAN)
- Cycle-Consistent Variational AutoEncoder (CycleVAE)
- Generative Adversarial Network (GAN)
- Variational AutoEncoder (VAE)
- Voice conversion
ASJC Scopus subject areas
- Signal Processing
- Acoustics and Ultrasonics
- Applied Mathematics
- Speech and Hearing