WaveVC: Speech and Fundamental Frequency Consistent Raw Audio Voice Conversion

Kyungdeuk Ko, Donghyeon Kim, Kyungseok Oh, Hanseok Ko

Research output: Contribution to journal › Article › peer-review

Abstract

Voice conversion (VC) is the task of changing the speech of a source speaker so that it sounds like a target voice while preserving the linguistic information of the source speech. Existing VC methods typically use mel-spectrograms as both input and output, so a separate vocoder is required to transform the mel-spectrogram into a waveform. As a result, VC performance depends on vocoder performance, and noisy speech can be generated because of problems such as train-test mismatch. In this paper, we propose WaveVC, a raw audio voice conversion method that is consistent in both speech and fundamental frequency (F0). Unlike other methods, WaveVC does not require a separate vocoder and performs VC directly on the raw audio waveform using 1D convolution. This eliminates the performance degradation caused by the train-test mismatch of the vocoder. In the training phase, WaveVC employs a speech loss and an F0 loss, which rely on pre-trained networks, to preserve the content of the source speech and to generate F0-consistent speech. WaveVC can therefore convert voices while keeping speech content and fundamental frequency consistent. In the test phase, the F0 feature of the source speech is concatenated with a content embedding vector so that the converted speech follows the F0 contour of the source speech. WaveVC achieves higher performance than baseline methods in both many-to-many VC and any-to-any VC. The converted samples are available online.
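The test-phase step described in the abstract, concatenating the source F0 feature with the content embedding, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `normalize_log_f0` and `concat_f0`, the log-F0 normalization, the convention that 0 Hz marks an unvoiced frame, and the toy dimensions are all assumptions made for illustration, and the sketch assumes the F0 contour and the content embedding are already frame-aligned.

```python
import math

def normalize_log_f0(f0_hz):
    """Map an F0 contour in Hz to zero-mean, unit-variance log-F0 over voiced frames.

    Unvoiced frames (F0 == 0) are kept at 0.0. This normalization is an
    assumption for illustration; the abstract does not specify one.
    """
    voiced = [math.log(f) for f in f0_hz if f > 0]
    if not voiced:
        return [0.0] * len(f0_hz)
    mean = sum(voiced) / len(voiced)
    var = sum((v - mean) ** 2 for v in voiced) / len(voiced)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for a flat contour
    return [(math.log(f) - mean) / std if f > 0 else 0.0 for f in f0_hz]

def concat_f0(content, f0_feature):
    """Append one F0 value to each content-embedding frame (channel-wise concat)."""
    if len(content) != len(f0_feature):
        raise ValueError("content and F0 must have the same number of frames")
    return [list(frame) + [f0] for frame, f0 in zip(content, f0_feature)]

# Toy example: 4 frames, 3-dimensional content embedding (hypothetical sizes).
content = [[0.1, 0.2, 0.3], [0.0, 0.5, 0.1], [0.4, 0.4, 0.4], [0.9, 0.1, 0.0]]
f0_hz = [110.0, 0.0, 220.0, 220.0]  # 0.0 marks an unvoiced frame

decoder_input = concat_f0(content, normalize_log_f0(f0_hz))
print(len(decoder_input), len(decoder_input[0]))  # frames, channels per frame
```

Each frame of `decoder_input` carries the original content channels plus one extra F0 channel, which is the shape a downstream decoder would consume under these assumptions.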

Original language: English
Article number: 166
Journal: Neural Processing Letters
Volume: 56
Issue number: 4
Publication status: Published - 2024 Aug

Bibliographical note

Publisher Copyright:
© The Author(s) 2024.

Keywords

  • Adversarial training
  • Deep learning
  • Voice conversion

ASJC Scopus subject areas

  • Software
  • General Neuroscience
  • Computer Networks and Communications
  • Artificial Intelligence
