Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation

Ha Yeong Choi, Sang Hoon Lee, Seong Whan Lee

    Research output: Contribution to journal › Conference article › peer-review

    17 Citations (Scopus)

    Abstract

    Although voice conversion (VC) systems have shown a remarkable ability to transfer voice style, existing methods still suffer from inaccurate pitch and low speaker adaptation quality. To address these challenges, we introduce Diff-HierVC, a hierarchical VC system based on two diffusion models. We first introduce DiffPitch, which effectively generates F0 in the target voice style. The generated F0 is then fed to DiffVoice, which converts the speech to the target voice style. Furthermore, using the source-filter encoder, we disentangle the speech and use the converted Mel-spectrogram as a data-driven prior in DiffVoice to improve the voice style transfer capacity. Finally, by using the masked prior in diffusion models, our model improves the speaker adaptation quality. Experimental results verify the superiority of our model in pitch generation and voice style transfer performance; our model also achieves a character error rate (CER) of 0.83% and an equal error rate (EER) of 3.29% in zero-shot VC scenarios.
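
    The two metrics reported at the end of the abstract can be illustrated with a minimal sketch. CER is the character-level edit distance between a reference transcript and a recognized transcript, divided by the reference length; EER is the operating point at which the false acceptance rate and false rejection rate of a speaker verification system are equal. This record does not say which speech recognizer, speaker verification model, or scoring tool the authors used, so the function names (`edit_distance`, `cer`, `eer`) and the toy inputs below are assumptions made for illustration only, not the paper's evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def eer(target_scores, nontarget_scores) -> float:
    """Approximate equal error rate by sweeping thresholds over observed scores.

    A trial is accepted when its score is >= the threshold. The value returned
    is the mean of the false acceptance and false rejection rates at the
    threshold where the two are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy inputs (hypothetical): one transcript pair and a handful of trial scores.
toy_cer = cer("voice", "voise")
toy_eer = eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

    For converted speech, lower is better for both metrics: a low CER suggests the linguistic content remains intelligible, and a low EER (when target-speaker trials are scored against converted utterances) suggests the converted voice is close to the target speaker.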

    Original language: English
    Pages (from-to): 2283-2287
    Number of pages: 5
    Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
    Volume: 2023-August
    DOIs
    Publication status: Published - 2023
    Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
    Duration: 2023 Aug 20 to 2023 Aug 24

    Bibliographical note

    Publisher Copyright:
    © 2023 International Speech Communication Association. All rights reserved.

    Keywords

    • diffusion models
    • pitch generation
    • speech restoration
    • voice conversion
    • zero-shot style transfer

    ASJC Scopus subject areas

    • Language and Linguistics
    • Human-Computer Interaction
    • Signal Processing
    • Software
    • Modelling and Simulation
