HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer

Sang Hoon Lee, Ha Yeong Choi, Hyung Seok Oh, Seong Whan Lee

Research output: Contribution to journal, Conference article, peer-review


Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at https://hiervst.github.io/.

Original language: English
Pages (from-to): 4439-4443
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2023
Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 2023 Aug 20 to 2023 Aug 24

Bibliographical note

Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.


Keywords

  • self-supervised speech representation
  • voice conversion
  • voice style transfer
  • zero-shot voice conversion

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

