Abstract
Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at https://hiervst.github.io/.
Original language | English |
---|---|
Pages (from-to) | 4439-4443 |
Number of pages | 5 |
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Volume | 2023-August |
Publication status | Published - 2023 |
Event | 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland; Duration: 2023 Aug 20 → 2023 Aug 24 |
Bibliographical note
Publisher Copyright: © 2023 International Speech Communication Association. All rights reserved.
Keywords
- self-supervised speech representation
- voice conversion
- voice style transfer
- zero-shot voice conversion
ASJC Scopus subject areas
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Software
- Modelling and Simulation