Abstract
Singing voice synthesis (SVS) has recently made significant progress with generative models. However, previous SVS models inaccurately predict the prior and the fundamental frequency (F0) for unseen speakers, resulting in low-quality generated singing voices. To address these issues, we propose MIDI-Voice for expressive singing voice synthesis and robust zero-shot singing voice style transfer. We apply a MIDI-based prior to a score-based diffusion model for better singing voice style adaptation. We first generate a MIDI-driven prior from the musical score; because this prior contains only note information and no speaker information, it yields high-quality singing voice style adaptation. We also propose a DDSP-based MIDI-style prior for synthesizing a more expressive singing voice and for singing style adaptation, although it requires additional information from the audio. Experimental results show that MIDI-Voice outperforms previous models in synthesizing expressive singing voices and achieves superior zero-shot singing voice style transfer performance.
Original language | English |
---|---|
Title of host publication | 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 12622-12626 |
Number of pages | 5 |
ISBN (Electronic) | 9798350344851 |
Publication status | Published - 2024 |
Event | 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Korea, Republic of; Duration: 2024 Apr 14 → 2024 Apr 19 |
Publication series
Name | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
---|---|
ISSN (Print) | 1520-6149 |
Conference
Conference | 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 |
---|---|
Country/Territory | Korea, Republic of |
City | Seoul |
Period | 2024 Apr 14 → 2024 Apr 19 |
Bibliographical note
Publisher Copyright: © 2024 IEEE.
Keywords
- Diffusion models
- MIDI-driven prior
- Singing voice synthesis
- Zero-shot singing voice synthesis
ASJC Scopus subject areas
- Software
- Signal Processing
- Electrical and Electronic Engineering