MIDI-VOICE: EXPRESSIVE ZERO-SHOT SINGING VOICE SYNTHESIS VIA MIDI-DRIVEN PRIORS

Dong Min Byun, Sang Hoon Lee, Ji Sang Hwang, Seong Whan Lee

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Singing voice synthesis (SVS) models have recently made significant progress with generative models. However, previous SVS models inaccurately predict the prior and the fundamental frequency (F0) for unseen speakers, resulting in low-quality generated singing voices. To address these issues, we propose MIDI-Voice for expressive singing voice synthesis and robust zero-shot singing voice style transfer. We apply a MIDI-based prior to a score-based diffusion model for better singing voice style adaptation. We first generate a MIDI-driven prior from the musical score; this prior includes only note information, not speaker information, resulting in high-quality singing voice style adaptation. We also propose a DDSP-based MIDI-style prior for synthesizing a more expressive singing voice and for singing style adaptation, although it requires additional information from the audio. The experimental results show that MIDI-Voice outperforms previous models in synthesizing an expressive singing voice and achieves superior zero-shot singing voice style transfer performance.

Original language: English
Title of host publication: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 12622-12626
Number of pages: 5
ISBN (Electronic): 9798350344851
Publication status: Published - 2024
Event: 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Korea, Republic of
Duration: 2024 Apr 14 to 2024 Apr 19

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Conference

Conference: 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
Country/Territory: Korea, Republic of
City: Seoul
Period: 24/4/14 to 24/4/19

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Keywords

  • Diffusion models
  • MIDI-driven prior
  • Singing voice synthesis
  • Zero-shot singing voice synthesis

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering
