EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

  • Deok Hyeon Cho
  • Hyung Seok Oh
  • Seung Bin Kim
  • Sang Hoon Lee
  • Seong Whan Lee*

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

Abstract

Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, compromising the ability to reflect the nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use the arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics. The experimental results demonstrate the model's ability to control emotional style and intensity with high-quality expressive speech.
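The Cartesian-spherical transformation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cartesian_to_spherical`, the `origin` parameter, the assignment of arousal, valence, and dominance to the x, y, and z axes, and the reading of the radius as intensity and the angles as style are all assumptions made here for illustration only.

```python
import math


def cartesian_to_spherical(avd, origin=(0.0, 0.0, 0.0)):
    """Map an (arousal, valence, dominance) point to (r, theta, phi).

    `origin` is a hypothetical reference point (for example, a
    neutral-emotion center); the abstract does not say how it is chosen.
    """
    # Shift the point so that the reference point becomes the origin.
    a, v, d = (x - o for x, o in zip(avd, origin))

    # Radial distance from the origin (assumed to act as intensity).
    r = math.sqrt(a * a + v * v + d * d)
    if r == 0.0:
        # The direction is undefined at the origin; return zeros by convention.
        return 0.0, 0.0, 0.0

    # Polar angle in [0, pi], measured from the (assumed) dominance axis.
    theta = math.acos(d / r)
    # Azimuth in (-pi, pi], measured in the (assumed) arousal-valence plane.
    phi = math.atan2(v, a)
    return r, theta, phi
```

Under these assumptions, scaling a point away from the origin changes only `r` and leaves `theta` and `phi` unchanged, which is what lets a radius and a direction be adjusted independently.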

Original language: English
Pages (from-to): 1810-1814
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
Publication status: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 2024 Sept 1 to 2024 Sept 5

Bibliographical note

Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.

Keywords

  • emotional style and intensity control
  • expressive emotional speech synthesis
  • Text-to-speech

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
