Abstract
Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, compromising the ability to reflect the nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use the arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics. The experimental results demonstrate the model's ability to control emotional style and intensity with high-quality expressive speech.
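The Cartesian-spherical transformation named in the abstract can be illustrated with a minimal sketch. The abstract does not give the details, so everything specific here is an assumption made for illustration: the axis order (arousal, valence, dominance mapped to x, y, z), the choice of a neutral `center` point, the function names, and the reading that the radius stands for emotion intensity while the two angles stand for emotional style.

```python
import math


def cartesian_to_spherical(avd, center=(0.0, 0.0, 0.0)):
    """Map an (arousal, valence, dominance) point to (r, theta, phi).

    Assumption: `center` is a hypothetical neutral-emotion reference point;
    the abstract does not say how the origin is chosen.
    """
    x, y, z = (p - c for p, c in zip(avd, center))
    r = math.sqrt(x * x + y * y + z * z)  # distance from center (read here as intensity)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # angles are undefined at the center; use zeros
    # Clamp to guard against rounding pushing the ratio just outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, z / r)))  # polar angle in [0, pi]
    phi = math.atan2(y, x)  # azimuth in [-pi, pi]
    return r, theta, phi


def spherical_to_cartesian(r, theta, phi, center=(0.0, 0.0, 0.0)):
    """Inverse mapping, returning an (arousal, valence, dominance) tuple."""
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)
    return (x + center[0], y + center[1], z + center[2])
```

Under these assumptions, scaling `r` while holding `theta` and `phi` fixed moves a point along a ray from the center, which is one way to picture changing intensity without changing style.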
| Original language | English |
|---|---|
| Pages (from-to) | 1810-1814 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| DOIs | |
| Publication status | Published - 2024 |
| Event | 25th Interspeech Conference 2024 - Kos Island, Greece. Duration: 2024 Sept 1 → 2024 Sept 5 |
Bibliographical note
Publisher Copyright: © 2024 International Speech Communication Association. All rights reserved.
Keywords
- emotional style and intensity control
- expressive emotional speech synthesis
- Text-to-speech
ASJC Scopus subject areas
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Software
- Modelling and Simulation
Fingerprint
Dive into the research topics of 'EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech'. Together they form a unique fingerprint.