Abstract
Recently, there has been a growing demand for conversational speech synthesis (CSS) that generates more natural speech by considering the conversational context. To address this, we introduce JELLY, a novel CSS framework that integrates emotion recognition and context reasoning for generating appropriate speech in conversation by fine-tuning a large language model (LLM) with multiple partial LoRA modules. We propose an Emotion-aware Q-former encoder, which enables the LLM to perceive emotions in speech. The encoder is trained to align speech emotions with text, utilizing datasets of emotional speech. The entire model is then fine-tuned with conversational speech data to infer emotional context for generating emotionally appropriate speech in conversation. Our experimental results demonstrate that JELLY excels in emotional context modeling, synthesizing speech that naturally aligns with conversation, while mitigating the scarcity of emotional conversational speech datasets.
| Original language | English |
|---|---|
| Title of host publication | 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Proceedings |
| Editors | Bhaskar D Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| ISBN (Electronic) | 9798350368741 |
| DOIs | |
| Publication status | Published - 2025 |
| Event | 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Hyderabad, India Duration: 2025 Apr 6 → 2025 Apr 11 |
Publication series
| Name | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
|---|---|
| ISSN (Print) | 1520-6149 |
Conference
| Conference | 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 |
|---|---|
| Country/Territory | India |
| City | Hyderabad |
| Period | 25/4/6 → 25/4/11 |
Bibliographical note
Publisher Copyright:© 2025 IEEE.
Keywords
- Conversational speech synthesis
- dialogue
- emotional context reasoning
- large language model
- low-rank adaptation
ASJC Scopus subject areas
- Software
- Signal Processing
- Electrical and Electronic Engineering
Fingerprint
Dive into the research topics of 'JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis'. Together they form a unique fingerprint.Cite this
- APA
- Standard
- Harvard
- Vancouver
- Author
- BIBTEX
- RIS