Cross-Lingual Text-to-Speech via Hierarchical Style Transfer

Sang Hoon Lee, Ha Yeong Choi, Seong Whan Lee

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper presents LIMITLESS, a cross-lingual text-to-speech system based on hierarchical style transfer that can transfer prosody and voice style separately. Building upon HierSpeech++, we use its two-stage hierarchical speech synthesis framework, consisting of text-to-vector (TTV) and vector-to-speech stages. We modify the TTV only by adding a language embedding for each language to the text representation, and we use the hierarchical speech synthesizer without modification. We train the TTV model on 7 languages and 14 speakers from the Indic languages dataset released for LIMMITS 2024, then fine-tune it on the target speakers for Tracks 1 and 2. The results show that our framework transfers voice style robustly in terms of speaker similarity.
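
The TTV modification described in the abstract, adding a per-language embedding to the text representation, might look roughly like the sketch below. This is an illustration only, not the authors' code: the table `language_table`, the helper `add_language_embedding`, the hidden size, and the choice to add the same language vector at every time step are all assumptions made for this example.

```python
import random

# Hypothetical sizes, chosen for illustration only.
NUM_LANGUAGES = 7   # the abstract mentions 7 training languages
HIDDEN_DIM = 4      # a real model would use a much larger hidden size

random.seed(0)

# One vector per language; in a trained model these would be learned,
# here they are just randomly initialised.
language_table = [
    [random.gauss(0.0, 0.02) for _ in range(HIDDEN_DIM)]
    for _ in range(NUM_LANGUAGES)
]


def add_language_embedding(text_repr, language_id):
    """Add one language vector to every time step of a text representation.

    text_repr: list of T vectors, each of length HIDDEN_DIM.
    language_id: integer index into language_table (assumed ID scheme).
    """
    lang_vec = language_table[language_id]
    return [
        [x + l for x, l in zip(frame, lang_vec)]
        for frame in text_repr
    ]


# Dummy text-encoder output: 3 time steps of zeros.
text_repr = [[0.0] * HIDDEN_DIM for _ in range(3)]
conditioned = add_language_embedding(text_repr, language_id=2)
```

Because the conditioning is a plain element-wise addition, the shape of the text representation is unchanged, which is consistent with the abstract's statement that the downstream speech synthesizer is used without modification.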

Original language: English
Title of host publication: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 25-26
Number of pages: 2
ISBN (Electronic): 9798350374513
Publication status: Published - 2024
Event: 49th IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Seoul, Korea, Republic of
Duration: 2024 Apr 14 → 2024 Apr 19

Publication series

Name: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings

Conference

Conference: 49th IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024
Country/Territory: Korea, Republic of
City: Seoul
Period: 24/4/14 → 24/4/19

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Keywords

  • Cross-lingual TTS
  • Multi-lingual TTS

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Signal Processing
  • Media Technology
  • Acoustics and Ultrasonics
