RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching

  • Hyun Joon Park*
  • Jeongmin Liu
  • Jin Sob Kim
  • Jeong Yeol Yang
  • Sung Won Han
  • Eunwoo Song

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

Abstract

We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, RapFlow-TTS enforces consistency in the velocity field along the FM-straightened ODE trajectory, enabling consistent synthetic quality with fewer generation steps. Additionally, we introduce techniques such as time interval scheduling and adversarial learning to further enhance the quality of the few-step synthesis. Experimental results show that RapFlow-TTS achieves high-fidelity speech synthesis with 5- and 10-fold fewer synthesis steps than conventional FM- and score-based approaches, respectively.
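
The link between a straight ODE trajectory and a small step count can be illustrated with a toy sketch. This is not the RapFlow-TTS model or its training procedure: the two velocity fields, the 2-D state, and the names `euler_sample`, `make_straight_velocity`, and `curved_velocity` are hypothetical and chosen only for illustration. The sketch assumes a fixed-step Euler solver and shows that when the velocity is constant along the path, the endpoint does not depend on the number of steps, whereas a curved path gives different endpoints for different step counts.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_straight_velocity(x0, x1):
    """Toy field (assumption): constant velocity along the line from x0 to x1."""
    v = [b - a for a, b in zip(x0, x1)]
    return lambda x, t: v


def curved_velocity(x, t):
    """Toy field (assumption): rotation, so the velocity changes along the path."""
    return [-x[1], x[0]]


start = [0.0, 0.0]
target = [1.0, 2.0]
straight = make_straight_velocity(start, target)

# Straight trajectory: 2 steps and 10 steps land on the same endpoint.
straight_2 = euler_sample(straight, start, num_steps=2)
straight_10 = euler_sample(straight, start, num_steps=10)

# Curved trajectory: the endpoint depends on how many steps are taken.
curved_2 = euler_sample(curved_velocity, [1.0, 0.0], num_steps=2)
curved_10 = euler_sample(curved_velocity, [1.0, 0.0], num_steps=10)

print("straight:", straight_2, straight_10)
print("curved:  ", curved_2, curved_10)
```

In this toy setting the step count only matters for the curved field, which is the intuition behind seeking a velocity field that stays consistent along the trajectory.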

Original language: English
Pages (from-to): 2440-2444
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2025
Event: 26th Interspeech Conference 2025 - Rotterdam, Netherlands
Duration: 2025 Aug 17 to 2025 Aug 21

Bibliographical note

Publisher Copyright:
© 2025 International Speech Communication Association. All rights reserved.

Keywords

  • adversarial learning
  • consistency model
  • flow matching
  • rapid
  • text-to-speech

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Language and Linguistics
  • Modelling and Simulation
  • Human-Computer Interaction
