PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-Based Prosody Modeling

  • Ji Sang Hwang
  • Sang Hoon Lee
  • Seong Whan Lee*
  • *Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Although text-to-speech (TTS) systems have significantly improved, most TTS systems still have limitations in synthesizing speech with appropriate phrasing. For natural speech synthesis, it is important to synthesize the speech with a phrasing structure that groups words into phrases based on semantic information. In this paper, we propose PauseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling. First, we introduce a phrasing structure encoder that utilizes a context representation from the pre-trained language model. In the phrasing structure encoder, we extract a speaker-dependent syntactic representation from the context representation and then predict a pause sequence that separates the input text into phrases. Furthermore, we introduce a pause-based word encoder to model word-level prosody based on the pause sequence. Experimental results show that PauseSpeech outperforms previous models in terms of naturalness. In objective evaluations, we also observe that the proposed methods help the model decrease the distance between ground-truth and synthesized speech. Audio samples are available at https://jisang93.github.io/pausespeech-demo/.
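
As a toy illustration of what "a pause sequence that separates the input text into phrases" could look like, the sketch below groups words into phrases from a binary per-word pause label. This is not the PauseSpeech model: the function name `group_into_phrases`, the 0/1 label encoding, and the example sentence are all assumptions made here for illustration, and the labels are written by hand rather than predicted.

```python
from typing import List


def group_into_phrases(words: List[str], pauses: List[int]) -> List[List[str]]:
    """Split words into phrases.

    Assumed encoding (hypothetical, not from the paper):
    pauses[i] == 1 means a pause follows words[i]; 0 means no pause.
    """
    if len(words) != len(pauses):
        raise ValueError("words and pauses must have the same length")

    phrases: List[List[str]] = []
    current: List[str] = []
    for word, pause in zip(words, pauses):
        current.append(word)
        if pause:
            # A pause closes the current phrase.
            phrases.append(current)
            current = []
    if current:
        # Trailing words with no final pause still form a phrase.
        phrases.append(current)
    return phrases


# Hand-written example labels; a real system would predict them.
words = "when the sun went down we walked back home".split()
pauses = [0, 0, 0, 0, 1, 0, 0, 0, 0]
phrases = group_into_phrases(words, pauses)
print(phrases)
```

In a system like the one described, the labels would come from a predictor conditioned on the language-model context and the speaker, and the resulting phrase boundaries would then inform word-level prosody.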

    Original language: English
    Title of host publication: Pattern Recognition - 7th Asian Conference, ACPR 2023, Proceedings
    Editors: Huimin Lu, Michael Blumenstein, Sung-Bae Cho, Cheng-Lin Liu, Yasushi Yagi, Tohru Kamiya
    Publisher: Springer Science and Business Media Deutschland GmbH
    Pages: 415-427
    Number of pages: 13
    ISBN (Print): 9783031476334
    Publication status: Published - 2023
    Event: 7th Asian Conference on Pattern Recognition, ACPR 2023 - Kitakyushu, Japan
    Duration: 2023 Nov 5 → 2023 Nov 8

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 14406 LNCS
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 7th Asian Conference on Pattern Recognition, ACPR 2023
    Country/Territory: Japan
    City: Kitakyushu
    Period: 23/11/5 → 23/11/8

    Bibliographical note

    Publisher Copyright:
    © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.

    Keywords

    • Pause-based prosody modeling
    • Pre-trained language model
    • Text-to-speech

    ASJC Scopus subject areas

    • Theoretical Computer Science
    • General Computer Science
