HierSpeech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-supervised Representations for Speech Synthesis

  • Sang Hoon Lee
  • Seung Bin Kim
  • Ji Hyun Lee
  • Eunwoo Song
  • Min Jae Hwang
  • Seong Whan Lee*

    *Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    This paper presents HierSpeech, a high-quality end-to-end text-to-speech (TTS) system based on a hierarchical conditional variational autoencoder (VAE) that utilizes self-supervised speech representations. Recently, single-stage TTS systems, which generate a raw speech waveform directly from text, have attracted interest thanks to their ability to produce high-quality audio within a fully end-to-end training pipeline. However, there is still room for improvement in conventional TTS systems. Because it is challenging to infer both linguistic and acoustic attributes directly from text, details of these attributes, particularly linguistic information, are inevitably lost, resulting in mispronunciation and over-smoothing in the synthesized speech. To address this problem, we leverage self-supervised speech representations as additional linguistic representations to bridge the information gap between text and speech. A hierarchical conditional VAE is then adopted to connect these representations and to learn each attribute hierarchically, improving the linguistic capability of the latent representations. Compared with the state-of-the-art TTS system, HierSpeech achieves a +0.303 comparative mean opinion score and reduces the phoneme error rate of synthesized speech from 9.16% to 5.78% on the VCTK dataset. Furthermore, we extend our model to HierSpeech-U, an untranscribed text-to-speech system. Specifically, HierSpeech-U can adapt to a novel speaker using self-supervised speech representations without text transcripts. The experimental results show that our method outperforms publicly available TTS models and demonstrate the effectiveness of speaker adaptation with untranscribed speech.
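    As a rough illustration of the hierarchical conditional VAE described in the abstract — a toy numpy sketch, not the authors' implementation — the snippet below shows the two-level latent structure: a linguistic latent whose posterior is inferred from self-supervised speech features and whose prior comes from text, and an acoustic latent whose prior is conditioned on the linguistic latent. All layer shapes and the random linear maps are hypothetical placeholders for illustration only.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def linear(x, w, b):
        """A stand-in for a learned layer (here just a random linear map)."""
        return x @ w + b

    def kl_gauss(mu_q, logvar_q, mu_p, logvar_p):
        """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over dimensions."""
        return 0.5 * np.sum(
            logvar_p - logvar_q
            + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
            - 1.0
        )

    # Hypothetical feature/latent sizes (illustrative only).
    d_text, d_ssl, d_z = 8, 16, 4

    W_prior = rng.normal(size=(d_text, 2 * d_z)); b_prior = np.zeros(2 * d_z)
    W_post  = rng.normal(size=(d_ssl, 2 * d_z));  b_post  = np.zeros(2 * d_z)
    W_cond  = rng.normal(size=(d_z, 2 * d_z));    b_cond  = np.zeros(2 * d_z)

    text_feat = rng.normal(size=d_text)  # stand-in for a text-encoder output
    ssl_feat  = rng.normal(size=d_ssl)   # stand-in for a self-supervised speech feature

    # Level 1: linguistic latent. Prior from text, posterior from SSL features,
    # so the latent is pushed to carry the linguistic information text alone misses.
    mu_p1, lv_p1 = np.split(linear(text_feat, W_prior, b_prior), 2)
    mu_q1, lv_q1 = np.split(linear(ssl_feat, W_post, b_post), 2)
    z_ling = mu_q1 + np.exp(0.5 * lv_q1) * rng.normal(size=d_z)  # reparameterization

    # Level 2: acoustic latent, with its prior conditioned on the linguistic
    # latent — this conditioning is what makes the VAE hierarchical.
    mu_p2, lv_p2 = np.split(linear(z_ling, W_cond, b_cond), 2)

    kl_ling = kl_gauss(mu_q1, lv_q1, mu_p1, lv_p1)
    ```

    In training, the KL term above (plus the corresponding acoustic-level KL and a waveform reconstruction loss) would form the evidence lower bound; here it is only computed to show where the text-to-speech information gap is penalized.
    
    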

    Original language: English
    Title of host publication: Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
    Editors: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh
    Publisher: Neural information processing systems foundation
    ISBN (Electronic): 9781713871088
    Publication status: Published - 2022
    Event: 36th Conference on Neural Information Processing Systems, NeurIPS 2022 - New Orleans, United States
    Duration: 2022 Nov 28 - 2022 Dec 9

    Publication series

    Name: Advances in Neural Information Processing Systems
    Volume: 35
    ISSN (Print): 1049-5258

    Conference

    Conference: 36th Conference on Neural Information Processing Systems, NeurIPS 2022
    Country/Territory: United States
    City: New Orleans
    Period: 22/11/28 - 22/12/9

    Bibliographical note

    Funding Information:
    This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University); No. 2019-0-01371, Development of Brain-inspired AI with Human-like Intelligence; No. 2021-0-02068, Artificial Intelligence Innovation Hub; and No. 2022-0-00984, Development of Artificial Intelligence Technology for Personalized Plug-and-Play Explanation and Verification of Explanation) and by Clova Voice, NAVER Corp., Seongnam, Korea.

    Publisher Copyright:
    © 2022 Neural information processing systems foundation. All rights reserved.

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Information Systems
    • Signal Processing
