Audio Super-Resolution with Robust Speech Representation Learning of Masked Autoencoder

Seung Bin Kim, Sang Hoon Lee, Ha Yeong Choi, Seong Whan Lee

Research output: Contribution to journal › Article › peer-review

Abstract

This paper proposes Fre-Painter, a high-fidelity audio super-resolution system that utilizes robust speech representation learning with various masking strategies. Recently, masked autoencoders have been found to be beneficial in learning robust representations of audio for speech classification tasks. Following these studies, we leverage these representations and investigate several masking strategies for neural audio super-resolution. In this paper, we propose an upper-band masking strategy with the initialization of the mask token, which is simple but efficient for audio super-resolution. Furthermore, we propose a mix-ratio masking strategy that makes the model robust for input speech with various sampling rates. For practical applicability, we extend Fre-Painter to a text-to-speech system, which synthesizes high-resolution speech using low-resolution speech data. The experimental results demonstrate that Fre-Painter outperforms other neural audio super-resolution models.
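As a rough illustration of the upper-band masking idea mentioned in the abstract, the sketch below hides every frequency bin at or above a cutoff in a toy spectrogram. It is not the authors' implementation: the abstract does not specify the spectrogram representation, the cutoff, or how the mask token is initialized, so the function name `upper_band_mask`, the `cutoff_bin` parameter, and the constant `mask_value` standing in for the mask token are all assumptions made for illustration.

```python
from typing import List, Tuple

# Rows are frequency bins (low to high); columns are time frames.
Spectrogram = List[List[float]]


def upper_band_mask(
    spec: Spectrogram, cutoff_bin: int, mask_value: float = 0.0
) -> Tuple[Spectrogram, List[bool]]:
    """Replace every frequency bin at or above cutoff_bin with mask_value.

    Hypothetical helper: mask_value is a constant placeholder for the
    mask token, whose real initialization is not given in the abstract.
    """
    if not 0 <= cutoff_bin <= len(spec):
        raise ValueError("cutoff_bin must lie within the number of frequency bins")
    # Copy the low band unchanged and overwrite the upper band.
    masked = [
        list(row) if i < cutoff_bin else [mask_value] * len(row)
        for i, row in enumerate(spec)
    ]
    # One flag per frequency bin: True where the bin was masked.
    is_masked = [i >= cutoff_bin for i in range(len(spec))]
    return masked, is_masked


# Toy input: 4 frequency bins by 3 frames.
spec = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
]
masked, is_masked = upper_band_mask(spec, cutoff_bin=2, mask_value=-1.0)
```

In this toy setup, a model trained to reconstruct the masked rows from the visible low band would be learning a form of bandwidth extension; varying `cutoff_bin` across training examples is one plausible way to expose a model to inputs of different effective sampling rates, though the abstract does not say whether the mix-ratio strategy works this way.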

Original language: English
Pages (from-to): 1012-1022
Number of pages: 11
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Volume: 32
Publication status: Published - 2024

Bibliographical note

Publisher Copyright:
© 2014 IEEE.

Keywords

  • Audio super-resolution
  • audio synthesis
  • bandwidth extension
  • masked autoencoder
  • self-supervised learning

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering
