TY - JOUR
T1 - Audio dequantization for high fidelity audio generation in flow-based neural vocoder
AU - Yoon, Hyun Wook
AU - Lee, Sang Hoon
AU - Noh, Hyeong Rae
AU - Lee, Seong Whan
N1 - Funding Information:
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Department of Artificial Intelligence, Korea University), the Magellan Division of Netmarble Corporation, and the Seoul R&BD Program (CY190019).
Publisher Copyright:
Copyright © 2020 ISCA
PY - 2020
Y1 - 2020
N2 - In recent works, flow-based neural vocoders have shown significant improvement in the real-time speech generation task. A sequence of invertible flow operations allows the model to convert samples from a simple distribution into audio samples. However, training a continuous density model on discrete audio data can degrade model performance due to the topological difference between the latent and actual distributions. To resolve this problem, we propose audio dequantization methods in a flow-based neural vocoder for high fidelity audio generation. Data dequantization is a well-known method in image generation but has not yet been studied in the audio domain. For this reason, we implement various audio dequantization methods in a flow-based neural vocoder and investigate their effect on the generated audio. We conduct various objective performance assessments and subjective evaluations to show that audio dequantization can improve audio generation quality. In our experiments, audio dequantization produces waveform audio with a better harmonic structure and fewer digital artifacts.
AB - In recent works, flow-based neural vocoders have shown significant improvement in the real-time speech generation task. A sequence of invertible flow operations allows the model to convert samples from a simple distribution into audio samples. However, training a continuous density model on discrete audio data can degrade model performance due to the topological difference between the latent and actual distributions. To resolve this problem, we propose audio dequantization methods in a flow-based neural vocoder for high fidelity audio generation. Data dequantization is a well-known method in image generation but has not yet been studied in the audio domain. For this reason, we implement various audio dequantization methods in a flow-based neural vocoder and investigate their effect on the generated audio. We conduct various objective performance assessments and subjective evaluations to show that audio dequantization can improve audio generation quality. In our experiments, audio dequantization produces waveform audio with a better harmonic structure and fewer digital artifacts.
KW - Audio synthesis
KW - Data dequantization
KW - Deep learning
KW - Flow-based generative models
KW - Neural vocoder
UR - http://www.scopus.com/inward/record.url?scp=85098121899&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-1226
DO - 10.21437/Interspeech.2020-1226
M3 - Conference article
AN - SCOPUS:85098121899
SN - 2308-457X
VL - 2020-October
SP - 3545
EP - 3549
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -