TY - JOUR
T1 - Conditioned Source Separation by Attentively Aggregating Frequency Transformations With Self-Conditioning
AU - Choi, Woosung
AU - Jeong, Yeong Seok
AU - Kim, Jinsung
AU - Chung, Jaehwa
AU - Jung, Soonyoung
AU - Reiss, Joshua D.
N1 - Funding Information:
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2021R1A6A3A03046770, 2021R1A2C2011452).
Publisher Copyright:
© 2022 Audio Engineering Society. All rights reserved.
PY - 2022/9
Y1 - 2022/9
N2 - Label-conditioned source separation extracts the target source, specified by an input symbol, from an input mixture track. A recently proposed label-conditioned source separation model, Latent Source Attentive Frequency Transformation (LaSAFT)–Gated Point-Wise Convolutional Modulation (GPoCM)–Net, introduced a block for latent source analysis called LaSAFT. Employing LaSAFT blocks, it established state-of-the-art performance on several tasks of the MUSDB18 benchmark. This paper enhances the LaSAFT block by exploiting a self-conditioning method. Whereas the existing method considers only the symbolic relationship between the target source symbol and the latent sources, ignoring audio content, the new approach also takes the audio content into account. The enhanced block computes the attention mask conditioned on both the label and the input audio feature map. It is shown that the conditioned U-Net employing the enhanced LaSAFT blocks outperforms the previous model, and that, with a slight modification, the present model can also perform audio-query-based separation.
AB - Label-conditioned source separation extracts the target source, specified by an input symbol, from an input mixture track. A recently proposed label-conditioned source separation model, Latent Source Attentive Frequency Transformation (LaSAFT)–Gated Point-Wise Convolutional Modulation (GPoCM)–Net, introduced a block for latent source analysis called LaSAFT. Employing LaSAFT blocks, it established state-of-the-art performance on several tasks of the MUSDB18 benchmark. This paper enhances the LaSAFT block by exploiting a self-conditioning method. Whereas the existing method considers only the symbolic relationship between the target source symbol and the latent sources, ignoring audio content, the new approach also takes the audio content into account. The enhanced block computes the attention mask conditioned on both the label and the input audio feature map. It is shown that the conditioned U-Net employing the enhanced LaSAFT blocks outperforms the previous model, and that, with a slight modification, the present model can also perform audio-query-based separation.
UR - http://www.scopus.com/inward/record.url?scp=85139012722&partnerID=8YFLogxK
U2 - 10.17743/jaes.2022.0030
DO - 10.17743/jaes.2022.0030
M3 - Article
AN - SCOPUS:85139012722
SN - 0004-7554
VL - 70
SP - 661
EP - 673
JO - AES: Journal of the Audio Engineering Society
JF - AES: Journal of the Audio Engineering Society
IS - 9
ER -