TY - JOUR
T1 - Cross-modal distillation with audio–text fusion for fine-grained emotion classification using BERT and Wav2vec 2.0
AU - Kim, Donghwa
AU - Kang, Pilsung
N1 - Funding Information:
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2022R1A2C2005455). This work as also supported by the Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0008691, The Competency Development Program for Industry Specialist).
Publisher Copyright:
© 2022 Elsevier B.V.
PY - 2022/9/28
Y1 - 2022/9/28
N2 - Fine-grained emotion classification for mood- and emotion-related physical-characteristics detection and its application to computer technology using biometric sensors has been extensively researched in the field of affective computing. Although text modality has achieved a considerably high performance from the perspective of sentiment analysis, which simply classifies a positive or negative label, fine-grained emotion classification requires additional information besides text. An audio feature can be adopted as the additional information as it is closely associated with text, and the characteristics of the changes in sound pulses can be employed in fine-grained emotion classification. However, the multimodal datasets related to fine-grained emotion are limited, and the scalability and efficiency are insufficient for multimodal training to be applied extensively via the self-supervised learning (Self-SL) approach, which can adequately represent modality. To address these limitations, we propose cross-modal distillation (CMD), which induces the feature spaces of student models with a few parameters while receiving those of the teacher models that can adequately express each modality based on Self-SL. The proposed CMD performs the mapping of a feature space between teacher-student models based on contrastive learning, while two attention mechanisms—cross-attention between audio and text features and self-attention for features in modality—are performed during knowledge distillation. Wav2vec 2.0 and BERT, which are already adequately trained for audio and text via Self-SL, were adopted as teacher models; audio–text transformer models were used as student models. Accordingly, the CMD-based representation learning applies a lightweight model for IEMOCAP, MELD, and CMU–MOSEI datasets with the task of multi-class emotion classification, while exhibiting better fine-grained emotion classification performance than benchmark models with a considerably low uncertainty for prediction.
AB - Fine-grained emotion classification for mood- and emotion-related physical-characteristics detection and its application to computer technology using biometric sensors has been extensively researched in the field of affective computing. Although text modality has achieved a considerably high performance from the perspective of sentiment analysis, which simply classifies a positive or negative label, fine-grained emotion classification requires additional information besides text. An audio feature can be adopted as the additional information as it is closely associated with text, and the characteristics of the changes in sound pulses can be employed in fine-grained emotion classification. However, the multimodal datasets related to fine-grained emotion are limited, and the scalability and efficiency are insufficient for multimodal training to be applied extensively via the self-supervised learning (Self-SL) approach, which can adequately represent modality. To address these limitations, we propose cross-modal distillation (CMD), which induces the feature spaces of student models with a few parameters while receiving those of the teacher models that can adequately express each modality based on Self-SL. The proposed CMD performs the mapping of a feature space between teacher-student models based on contrastive learning, while two attention mechanisms—cross-attention between audio and text features and self-attention for features in modality—are performed during knowledge distillation. Wav2vec 2.0 and BERT, which are already adequately trained for audio and text via Self-SL, were adopted as teacher models; audio–text transformer models were used as student models. Accordingly, the CMD-based representation learning applies a lightweight model for IEMOCAP, MELD, and CMU–MOSEI datasets with the task of multi-class emotion classification, while exhibiting better fine-grained emotion classification performance than benchmark models with a considerably low uncertainty for prediction.
KW - BERT
KW - Contrastive learning
KW - Knowledge distillation
KW - Multi-class emotion classification
KW - Transformer
KW - Wav2Vec 2.0
UR - http://www.scopus.com/inward/record.url?scp=85135109335&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2022.07.035
DO - 10.1016/j.neucom.2022.07.035
M3 - Article
AN - SCOPUS:85135109335
SN - 0925-2312
VL - 506
SP - 168
EP - 183
JO - Neurocomputing
JF - Neurocomputing
ER -