Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer

Wooseok Shin, Hyun Joon Park, Jin Sob Kim, Dongwon Kim, Seungjin Lee, Sung Won Han

Research output: Contribution to journal › Conference article › peer-review

Abstract

The performance of automated audio captioning (AAC) has improved considerably through transformer-based encoders and transfer learning. However, further gains are constrained by two problems: (1) a discrepancy in input patch size between the pretraining and fine-tuning steps, and (2) a lack of local-level relations between inputs and captions. In this paper, we propose a simple transfer learning scheme that, unlike previous methods, maintains the input patch size to avoid this input discrepancy. Furthermore, we propose a patch-wise keyword estimation branch that uses an attention pooling method to effectively represent both global- and local-level information. Results on the AudioCaps dataset reveal that the proposed learning scheme and method contribute considerably to performance gains. Finally, visualization results demonstrate that the proposed attention pooling method effectively detects local-level information in the AAC system.
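The patch-wise keyword estimation branch described above pairs attention pooling with a multiple-instance-learning view of the encoder's patch embeddings: each patch receives a relevance weight, and the weighted sum yields a clip-level vector for keyword prediction. As a rough illustration only (the paper's actual architecture, dimensions, and keyword vocabulary are not given here), a minimal PyTorch sketch with hypothetical names might look like this:

```python
import torch
import torch.nn as nn

class AttentionPoolingKeywordHead(nn.Module):
    """Illustrative sketch of a patch-wise keyword estimation branch.

    Each audio patch embedding gets a learned attention weight; the
    weighted sum (multiple-instance-learning style pooling) produces a
    clip-level vector from which keyword logits are predicted. All names
    and sizes are assumptions, not the authors' implementation.
    """

    def __init__(self, embed_dim: int, num_keywords: int):
        super().__init__()
        self.attn_score = nn.Linear(embed_dim, 1)          # per-patch relevance score
        self.classifier = nn.Linear(embed_dim, num_keywords)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, embed_dim) from the transformer encoder
        weights = torch.softmax(self.attn_score(patches), dim=1)  # (B, N, 1)
        pooled = (weights * patches).sum(dim=1)                   # (B, D) clip-level vector
        return self.classifier(pooled)                            # (B, num_keywords) logits

# Usage with assumed sizes: 196 patches of width 768, 300 candidate keywords.
head = AttentionPoolingKeywordHead(embed_dim=768, num_keywords=300)
logits = head(torch.randn(2, 196, 768))
print(logits.shape)  # torch.Size([2, 300])
```

Because the softmax weights make each patch's contribution explicit, this style of pooling is what permits the kind of local-level visualization the abstract reports.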

Original language: English
Pages (from-to): 2128-2132
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
Publication status: Published - 2023
Event: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 2023 Aug 20 - 2023 Aug 24

Bibliographical note

Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.

Keywords

  • attention pooling
  • audio captioning
  • multiple instance learning
  • transfer learning
  • transformer

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
