Audio-to-Visual Cross-Modal Generation of Birds

Joo Yong Shim, Joongheon Kim, Jong Kook Kim

Research output: Contribution to journal › Article › peer-review


Audio and visual data are essential to precise investigation in many fields. In some situations, visual data is difficult to obtain while auditory data is readily available; in such cases, generating visual data from audio data is very helpful. This paper proposes a novel audio-to-visual cross-modal generation approach. The proposed sound encoder extracts features from the auditory data, and a generative model produces images from those audio features. The model is expected to learn (i) a valid feature representation and (ii) associations between generated images and audio inputs, so that it generates realistic and well-classified images. A new dataset, the Audio-Visual Corresponding Bird (AVC-B) dataset, was collected for this research; it contains the sounds and corresponding images of 10 different bird species. Experimental results show that the proposed method generates class-appropriate images and achieves better classification results than state-of-the-art methods.
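The pipeline the abstract describes, an encoder that maps audio to a feature embedding and a conditional generator that maps noise plus that embedding to an image, can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the paper's architecture: all layer sizes, class names, and the use of simple matrix layers are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

class SoundEncoder:
    """Toy stand-in for a sound encoder: maps a raw audio feature
    vector (e.g. a flattened spectrogram) to a compact embedding.
    Dimensions are illustrative assumptions, not from the paper."""
    def __init__(self, in_dim=128, emb_dim=32):
        self.w = rng.standard_normal((in_dim, emb_dim)) * 0.1

    def __call__(self, audio):
        return relu(audio @ self.w)

class ConditionalGenerator:
    """Toy conditional generator in the spirit of a conditional GAN:
    concatenates a latent noise vector with the audio embedding and
    maps the result to a flat 'image' in [-1, 1]."""
    def __init__(self, noise_dim=16, emb_dim=32, img_dim=64):
        self.w = rng.standard_normal((noise_dim + emb_dim, img_dim)) * 0.1

    def __call__(self, noise, embedding):
        return np.tanh(np.concatenate([noise, embedding]) @ self.w)

encoder = SoundEncoder()
generator = ConditionalGenerator()

audio = rng.standard_normal(128)   # stand-in audio feature vector
noise = rng.standard_normal(16)    # GAN latent noise
image = generator(noise, encoder(audio))
print(image.shape)
```

In a real conditional GAN, the generator would be trained adversarially against a discriminator that also sees the audio embedding, which is what pushes the generated images to match the input sound's class.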

Original language: English
Pages (from-to): 27719-27729
Number of pages: 11
Journal: IEEE Access
Publication status: Published - 2023

Bibliographical note

Funding Information:
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education under Grant NRF-2016R1D1A1B04933156.

Publisher Copyright:
© 2013 IEEE.


Keywords

  • Cross-modal generation
  • conditional GANs
  • feature representation
  • generative adversarial networks (GANs)

ASJC Scopus subject areas

  • General Engineering
  • General Materials Science
  • General Computer Science


