Audio and visual modal data are essential to precise investigation in many fields. In some situations visual data is difficult to obtain while auditory data is readily available; generating visual data from audio data is then very helpful. This paper proposes a novel audio-to-visual cross-modal generation approach: a sound encoder extracts features from the auditory data, and a generative model produces images conditioned on those audio features. The model is expected to learn (i) a valid feature representation and (ii) the association between generated images and their audio inputs, so that it generates realistic, well-classified images. A new dataset, the Audio-Visual Corresponding Bird (AVC-B) dataset, was collected for this research; it contains the sounds and corresponding images of 10 bird species. Experimental results show that the proposed method generates class-appropriate images and achieves better classification results than state-of-the-art methods.
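The pipeline the abstract describes (sound encoder → audio feature vector → conditional image generator) can be sketched minimally as below. This is an illustrative assumption-laden toy, not the authors' architecture: the encoder is a pooled log-magnitude spectrum, the generator is a single random dense layer, and all sizes (64 features, 16-dimensional noise, 32×32 images) are made up for the example.

```python
# Hedged sketch of an audio-to-image conditional pipeline: a sound encoder
# maps a waveform to a fixed-size feature vector, and a conditional generator
# maps (audio features, noise) to an image. All layer shapes and names are
# illustrative assumptions, not the method proposed in the paper.
import numpy as np

rng = np.random.default_rng(0)

def sound_encoder(waveform, n_features=64):
    """Toy encoder: log-magnitude spectrum average-pooled into n_features bins."""
    spectrum = np.abs(np.fft.rfft(waveform))
    log_spec = np.log1p(spectrum)
    bins = np.array_split(log_spec, n_features)
    return np.array([b.mean() for b in bins])

def conditional_generator(audio_feat, noise, img_shape=(32, 32)):
    """Toy generator: one untrained dense layer from [features; noise] to pixels."""
    z = np.concatenate([audio_feat, noise])
    w = rng.standard_normal((z.size, img_shape[0] * img_shape[1])) * 0.1
    img = np.tanh(z @ w)  # tanh keeps pixels in [-1, 1], common for GAN outputs
    return img.reshape(img_shape)

# One forward pass: 1 second of random "audio" at 16 kHz -> a 32x32 image.
waveform = rng.standard_normal(16000)
features = sound_encoder(waveform)                      # shape (64,)
image = conditional_generator(features, rng.standard_normal(16))
print(image.shape)  # (32, 32)
```

In a real conditional GAN, the dense layer would be replaced by a trained deconvolutional generator, and a discriminator would score (image, audio-feature) pairs so the generator learns the audio-image association.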
Bibliographical note
Funding Information:
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education under Grant NRF-2016R1D1A1B04933156.
© 2013 IEEE.
- Cross-modal generation
- Conditional GANs
- Feature representation
- Generative adversarial networks (GANs)