Abstract
Image captioning has received significant interest in recent years, and notable results have been achieved. Most previous approaches have focused on generating visual descriptions from images, whereas only a few have exploited visual descriptions for image classification. This study demonstrates that good performance can be achieved for both description generation and image classification through an end-to-end joint learning approach with a loss function that encourages the two tasks to reach a consensus. Given images and visual descriptions, the proposed model learns a multimodal intermediate embedding that represents both the textual and visual characteristics of an object. Sharing this multimodal embedding improves performance on both tasks. Through a novel loss function based on class activation mapping, which localizes the discriminative image regions used by a model, we achieve a higher score when the captioning and classification models reach a consensus on the key parts of the object. Using the proposed model, we achieved substantially improved performance on each task on the UCSD Birds and Oxford Flowers datasets.
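The abstract does not give the form of the loss, so the following is only a minimal sketch of the general idea, not the authors' implementation. It computes a standard class activation map (classifier weights applied to the last convolutional feature maps) and then scores how much two such maps disagree. The function names, the min-shift normalization, and the squared-distance penalty are all assumptions made for illustration.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Standard CAM: weight each feature map by the class's classifier weight.

    features:   (K, H, W) last-conv feature maps, before global average pooling
    fc_weights: (C, K) weights of the linear layer that follows the pooling
    """
    # Contract over the K channels, leaving an (H, W) saliency map.
    return np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))

def to_spatial_distribution(saliency, eps=1e-8):
    """Shift a map to non-negative values and scale it to sum to (about) 1."""
    shifted = saliency - saliency.min()
    return shifted / (shifted.sum() + eps)

def disagreement_loss(map_cls, map_cap):
    """Hypothetical disagreement term (an assumption, not from the paper):
    squared L2 distance between the two normalized saliency maps."""
    p = to_spatial_distribution(map_cls)
    q = to_spatial_distribution(map_cap)
    return float(np.sum((p - q) ** 2))

# Example with random stand-in data: 8 channels, 7x7 grid, 5 classes.
rng = np.random.default_rng(0)
features = rng.random((8, 7, 7))
fc_weights = rng.normal(size=(5, 8))
cam = class_activation_map(features, fc_weights, class_idx=2)
```

In a joint model, a term like this would be added to the classification and captioning losses, so that training is penalized when the two branches attend to different image regions.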
| Original language | English |
|---|---|
| Pages (from-to) | 67-77 |
| Number of pages | 11 |
| Journal | ETRI Journal |
| Volume | 42 |
| Issue number | 1 |
| Publication status | Published - 2020 Feb 1 |
Bibliographical note
Publisher Copyright: © 2019 ETRI
Keywords
- deep learning
- image captioning
- image classification
ASJC Scopus subject areas
- Electronic, Optical and Magnetic Materials
- General Computer Science
- Electrical and Electronic Engineering
Fingerprint
Dive into the research topics of 'Image classification and captioning model considering a CAM-based disagreement loss'. Together they form a unique fingerprint.