MAGVLT: Masked Generative Vision-and-Language Transformer

  • Sungwoong Kim*
  • , Daejin Jo
  • , Donghoon Lee
  • , Jongmin Kim
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

While generative modeling on multimodal image-text data has been actively developed with large-scale paired datasets, there have been limited attempts to generate both image and text data by a single model rather than a generation of one fixed modality conditioned on the other modality. In this paper, we explore a unified generative vision-and-language (VL) model that can produce both images and text sequences. Especially, we propose a generative VL transformer based on the non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT). In comparison to ARGVLT, the proposed MAGVLT enables bidirectional context encoding, fast decoding by parallel token predictions in an iterative refinement, and extended editing capabilities such as image and text infilling. For rigorous training of our MAGVLT with image-text pairs from scratch, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks. Moreover, we devise two additional tasks based on the step-unrolled mask prediction and the selective prediction on the mixture of two image-text pairs. Experimental results on various downstream generation tasks of VL benchmarks show that our MAGVLT outperforms ARGVLT by a large margin even with significant inference speedup. Particularly, MAGVLT achieves competitive results on both zero-shot image-to-text and text-to-image generation tasks from MS-COCO by one moderate-sized model (fewer than 500M parameters) even without the use of monomodal data and networks.

Original languageEnglish
Title of host publicationProceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
PublisherIEEE Computer Society
Pages23338-23348
Number of pages11
ISBN (Electronic)9798350301298
DOIs
Publication statusPublished - 2023
Event2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 - Vancouver, Canada
Duration: 2023 Jun 182023 Jun 22

Publication series

NameProceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume2023-June
ISSN (Print)1063-6919

Conference

Conference2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Country/TerritoryCanada
CityVancouver
Period23/6/1823/6/22

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

Keywords

  • Multi-modal learning

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition

Fingerprint

Dive into the research topics of 'MAGVLT: Masked Generative Vision-and-Language Transformer'. Together they form a unique fingerprint.

Cite this