Abstract
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, making it possible to replace a fixed set of supported classes with zero-shot, open-vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models: they struggle to understand Visual Language Concepts (VLC) that go 'beyond nouns', such as the meaning of non-object words (e.g., attributes, actions, relations, and states), and they struggle with compositional reasoning, such as understanding the significance of word order in a sentence. In this work, we investigate to what extent purely synthetic data can be leveraged to teach these models to overcome such shortcomings without compromising their zero-shot capabilities. We contribute Synthetic Visual Concepts (SyViC), a million-scale synthetic dataset and data generation codebase that allows generating additional suitable data to improve the VLC understanding and compositional reasoning of VL models. Additionally, we propose a general VL finetuning strategy for effectively leveraging SyViC towards achieving these improvements. Our extensive experiments and ablations on the VL-Checklist, Winoground, and ARO benchmarks demonstrate that it is possible to adapt strong pre-trained VL models with synthetic data, significantly enhancing their VLC understanding (e.g., by 9.9% on ARO and 4.3% on VL-Checklist) with a drop of under 1% in their zero-shot accuracy.
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 20098-20108 |
| Number of pages | 11 |
| ISBN (Electronic) | 9798350307184 |
| Publication status | Published - 2023 |
| Externally published | Yes |
| Event | 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France; Duration: 2023 Oct 2 → 2023 Oct 6 |
Publication series
| Name | Proceedings of the IEEE International Conference on Computer Vision |
|---|---|
| ISSN (Print) | 1550-5499 |
Conference
| Conference | 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 |
|---|---|
| Country/Territory | France |
| City | Paris |
| Period | 2023 Oct 2 → 2023 Oct 6 |
Bibliographical note
Publisher Copyright: © 2023 IEEE.
ASJC Scopus subject areas
- Software
- Computer Vision and Pattern Recognition
Fingerprint
Dive into the research topics of 'Going Beyond Nouns With Vision & Language Models Using Synthetic Data'. Together they form a unique fingerprint.