Abstract
Language-based object detection aims to locate target objects from complex language queries. However, current vision-language detectors often struggle to understand complex representations of visual objects (e.g., attributes, shapes, and relationships), especially under complex queries. In this paper, we first conduct a thorough analysis of current language-based detectors to identify their specific weaknesses in compositional understanding. To this end, we propose a novel comprehensive evaluation framework that leverages large language models (LLMs) to automatically categorize test cases by the type and complexity of compositionality. This analysis reveals that detectors show significant performance drops as complexity increases and fail consistently on specific types, such as spatial and numerical reasoning. To address these weaknesses, we propose multifaceted synthetic data consisting of (1) generative model-based synthetic triplets (i.e., image-text-box data) that inherit compositional knowledge from large generative models (e.g., LLMs, diffusion models); and (2) weakness-targeted synthetic descriptions designed to enhance understanding of vulnerable types such as spatial and numerical concepts. We further introduce a compositional contrastive learning method to better leverage the proposed synthetic data while mitigating the common drawbacks of synthetic data. Consequently, our models trained on the proposed multifaceted synthetic data improve significantly over existing baselines, by up to +7.1 AP on the Omnilabel benchmark and up to +8.4 AP on the D3 benchmark.
| Original language | English |
|---|---|
| Pages (from-to) | 7873-7896 |
| Number of pages | 24 |
| Journal | International Journal of Computer Vision |
| Volume | 133 |
| Issue number | 11 |
| Publication status | Published - 2025 Nov |
Bibliographical note
Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
Keywords
- Compositional Learning
- Language-based Object Detection
- Multimodal Synthetic Datasets
- Transfer Learning
ASJC Scopus subject areas
- Software
- Computer Vision and Pattern Recognition
- Artificial Intelligence
Fingerprint
Dive into the research topics of 'Learning Compositionality from Multifaceted Synthetic Data for Language-based Object Detection'. Together they form a unique fingerprint.