Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-Based Visual Relationship Detection

  • Jongha Kim
  • Jihwan Park
  • Jinyoung Park
  • Jinyoung Kim
  • Sehyung Kim
  • Hyunwoo J. Kim*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Visual Relationship Detection (VRD) has recently seen significant advances with Transformer-based architectures. However, we identify two key limitations in the conventional label assignment used to train Transformer-based VRD models, which is the process of mapping a ground-truth (GT) to a prediction. Under the conventional assignment, an 'unspecialized' query is trained, since every query is expected to detect every relation, which makes it difficult for a query to specialize in specific relations. Furthermore, a query is also insufficiently trained, since a GT is assigned to only a single prediction, so near-correct or even correct predictions are suppressed by being assigned 'no relation (Ø)' as a GT. To address these issues, we propose Groupwise Query Specialization and Quality-Aware Multi-Assignment (SpeaQ). Groupwise Query Specialization trains a 'specialized' query by dividing queries and relations into disjoint groups and directing a query in a specific query group solely toward relations in the corresponding relation group. Quality-Aware Multi-Assignment further facilitates training by assigning a GT to multiple predictions that are significantly close to the GT in terms of the subject, the object, and the relation in between. Experimental results and analyses show that SpeaQ effectively trains 'specialized' queries, which better utilize the capacity of a model, resulting in consistent performance gains with 'zero' additional inference cost across multiple VRD models and benchmarks. Code is available at https://github.com/mlvlab/SpeaQ.
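
The two ideas in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `build_group_mask` and `multi_assign`, the precomputed `quality` matrix, and the `threshold` and `max_per_gt` parameters are all hypothetical, and the greedy selection below stands in for whatever matching procedure the paper actually uses. The sketch only shows how a group mask can restrict which queries a GT may be assigned to, and how one GT can then be assigned to several sufficiently good predictions.

```python
from typing import Dict, List


def build_group_mask(
    query_groups: List[int],
    gt_relation_labels: List[int],
    relation_to_group: Dict[int, int],
) -> List[List[bool]]:
    """mask[q][g] is True when query q shares a group with GT g's relation.

    Assumption: every query and every relation class belongs to exactly
    one group, so the groups are disjoint.
    """
    return [
        [q_group == relation_to_group[rel] for rel in gt_relation_labels]
        for q_group in query_groups
    ]


def multi_assign(
    quality: List[List[float]],
    mask: List[List[bool]],
    threshold: float,
    max_per_gt: int,
) -> Dict[int, List[int]]:
    """Assign each GT to up to `max_per_gt` allowed, high-quality queries.

    quality[q][g] is a hypothetical score in [0, 1] that summarizes how
    close prediction q is to GT g (subject, object, and relation).
    Queries left unassigned would be trained toward 'no relation'.
    """
    num_queries = len(quality)
    num_gts = len(quality[0]) if quality else 0
    taken = set()  # a query receives at most one GT
    assignment: Dict[int, List[int]] = {}
    for g in range(num_gts):
        candidates = [
            q
            for q in range(num_queries)
            if mask[q][g] and quality[q][g] >= threshold and q not in taken
        ]
        # Prefer the predictions closest to this GT.
        candidates.sort(key=lambda q: quality[q][g], reverse=True)
        chosen = candidates[:max_per_gt]
        taken.update(chosen)
        assignment[g] = chosen
    return assignment


# Toy example: 4 queries in 2 groups, 3 relation classes, 2 GTs.
query_groups = [0, 0, 1, 1]
relation_to_group = {0: 0, 1: 0, 2: 1}
gt_relation_labels = [1, 2]
quality = [
    [0.90, 0.95],
    [0.80, 0.10],
    [0.99, 0.70],
    [0.20, 0.40],
]
mask = build_group_mask(query_groups, gt_relation_labels, relation_to_group)
assignment = multi_assign(quality, mask, threshold=0.6, max_per_gt=2)
print(assignment)
```

In this toy example, query 2 scores highest for the first GT but belongs to the other group, so the mask excludes it. The first GT is instead assigned to both query 0 and query 1, rather than to a single prediction.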

Original language: English
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher: IEEE Computer Society
Pages: 28160-28169
Number of pages: 10
ISBN (Electronic): 9798350353006
Publication status: Published - 2024
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: 2024 Jun 16 to 2024 Jun 22

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Country/Territory: United States
City: Seattle
Period: 24/6/16 to 24/6/22

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Keywords

  • Human-Object Interaction Detection
  • Label Assignment
  • Scene Graph Detection
  • Scene Graph Generation

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
