Human-Object Interaction (HOI) detection is a task of identifying “a set of interactions” in an image, which involves the i) localization of the subject (i.e., humans) and target (i.e., objects) of interaction, and ii) the classification of the interaction labels. Most existing methods have indirectly addressed this task by detecting human and object instances and individually inferring every pair of the detected instances. In this paper, we present a novel framework, referred by HOTR, which directly predicts a set of 〈human, object, interaction〉 triplets from an image based on a transformer encoder-decoder architecture. Through the set prediction, our method effectively exploits the inherent semantic relationships in an image and does not require time-consuming post-processing which is the main bottleneck of existing methods. Our proposed algorithm achieves the state-of-the-art performance in two HOI detection benchmarks with an inference time under 1 ms after object detection.
|Title of host publication||Proceedings - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021|
|Publisher||IEEE Computer Society|
|Number of pages||10|
|Publication status||Published - 2021|
|Event||2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021 - Virtual, Online, United States|
Duration: 2021 Jun 19 → 2021 Jun 25
|Name||Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition|
|Conference||2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021|
|Period||21/6/19 → 21/6/25|
Bibliographical noteFunding Information:
Acknowledgments. This research was partly supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No.2021-0-00025, Development of Integrated Cognitive Drone AI for Disaster/Emergency Situations), (IITP-2021-0-01819, the ICT Creative Consilience program), and National Research Foundation of Korea (NRF2020R1A2C3010638, NRF-2016M3A9A7916996).
© 2021 IEEE
ASJC Scopus subject areas
- Computer Vision and Pattern Recognition