Abstract
In this work, we empirically find that the Vision Transformer (ViT) fails to extract object-centric features when applied to out-of-distribution (OOD) detection. To obtain object-centric attention, we design an additional module that performs cross-attention between class-wise token proxies and the feature token sequence of an input image. For inference suited to this cross-attention structure with multiple class-wise token proxies, we propose a score ensemble that can be applied to any scoring function. Compared to the plain ViT, the proposed inference scheme achieves superior performance by synergizing with our cross-attention structure. Through experiments, we demonstrate that the proposed cross-attention structure with score-ensemble inference substantially improves near-OOD detection: the FPR95 improvement over the state-of-the-art method is 2.55% on CIFAR-10 and 2.67% on CIFAR-100, while maintaining competitive classification accuracy.
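The abstract does not give implementation details, but the two ideas (class-wise cross-attention and a score ensemble over class-wise proxies) can be sketched as follows. This is a minimal, hypothetical PyTorch illustration under assumptions not stated in the paper: the proxies are learnable tokens (one per class) used as queries against the ViT patch tokens, and the ensemble combines per-class OOD scores by averaging; the actual module design and combination rule may differ.

```python
import torch
import torch.nn as nn


class ClassWiseCrossAttention(nn.Module):
    """Sketch: cross-attention between learnable class-wise token proxies
    (queries) and the ViT feature token sequence (keys/values)."""

    def __init__(self, num_classes: int, dim: int, num_heads: int = 8):
        super().__init__()
        # One learnable proxy token per class (an assumption about the proxy form).
        self.class_proxies = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feature_tokens: torch.Tensor) -> torch.Tensor:
        # feature_tokens: (B, N, D) patch-token sequence from a ViT backbone.
        B = feature_tokens.size(0)
        queries = self.class_proxies.unsqueeze(0).expand(B, -1, -1)  # (B, C, D)
        attended, _ = self.cross_attn(queries, feature_tokens, feature_tokens)
        return self.norm(attended)  # (B, C, D): one attended feature per class proxy


def score_ensemble(class_wise_features: torch.Tensor, score_fn) -> torch.Tensor:
    """Apply an arbitrary OOD scoring function to each class-wise feature and
    ensemble the per-class scores (mean is assumed here for illustration)."""
    # class_wise_features: (B, C, D); score_fn maps (B, D) -> (B,)
    per_class_scores = torch.stack(
        [score_fn(class_wise_features[:, c]) for c in range(class_wise_features.size(1))],
        dim=1,
    )  # (B, C)
    return per_class_scores.mean(dim=1)  # ensembled OOD score per sample
```

Because `score_fn` is passed in as a callable, any scoring function (e.g., an energy- or distance-based score computed from the per-class features) can be plugged into the ensemble, which matches the abstract's claim that the ensemble applies to any scoring function.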
| Original language | English |
| --- | --- |
| Pages (from-to) | 62793-62803 |
| Number of pages | 11 |
| Journal | IEEE Access |
| Volume | 12 |
| DOIs | |
| Publication status | Published - 2024 |
| Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2013 IEEE.
Keywords
- Near out-of-distribution (OOD) detection
- class-wise cross attention
- vision transformer
ASJC Scopus subject areas
- General Computer Science
- General Materials Science
- General Engineering