Unsupervised Video Anomaly Detection Based on Similarity with Predefined Text Descriptions

Jaehyun Kim, Seongwook Yoon, Taehyeon Choi, Sanghoon Sull

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Research on video anomaly detection has mainly been based on video data. However, many real-world cases involve users who can conceive potential normal and abnormal situations within the anomaly detection domain. This domain knowledge can be conveniently expressed as text descriptions, such as “walking” or “people fighting”, which can be easily obtained, customized for specific applications, and applied to unseen abnormal videos not included in the training dataset. We explore the potential of using these text descriptions with unlabeled video datasets. We use large language models to obtain text descriptions and leverage them to detect abnormal frames by calculating the cosine similarity between the input frame and the text descriptions using the CLIP vision-language model. To enhance performance, we refine the CLIP-derived cosine similarity using an unlabeled dataset and the proposed text-conditional similarity, a similarity measure between two vectors based on additional learnable parameters and a triplet loss. The proposed method has a simple training and inference process that avoids the computationally intensive analysis of optical flow or multiple frames. The experimental results demonstrate that the proposed method outperforms unsupervised methods, showing 8% and 13% better AUC scores on the ShanghaiTech and UCF-Crime datasets, respectively. Although the proposed method trails weakly supervised methods by 6% and 5% on those datasets, on abnormal videos it achieves 17% and 5% better AUC scores, indicating that the proposed method is comparable to weakly supervised methods that require resource-intensive dataset labeling. These outcomes validate the potential of using text descriptions in unsupervised video anomaly detection.
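The core scoring idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: in the paper, frame and text embeddings come from CLIP and the cosine similarity is further refined by a learned text-conditional similarity; here, plain cosine similarity over toy vectors stands in, and the scoring rule (maximum similarity to abnormal descriptions minus maximum similarity to normal descriptions) is an assumed simplification.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def anomaly_score(frame_emb, normal_text_embs, abnormal_text_embs) -> float:
    """Score a frame: positive when it matches abnormal descriptions
    (e.g. "people fighting") more closely than normal ones (e.g. "walking")."""
    s_abnormal = max(cosine_similarity(frame_emb, t) for t in abnormal_text_embs)
    s_normal = max(cosine_similarity(frame_emb, t) for t in normal_text_embs)
    return s_abnormal - s_normal

# Toy 3-d embeddings standing in for CLIP outputs (hypothetical values).
normal_texts = [np.array([1.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0])]
abnormal_texts = [np.array([0.0, 1.0, 0.0])]

frame_like_normal = np.array([0.95, 0.05, 0.0])
frame_like_abnormal = np.array([0.10, 0.90, 0.0])

print(anomaly_score(frame_like_normal, normal_texts, abnormal_texts))    # negative → normal
print(anomaly_score(frame_like_abnormal, normal_texts, abnormal_texts))  # positive → abnormal
```

In practice, the frame embedding would come from CLIP's image encoder and the text embeddings from its text encoder, with anomaly flagged when the score exceeds a threshold.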

Original language: English
Article number: 6256
Journal: Sensors
Volume: 23
Issue number: 14
DOIs
Publication status: Published - July 2023

Bibliographical note

Publisher Copyright:
© 2023 by the authors.

Keywords

  • CLIP
  • abnormal video
  • embedding space
  • fine-tuning of pre-trained models
  • large language models
  • large vision and language models
  • similarity measure
  • text descriptions
  • unsupervised video anomaly detection

ASJC Scopus subject areas

  • Analytical Chemistry
  • Information Systems
  • Atomic and Molecular Physics, and Optics
  • Biochemistry
  • Instrumentation
  • Electrical and Electronic Engineering
