Abstract
Vision-Language Models (VLMs) have shown remarkable performance in zero-shot action recognition by learning the correlation between video embeddings and class embeddings. However, relying solely on action class names for semantic information is problematic because of multi-semantic words, i.e., words with multiple meanings. These words make it difficult for the model to accurately capture the intended concept of an action. We propose a novel approach that leverages web-crawled descriptions and uses a large language model to extract keywords from them. This method reduces the reliance on human annotators and avoids the exhaustive manual process of attribute data creation. Moreover, we introduce a spatio-temporal interaction module that focuses on objects and action units to align description attributes with video content. In zero-shot experiments, our model achieves 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, demonstrating its transferability to downstream tasks.
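To make the general idea concrete, the following is a minimal sketch of zero-shot classification by cosine similarity between a video embedding and class embeddings, and of how blending description-derived attribute embeddings into a class embedding can disambiguate a class name. It is not the paper's implementation: the function names, the `weight` and `temperature` parameters, the simple averaging scheme, and the 3-dimensional toy vectors are all assumptions chosen for illustration, and the spatio-temporal interaction module is not modeled here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def zero_shot_predict(video_emb, class_embs, temperature=0.01):
    """Return (predicted class index, softmax probabilities over classes)."""
    v = l2_normalize(video_emb)
    c = l2_normalize(class_embs)
    logits = (c @ v) / temperature      # cosine similarity per class, sharpened
    logits = logits - logits.max()      # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs

def enrich_with_attributes(class_emb, attribute_embs, weight=0.5):
    """Blend a class-name embedding with the mean of its attribute embeddings.

    Hypothetical fusion rule (simple weighted average), used only for illustration.
    """
    attr_mean = l2_normalize(attribute_embs).mean(axis=0)
    blended = (1.0 - weight) * l2_normalize(class_emb) + weight * attr_mean
    return l2_normalize(blended)

# Toy data (assumed): class 0 has an ambiguous name whose embedding leans toward
# the wrong sense (axis 0), while the video shows the other sense (axis 1).
video = np.array([0.0, 1.0, 0.1])
class_names = np.array([
    [1.0, 0.6, 0.0],   # class 0: ambiguous name embedding
    [0.0, 0.7, 0.7],   # class 1: a different action
])
# Keyword embeddings extracted from a description of class 0 (assumed values).
class0_attributes = np.array([
    [0.0, 1.0, 0.0],
    [0.1, 1.0, 0.2],
])

pred_name_only, probs_name_only = zero_shot_predict(video, class_names)

enriched = class_names.copy()
enriched[0] = enrich_with_attributes(class_names[0], class0_attributes)
pred_enriched, probs_enriched = zero_shot_predict(video, enriched)

print("name only:", pred_name_only, "with attributes:", pred_enriched)
```

In this toy setup the name-only embedding of class 0 sits closer to the wrong sense, so the video is assigned to class 1; once the attribute embeddings are blended in, class 0 moves toward the intended sense and wins. In a real system the embeddings would come from trained video and text encoders rather than hand-written vectors.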
| Original language | English |
|---|---|
| Title of host publication | Pattern Recognition and Artificial Intelligence - 4th International Conference, ICPRAI 2024, Proceedings |
| Editors | Christian Wallraven, Cheng-Lin Liu, Arun Ross |
| Publisher | Springer Science and Business Media Deutschland GmbH |
| Pages | 296-309 |
| Number of pages | 14 |
| ISBN (Print) | 9789819787043 |
| Publication status | Published - 2025 |
| Event | 4th International Conference on Pattern Recognition and Artificial Intelligence, ICPRAI 2024 - Jeju Island, Korea, Republic of. Duration: 2024 Jul 3 → 2024 Jul 6 |
Publication series
| Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
|---|---|
| Volume | 14893 LNCS |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | 4th International Conference on Pattern Recognition and Artificial Intelligence, ICPRAI 2024 |
|---|---|
| Country/Territory | Korea, Republic of |
| City | Jeju Island |
| Period | 24/7/3 → 24/7/6 |
Bibliographical note
Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
Keywords
- Action recognition
- Vision-language model
- Zero-shot transfer
ASJC Scopus subject areas
- Theoretical Computer Science
- General Computer Science
Fingerprint
Dive into the research topics of 'Description Attribute-Enhanced Spatio-Temporal Zero-Shot Action Recognition'. Together they form a unique fingerprint.