Abstract
Recent temporal action detection models have focused on end-to-end trainable approaches to utilize the representational power of backbone networks. Despite the advantages of end-to-end training, these models still employ a small spatial resolution (e.g., 96 × 96) because of the inefficient trade-off between computational cost and spatial resolution. In this study, we argue that a simple pooling method (e.g., adaptive average pooling) acts as a bottleneck in the spatial aggregation stage, restricting representational power. To address this issue, we propose temporal-wise spatial attentive pooling (TSAP), which alleviates the bottleneck between the backbone and the detection head using a temporal-wise attention mechanism. Our approach mitigates the inefficient trade-off between spatial resolution and computational cost, thereby enhancing spatial scalability in temporal action detection. Moreover, TSAP can be adopted in previous end-to-end approaches by simply replacing their spatial pooling component. Our experiments demonstrate the essential role of spatial aggregation, and incorporating TSAP into previous end-to-end methods yields consistent improvements.
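To make the contrast between plain average pooling and attention-based spatial pooling concrete, the following is a minimal sketch in plain Python. It is not the TSAP module from the paper, whose details the abstract does not give. It assumes a generic softmax attention over spatial positions, computed independently for each time step. The names `average_pool`, `attentive_pool`, `pool_video`, and the `query` vector (standing in for learned parameters) are hypothetical.

```python
import math

def average_pool(frame):
    """Uniform mean over spatial positions.

    frame: list of N feature vectors (each of length C), one per position.
    """
    n = len(frame)
    return [sum(channel) / n for channel in zip(*frame)]

def attentive_pool(frame, query):
    """Softmax-weighted sum over spatial positions.

    Each position is scored by its dot product with `query`
    (a hypothetical stand-in for learned attention parameters).
    """
    scores = [sum(q * x for q, x in zip(query, pos)) for pos in frame]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * x for w, x in zip(weights, channel))
            for channel in zip(*frame)]

def pool_video(video, query=None):
    """Pool each time step separately, giving one C-dim vector per frame.

    video: list of T frames, each a list of N position vectors.
    With query=None this falls back to average pooling.
    """
    if query is None:
        return [average_pool(frame) for frame in video]
    return [attentive_pool(frame, query) for frame in video]

# Toy input: T=2 frames, N=2 spatial positions, C=2 channels.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [4.0, 0.0]],
]
avg = pool_video(video)               # uniform weights per frame
att = pool_video(video, [10.0, 0.0])  # weights favor high first channel
```

With a zero query every position receives the same weight, so the attentive variant reduces to average pooling. A non-zero query lets each frame emphasize different positions, which is the general idea behind replacing a fixed pooling step with an attention-based one.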
Field | Value |
---|---|
Original language | English |
Article number | 106321 |
Journal | Neural Networks |
Volume | 176 |
DOIs | |
Publication status | Published - 2024 Aug |
Bibliographical note
Publisher Copyright: © 2024 Elsevier Ltd
Keywords
- End-to-end training
- Temporal action detection
ASJC Scopus subject areas
- Cognitive Neuroscience
- Artificial Intelligence