Enhancing multi-level cross-modal interaction with false negative-aware contrastive learning for text-video retrieval

  • Eungyeop Kim
  • , Changhee Lee*
  • *Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Text-video retrieval (TVR) has become a crucial branch in multi-modal understanding tasks. Enhanced by CLIP, a well-known contrastive learning framework that connects text and image, TVR has made substantial progress, particularly in developing cross-grained methods that account for both coarse and fine granularity in text and video. Nonetheless, previous cross-grained approaches have overlooked two crucial aspects. First, they utilize text-agnostic video summaries by simply averaging frame-level embeddings, potentially failing to capture crucial frame-level information that is semantically relevant to the corresponding text. Second, these approaches employ contrastive learning that neglects the impact of false negatives containing semantically relevant information. To address the aforementioned aspects, we introduce a novel framework for TVR, referred to as X-MLNet, focusing on capturing multi-level cross-modal interactions across video and text. This is done by first incorporating cross-attention modules at various levels of granularity, ranging from fine-grained (i.e., frame/word-level) representations to coarse-grained (i.e., video/sentence-level) representations. Then, we apply a contrastive learning framework that utilizes a similarity score computed based on the multi-level cross-modal interactions, excluding potential false negatives based on intra-modal connectivity among samples. Our experiments on five real-world benchmark datasets, including MSRVTT, MSVD, LSMDC, ActivityNet, and DiDeMo, demonstrate state-of-the-art performance in both text-to-video and video-to-text retrieval tasks. Our code is available at https://github.com/celestialxevermore/X-VLNet.
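The false negative-aware contrastive learning described above can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's implementation: it shows an InfoNCE-style loss over a text-video similarity matrix in which off-diagonal pairs whose intra-modal (text-text) similarity exceeds a threshold are flagged as potential false negatives and excluded from the softmax denominator. The function names, the thresholding rule, and the hyperparameter values are hypothetical.

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log-sum-exp; -inf entries contribute exp(-inf) = 0.
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True)), axis=axis)

def false_negative_aware_infonce(sim, text_text_sim, tau=0.07, fn_threshold=0.9):
    """Sketch of a contrastive loss that excludes suspected false negatives.

    sim:           (B, B) cross-modal text-video similarity; sim[i, i] is the
                   matched (positive) pair for sample i.
    text_text_sim: (B, B) intra-modal similarity among the text samples, used
                   as a proxy for semantic relevance between "negative" pairs.
    tau:           softmax temperature (illustrative value).
    fn_threshold:  intra-modal similarity above which an off-diagonal pair is
                   treated as a potential false negative (illustrative value).
    """
    B = sim.shape[0]
    logits = sim / tau
    # Flag off-diagonal pairs with high intra-modal similarity as likely
    # false negatives; the diagonal (true positives) is never masked.
    fn_mask = (text_text_sim > fn_threshold) & ~np.eye(B, dtype=bool)
    # Excluded pairs get -inf logits, so they vanish from the denominator.
    masked = np.where(fn_mask, -np.inf, logits)
    log_prob = np.diag(masked) - _logsumexp(masked, axis=1)
    return -log_prob.mean()
```

Masking a semantically relevant "negative" removes a misleading repulsion term: the loss no longer pushes apart a text and a video that in fact describe similar content.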

Original language: English
Article number: 954
Journal: Applied Intelligence
Volume: 55
Issue number: 14
DOIs
Publication status: Published - 2025 Sept

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.

Keywords

  • Cross-modal learning
  • Multimodal contrastive learning
  • Text-video retrieval

ASJC Scopus subject areas

  • Artificial Intelligence
