Abstract
Large Language Models (LLMs) have shown remarkable performance on a wide range of natural language understanding and generation tasks. We observe that LLMs provide effective priors for exploiting linguistic shortcuts in temporal and causal reasoning in Video Question Answering (VideoQA). However, such priors often cause suboptimal results on VideoQA by leading the model to over-rely on questions, i.e., linguistic bias, while ignoring visual content. This is also known as 'ungrounded guesses' or 'hallucinations'. To address this problem while leveraging LLMs' priors on VideoQA, we propose a novel framework, Flipped-VQA, which encourages the model to predict all combinations of the 〈V, Q, A〉 triplet by flipping the source pair and the target label to understand their complex relationships, i.e., predicting A, Q, and V given VQ, VA, and QA pairs, respectively. In this paper, we develop LLaMA-VQA by applying Flipped-VQA to LLaMA, and it outperforms both LLM-based and non-LLM-based models on five challenging VideoQA benchmarks. Furthermore, Flipped-VQA is a general framework that is applicable to various LLMs (OPT and GPT-J) and consistently improves their performance. We empirically demonstrate that Flipped-VQA not only enhances the exploitation of linguistic shortcuts but also mitigates the linguistic bias that causes incorrect answers through over-reliance on the question. Code is available at https://github.com/mlvlab/Flipped-VQA.
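The flipping idea in the abstract can be illustrated with a minimal sketch. It only shows how one 〈V, Q, A〉 triplet yields three (source pair, target) prediction tasks. The names `flipped_tasks`, `combined_loss`, `loss_fn`, and `weights` are hypothetical, the video is stood in for by a plain string, and the weighted sum is an assumption, since the abstract does not say how the three objectives are combined.

```python
from typing import Callable, Dict, Optional, Tuple

Triplet = Dict[str, str]  # keys "V", "Q", "A"; placeholder string values


def flipped_tasks(triplet: Triplet) -> Dict[str, Tuple[Dict[str, str], str]]:
    """Build the three flipped tasks: predict A from VQ, Q from VA, V from QA."""
    tasks = {}
    for target_key in ("A", "Q", "V"):
        # The source pair is the triplet with the target element held out.
        source = {k: v for k, v in triplet.items() if k != target_key}
        tasks[target_key] = (source, triplet[target_key])
    return tasks


def combined_loss(
    triplet: Triplet,
    loss_fn: Callable[[Dict[str, str], str], float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of the three per-task losses (assumed combination rule)."""
    weights = weights or {"A": 1.0, "Q": 1.0, "V": 1.0}
    total = 0.0
    for key, (source, target) in flipped_tasks(triplet).items():
        total += weights[key] * loss_fn(source, target)
    return total
```

In a real model, `loss_fn` would be replaced by a per-task loss computed by the language model; here it is only a placeholder callable.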
| Original language | English |
|---|---|
| Title of host publication | EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings |
| Editors | Houda Bouamor, Juan Pino, Kalika Bali |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 4300-4316 |
| Number of pages | 17 |
| ISBN (Electronic) | 9798891760608 |
| Publication status | Published - 2023 |
| Event | 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - Hybrid, Singapore, Singapore. Duration: 2023 Dec 6 → 2023 Dec 10 |
Publication series
| Name | EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings |
|---|---|
Conference
| Conference | 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 |
|---|---|
| Country/Territory | Singapore |
| City | Hybrid, Singapore |
| Period | 2023 Dec 6 → 2023 Dec 10 |
Bibliographical note
Publisher Copyright: © 2023 Association for Computational Linguistics.
ASJC Scopus subject areas
- Computational Theory and Mathematics
- Computer Science Applications
- Information Systems
- Linguistics and Language
Fingerprint
Dive into the research topics of 'Large Language Models are Temporal and Causal Reasoners for Video Question Answering'. Together they form a unique fingerprint.