VATMAN: Video-Audio-Text Multimodal Abstractive Summarization with Trimodal Hierarchical Multi-head Attention

Doosan Baek, Jiho Kim, Hongchul Lee

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)


Multimodal abstractive summarization is a challenging task that aims to generate concise and informative summaries from diverse modalities, such as video, audio, and text. In this study, we propose VATMAN, a novel approach for multimodal abstractive summarization. To effectively capture the hierarchical relationships and dependencies between modalities, we introduce Trimodal Hierarchical Multi-head Attention (THMA). THMA hierarchically attends to the video, audio, and textual representations, enabling the model to distill salient information and generate cohesive and coherent summaries. VATMAN leverages state-of-the-art generative pretrained language models (GPLMs), specifically Transformer-based models, and applies hierarchical attention at the modality level, which enhances the utilization of contextual information. On the How2 dataset, the proposed VATMAN model generates summaries that are more fluent than those written by human authors, which suggests its potential for use in various industrial environments.
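
The abstract describes THMA only at a high level, so the following is a minimal sketch of one way hierarchical attention over three modalities could be arranged, not the authors' implementation. It assumes that text features first attend to audio features and that the result then attends to video features, with residual additions at each level; the fusion order, the residual connections, the omission of learned projections, and all function names (`attention`, `multi_head_attention`, `trimodal_hierarchical_attention`) are illustrative assumptions that do not come from the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_attention(query, context, num_heads):
    """Cross-attention from query to context, splitting features into heads.

    Assumption: learned query/key/value projections are omitted for brevity.
    """
    d = len(query[0])
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    h = d // num_heads
    heads = []
    for i in range(num_heads):
        sl = slice(i * h, (i + 1) * h)
        q = [row[sl] for row in query]
        kv = [row[sl] for row in context]
        heads.append(attention(q, kv, kv))
    # Concatenate the per-head outputs along the feature dimension.
    return [sum((head[t] for head in heads), []) for t in range(len(query))]


def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def trimodal_hierarchical_attention(text, audio, video, num_heads=2):
    """Hypothetical two-level fusion: text -> audio, then (text+audio) -> video."""
    # Level 1 (assumed order): text tokens attend to audio frames.
    text_audio = add(text, multi_head_attention(text, audio, num_heads))
    # Level 2 (assumed order): the fused representation attends to video frames.
    return add(text_audio, multi_head_attention(text_audio, video, num_heads))


def random_features(n, d):
    """Stand-in for encoder outputs: n positions with d features each."""
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]


random.seed(0)
text = random_features(5, 8)   # 5 text tokens
audio = random_features(7, 8)  # 7 audio frames
video = random_features(3, 8)  # 3 video frames
fused = trimodal_hierarchical_attention(text, audio, video)
print(len(fused), len(fused[0]))  # 5 8
```

In this sketch the fused output keeps one row per text token, so it could stand in for the text encoder states that a Transformer decoder attends to; whether VATMAN fuses modalities at that point is likewise an assumption.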

Original language: English
Title of host publication: ICTC 2023 - 14th International Conference on Information and Communication Technology Convergence
Subtitle of host publication: Exploring the Frontiers of ICT Innovation
Publisher: IEEE Computer Society
Number of pages: 4
ISBN (Electronic): 9798350313277
Publication status: Published - 2023
Event: 14th International Conference on Information and Communication Technology Convergence, ICTC 2023 - Jeju Island, Korea, Republic of
Duration: 2023 Oct 11 to 2023 Oct 13

Publication series

Name: International Conference on ICT Convergence
ISSN (Print): 2162-1233
ISSN (Electronic): 2162-1241


Conference: 14th International Conference on Information and Communication Technology Convergence, ICTC 2023
Country/Territory: Korea, Republic of
City: Jeju Island

Bibliographical note

Publisher Copyright:
© 2023 IEEE.


Keywords

  • Abstractive Summarization
  • Generative Pretrained Language Model
  • Transformer
  • Trimodal

ASJC Scopus subject areas

  • Information Systems
  • Computer Networks and Communications
