VATMAN: Video-Audio-Text Multimodal Abstractive Summarization with Trimodal Hierarchical Multi-head Attention

Doosan Baek, Jiho Kim, Hongchul Lee

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    2 Citations (Scopus)

    Abstract

    Multimodal abstractive summarization is a challenging task that aims to generate concise and informative summaries from diverse modalities, such as video, audio, and text. In this study, we propose VATMAN, a novel approach for multimodal abstractive summarization. To effectively capture the hierarchical relationships and dependencies between modalities, we introduce Trimodal Hierarchical Multi-head Attention (THMA). THMA hierarchically attends to the video, audio, and textual representations, enabling the model to distill salient information and generate cohesive and coherent summaries. VATMAN leverages state-of-the-art generative pretrained language models (GPLMs), specifically Transformer-based models, and applies hierarchical attention at the modality level, which enhances the utilization of contextual information. Evaluated on the How2 dataset, the proposed VATMAN model demonstrates the ability to create more fluent summaries than those written by human authors, showcasing its potential for use in various industrial environments.
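
The abstract does not give the THMA equations, so the following is only a minimal sketch of what "hierarchical attention at the modality level" could look like, not the authors' implementation. It assumes two levels: text tokens first attend to each modality with multi-head attention, and a second attention step then weights the three resulting contexts per token. The function names, the identity projections (no learned weights), and the feature shapes are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(query, memory, num_heads):
    """Scaled dot-product attention of `query` over `memory`, split into heads.

    Assumption: learned Q/K/V projections are omitted (identity), so the
    memory serves as both keys and values. A real model would learn them.
    """
    t_q, d = query.shape
    t_m, _ = memory.shape
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    d_head = d // num_heads
    q = query.reshape(t_q, num_heads, d_head).transpose(1, 0, 2)   # (h, t_q, d_head)
    k = memory.reshape(t_m, num_heads, d_head).transpose(1, 0, 2)  # (h, t_m, d_head)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)            # (h, t_q, t_m)
    weights = softmax(scores, axis=-1)
    context = weights @ k                                          # (h, t_q, d_head)
    return context.transpose(1, 0, 2).reshape(t_q, d)


def trimodal_hierarchical_attention(text, video, audio, num_heads=4):
    """Hypothetical two-level fusion of text, video, and audio features."""
    # Level 1: each text token attends within each modality separately.
    ctx_text = multi_head_attention(text, text, num_heads)
    ctx_video = multi_head_attention(text, video, num_heads)
    ctx_audio = multi_head_attention(text, audio, num_heads)
    stacked = np.stack([ctx_text, ctx_video, ctx_audio], axis=1)   # (t, 3, d)

    # Level 2: each text token attends over the three modality contexts.
    scores = np.einsum("td,tmd->tm", text, stacked) / np.sqrt(text.shape[1])
    modality_weights = softmax(scores, axis=-1)                    # (t, 3)
    fused = np.einsum("tm,tmd->td", modality_weights, stacked)     # (t, d)
    return fused, modality_weights


# Usage with random stand-in features (5 text tokens, 7 video frames, 9 audio frames).
rng = np.random.default_rng(0)
text = rng.normal(size=(5, 16))
video = rng.normal(size=(7, 16))
audio = rng.normal(size=(9, 16))
fused, modality_weights = trimodal_hierarchical_attention(text, video, audio)
```

Each row of `modality_weights` sums to 1 and indicates how much a given text token draws on the text, video, and audio contexts in this sketch. In a Transformer-based summarizer, `fused` would then feed the decoder in place of text-only encoder states.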

    Original language: English
    Title of host publication: ICTC 2023 - 14th International Conference on Information and Communication Technology Convergence
    Subtitle of host publication: Exploring the Frontiers of ICT Innovation
    Publisher: IEEE Computer Society
    Pages: 1475-1478
    Number of pages: 4
    ISBN (Electronic): 9798350313277
    Publication status: Published - 2023
    Event: 14th International Conference on Information and Communication Technology Convergence, ICTC 2023 - Jeju Island, Korea, Republic of
    Duration: 2023 Oct 11 to 2023 Oct 13

    Publication series

    Name: International Conference on ICT Convergence
    ISSN (Print): 2162-1233
    ISSN (Electronic): 2162-1241

    Conference

    Conference: 14th International Conference on Information and Communication Technology Convergence, ICTC 2023
    Country/Territory: Korea, Republic of
    City: Jeju Island
    Period: 23/10/11 to 23/10/13

    Bibliographical note

    Publisher Copyright:
    © 2023 IEEE.

    Keywords

    • Abstractive Summarization
    • Generative Pretrained Language Model
    • Transformer
    • Trimodal

    ASJC Scopus subject areas

    • Information Systems
    • Computer Networks and Communications
