Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification

  • Jin Sob Kim
  • Hyun Joon Park
  • Wooseok Shin
  • Sung Won Han*

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

Abstract

Recent speaker verification studies have achieved notable success by leveraging layer-wise outputs from pre-trained Transformer models. However, few have explored how to aggregate these multi-level features beyond a static weighted average. We present Layer Attentive Pooling (LAP), a novel strategy for aggregating inter-layer representations from pre-trained speech models for speaker verification. LAP dynamically assesses the significance of each layer over time and from multiple perspectives, and employs max pooling instead of averaging. Additionally, we propose a lightweight backend speaker model comprising LAP and Attentive Statistical Temporal Pooling (ASTP) to extract speaker embeddings from pre-trained model outputs. Experiments on the VoxCeleb benchmark reveal that our compact architecture achieves state-of-the-art performance while greatly reducing training time. We further analyze the LAP design and its dynamic weighting mechanism for capturing speaker characteristics.
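To make the contrast concrete, the sketch below illustrates the two aggregation schemes the abstract contrasts: a static weighted average of layer outputs versus a LAP-style scheme with per-frame layer weights followed by max pooling over layers. This is a minimal NumPy toy, not the paper's implementation; the scoring projection `P`, the single-head form, and all shapes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, D = 13, 50, 32                 # layers, frames, feature dim (toy sizes)
H = rng.standard_normal((L, T, D))   # layer-wise hidden states from a pre-trained model

# Baseline: static weighted average — one scalar weight per layer,
# shared across all frames (stand-in for learned softmax weights).
w = np.ones(L) / L
static_avg = np.einsum('l,ltd->td', w, H)          # (T, D)

# LAP-style sketch (assumed form, not the paper's exact equations):
# score each layer per frame with a hypothetical projection P,
# softmax over layers at every frame (time-dynamic weights),
# then take the max over layers instead of averaging.
P = rng.standard_normal((D, 1)) * 0.1              # hypothetical scoring projection
scores = H @ P                                     # (L, T, 1): frame-wise layer scores
alpha = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
lap_out = (alpha * H).max(axis=0)                  # (T, D)

assert static_avg.shape == lap_out.shape == (T, D)
```

Unlike the static baseline, `alpha` varies per frame, so different layers can dominate at different times; the final max over layers keeps the strongest weighted response rather than blending them.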

Original language: English
Pages (from-to): 3713-3717
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
Publication status: Published - 2025
Event: 26th Interspeech Conference 2025 - Rotterdam, Netherlands
Duration: 2025 Aug 17 - 2025 Aug 21

Bibliographical note

Publisher Copyright:
© 2025 International Speech Communication Association. All rights reserved.

Keywords

  • fine-tuning efficiency
  • multi-level features
  • speaker recognition
  • speaker verification
  • speech pre-trained model

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Language and Linguistics
  • Modelling and Simulation
  • Human-Computer Interaction

