Abstract
Recent speaker verification studies have achieved notable success by leveraging layer-wise outputs from pre-trained Transformer models. However, few have explored advances in aggregating these multi-level features beyond a static weighted average. We present Layer Attentive Pooling (LAP), a novel strategy for aggregating inter-layer representations from pre-trained speech models for speaker verification. LAP dynamically assesses the significance of each layer over time and from multiple perspectives, and employs max pooling instead of averaging. Additionally, we propose a lightweight backend speaker model comprising LAP and Attentive Statistical Temporal Pooling (ASTP) to extract speaker embeddings from pre-trained model outputs. Experiments on the VoxCeleb benchmark reveal that our compact architecture achieves state-of-the-art performance while greatly reducing training time. We further analyze the design of LAP and its dynamic weighting mechanism for capturing speaker characteristics.
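The abstract's pipeline (time-dynamic layer weighting with max pooling, followed by attentive statistical pooling over time) can be sketched roughly as below. This is a minimal illustrative sketch, not the paper's implementation: the single scoring vectors `layer_proj` and `time_proj` are hypothetical stand-ins for the paper's multi-perspective scorers, and all shapes are assumptions.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_attentive_pooling(hidden, layer_proj):
    """LAP-style sketch: time-dynamic layer weights, then max pooling.

    hidden:     (L, T, D) outputs of L pre-trained Transformer layers
                over T frames (shapes are illustrative assumptions).
    layer_proj: (D,) hypothetical scoring vector standing in for the
                paper's multi-perspective layer scorer.
    Returns a (T, D) frame-level feature sequence.
    """
    scores = hidden @ layer_proj          # (L, T): one score per layer/frame
    attn = softmax(scores, axis=0)        # normalize across layers per frame
    weighted = hidden * attn[..., None]   # re-weight each layer's features
    return weighted.max(axis=0)           # max pool over layers, not average

def attentive_stat_pooling(frames, time_proj):
    """ASTP-style sketch: attention-weighted mean and std over time.

    frames:    (T, D) frame-level features, e.g. from LAP.
    time_proj: (D,) hypothetical attention scorer over frames.
    Returns a (2*D,) utterance-level speaker embedding.
    """
    w = softmax(frames @ time_proj, axis=0)               # (T,) frame weights
    mean = (w[:, None] * frames).sum(axis=0)              # weighted mean
    var = (w[:, None] * (frames - mean) ** 2).sum(axis=0) # weighted variance
    return np.concatenate([mean, np.sqrt(var + 1e-9)])    # mean || std
```

Chaining the two functions maps a stack of layer outputs to a single fixed-size speaker embedding, mirroring the lightweight backend described above.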
| Original language | English |
|---|---|
| Pages (from-to) | 3713-3717 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| DOIs | |
| Publication status | Published - 2025 |
| Event | 26th Interspeech Conference 2025, Rotterdam, Netherlands; duration: 2025 Aug 17 – 2025 Aug 21 |
Bibliographical note
Publisher Copyright: © 2025 International Speech Communication Association. All rights reserved.
Keywords
- fine-tuning efficiency
- multi-level features
- speaker recognition
- speaker verification
- speech pre-trained model
ASJC Scopus subject areas
- Software
- Signal Processing
- Language and Linguistics
- Modelling and Simulation
- Human-Computer Interaction
Fingerprint
Dive into the research topics of 'Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification'.