Abstract
Understanding structured table data with language models is crucial for various downstream tasks in information retrieval. However, transformer-based table embedding models struggle to consistently represent headers across varying entity contexts. This inconsistency undermines the generalizability of embeddings across in-domain tables that share universal semantics. To address this gap, we propose a novel pretraining method for in-domain tables, HAETAE, that explicitly separates header embeddings from contextual entity embeddings. Our method introduces a dedicated header encoder and learnable alignment mechanisms, built upon header-aware serialization. Experimental results demonstrate that HAETAE enhances generalization and stability in predicting headers and values of in-domain tables, achieving higher accuracy than baselines while showing the notable potential of knowledge transfer in cross-domain tables. The source code of HAETAE is available at https://github.com/woojoonjung/HAETAE.
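As an illustration of what "header-aware serialization" could look like, the following is a minimal sketch under stated assumptions, not the HAETAE implementation. The marker tokens `[HDR]` and `[VAL]`, the function name `serialize_row`, and the sample row are all hypothetical; the abstract does not specify the actual format. The sketch flattens one table row into tokens and records which positions belong to headers, so that header tokens could in principle be routed to a separate header encoder.

```python
from typing import Dict, List, Tuple

# Hypothetical marker tokens; the paper's actual serialization format is not given here.
HEADER_TOKEN = "[HDR]"
VALUE_TOKEN = "[VAL]"


def serialize_row(row: Dict[str, str]) -> Tuple[List[str], List[bool]]:
    """Flatten one table row into tokens plus a mask marking header positions."""
    tokens: List[str] = []
    is_header: List[bool] = []
    for header, value in row.items():
        header_words = header.split()
        value_words = str(value).split()
        # Header segment: marker plus whitespace-split header words.
        tokens.append(HEADER_TOKEN)
        is_header.append(True)
        tokens.extend(header_words)
        is_header.extend([True] * len(header_words))
        # Value segment: marker plus whitespace-split cell words.
        tokens.append(VALUE_TOKEN)
        is_header.append(False)
        tokens.extend(value_words)
        is_header.extend([False] * len(value_words))
    return tokens, is_header


# Assumed example row (not taken from the paper's data).
row = {"product name": "Espresso Maker", "price": "129.00"}
tokens, is_header = serialize_row(row)

# Positions flagged True are the ones a dedicated header encoder would see.
header_tokens = [t for t, h in zip(tokens, is_header) if h]
```

Keeping the mask alongside the tokens means the same header string yields the same header token sequence regardless of the cell values next to it, which is the property the abstract motivates.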
| Original language | English |
|---|---|
| Title of host publication | SIGIR 2025 - Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 3065-3069 |
| Number of pages | 5 |
| ISBN (Electronic) | 9798400715921 |
| Publication status | Published - 2025 Jul 13 |
| Event | 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025 - Padua, Italy; Duration: 2025 Jul 13 → 2025 Jul 18 |
Publication series
| Name | SIGIR 2025 - Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval |
|---|---|
Conference
| Conference | 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025 |
|---|---|
| Country/Territory | Italy |
| City | Padua |
| Period | 2025 Jul 13 → 2025 Jul 18 |
Bibliographical note
Publisher Copyright: © 2025 Copyright held by the owner/author(s).
Keywords
- Header anchoring
- Table embedding
- Table pretraining
ASJC Scopus subject areas
- Information Systems
- Software
Title
HAETAE: In-domain Table Pretraining with Header Anchoring