HAETAE: In-domain Table Pretraining with Header Anchoring

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Understanding structured table data with language models is crucial for various downstream tasks in information retrieval. However, transformer-based table embedding models struggle to consistently represent headers across varying entity contexts. This inconsistency undermines the generalizability of embeddings across in-domain tables that share universal semantics. To address this gap, we propose a novel pretraining method for in-domain tables, HAETAE, that explicitly separates header embeddings from contextual entity embeddings. Our method introduces a dedicated header encoder and learnable alignment mechanisms, built upon header-aware serialization. Experimental results demonstrate that HAETAE enhances generalization and stability in predicting headers and values of in-domain tables, achieving higher accuracy than baselines while showing the notable potential of knowledge transfer in cross-domain tables. The source code of HAETAE is available at https://github.com/woojoonjung/HAETAE.

Original language: English
Title of host publication: SIGIR 2025 - Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 3065-3069
Number of pages: 5
ISBN (Electronic): 9798400715921
Publication status: Published - 2025 Jul 13
Event: 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025 - Padua, Italy
Duration: 2025 Jul 13 to 2025 Jul 18

Publication series

Name: SIGIR 2025 - Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference: 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025
Country/Territory: Italy
City: Padua
Period: 25/7/13 to 25/7/18

Bibliographical note

Publisher Copyright:
© 2025 Copyright held by the owner/author(s).

Keywords

  • Header anchoring
  • Table embedding
  • Table pretraining

ASJC Scopus subject areas

  • Information Systems
  • Software
