REVISE: A Framework for Revising OCRed text in Practical Information Systems with Data Contamination Strategy

  • Gyuho Shim
  • Seongtae Hong
  • Heuiseok Lim*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Recent advances in Large Language Models (LLMs) have significantly improved the field of Document AI, demonstrating remarkable performance on document understanding tasks such as question answering. However, existing approaches primarily focus on solving specific tasks, lacking the capability to structurally organize and manage document information. To address this limitation, we propose REVISE, a framework that systematically corrects errors introduced by OCR at the character, word, and structural levels. Specifically, REVISE employs a comprehensive hierarchical taxonomy of common OCR errors and a synthetic data generation strategy that realistically simulates such errors to train an effective correction model. Experimental results demonstrate that REVISE effectively corrects OCR outputs, enabling more structured representation and systematic management of document contents. Consequently, our method significantly enhances downstream performance in document retrieval and question answering tasks, highlighting the potential to overcome the structural management limitations of existing Document AI frameworks.
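The abstract does not spell out the error taxonomy or the contamination procedure, so the following is only a minimal sketch of what generating synthetic (noisy, clean) training pairs at the three named levels could look like. The confusion table, the function names, the single shared `rate` parameter, and the choice of one noise type per level are all illustrative assumptions, not the authors' method.

```python
import random

# Hypothetical confusion table (an assumption for illustration):
# visually similar characters that OCR output may swap.
CHAR_CONFUSIONS = {"l": "1", "O": "0", "m": "rn", "e": "c", "S": "5"}


def contaminate_chars(text, rate, rng):
    """Character level: swap confusable characters for look-alikes at the given rate."""
    out = []
    for ch in text:
        if ch in CHAR_CONFUSIONS and rng.random() < rate:
            out.append(CHAR_CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


def contaminate_words(text, rate, rng):
    """Word level: merge adjacent words by dropping the space between them."""
    words = text.split(" ")
    out = [words[0]]
    for word in words[1:]:
        if rng.random() < rate:
            out[-1] += word  # merge with the previous word
        else:
            out.append(word)
    return " ".join(out)


def contaminate_structure(text, rate, rng):
    """Structural level: turn spaces into spurious line breaks at the given rate."""
    return "".join(
        "\n" if ch == " " and rng.random() < rate else ch for ch in text
    )


def make_pair(clean, rate=0.1, seed=0):
    """Return a (noisy, clean) pair; a fixed seed keeps the output reproducible."""
    rng = random.Random(seed)
    noisy = contaminate_chars(clean, rate, rng)
    noisy = contaminate_words(noisy, rate, rng)
    noisy = contaminate_structure(noisy, rate, rng)
    return noisy, clean
```

Under these assumptions, `make_pair(clean_text)` yields one training example whose input is the contaminated string and whose target is the original text; a correction model would be trained to map the former back to the latter.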

Original language: English
Title of host publication: Industry Track
Editors: Georg Rehm, Yunyao Li
Publisher: Association for Computational Linguistics (ACL)
Pages: 1423-1434
Number of pages: 12
ISBN (Electronic): 9798891762886
Publication status: Published - 2025
Event: 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 - Vienna, Austria
Duration: 2025 Jul 27 to 2025 Aug 1

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
Volume: 6
ISSN (Print): 0736-587X

Conference

Conference: 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Country/Territory: Austria
City: Vienna
Period: 25/7/27 to 25/8/1

Bibliographical note

Publisher Copyright:
©2025 Association for Computational Linguistics.

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
  • Computer Science Applications
