Uncovering the Risks and Drawbacks Associated with the Use of Synthetic Data for Grammatical Error Correction

Seonmin Koo, Chanjun Park, Seolhwa Lee, Jaehyung Seo, Sugyeong Eo, Hyeonseok Moon, Heuiseok Lim

Research output: Contribution to journalArticlepeer-review

Abstract

In a Data-Centric AI paradigm, the model performance is enhanced without altering the model architecture, as evidenced by real-world and benchmark dataset demonstrations. With the advancements of large language models (LLM), it has become increasingly feasible to generate high-quality synthetic data, while considering the need to construct fully synthetic datasets for real-world data containing numerous personal information. However, in-depth validation of the solely synthetic data setting has yet to be conducted, despite the increased possibility of models trained exclusively on fully synthetic data emerging in the future. Therefore, we examined the question, 'Do data quality control techniques (known to positively impact data-centric AI) consistently aid models trained exclusively on synthetic datasets?'. To explore this query, we performed detailed analyses using synthetic datasets generated for speech recognition postprocessing using the BackTranScription (BTS) approach. Our study primarily addressed the potential adverse effects of data quality control measures (e.g., noise injection and balanced data) and training strategies in the context of synthetic-only experiments. As a result of the experiment, we observed the negative effect that the data-centric methodology drops by a maximum of 44.03 points in the fully synthetic data setting.

Original languageEnglish
Pages (from-to)95747-95756
Number of pages10
JournalIEEE Access
Volume11
DOIs
Publication statusPublished - 2023

Bibliographical note

Publisher Copyright:
© 2013 IEEE.

Keywords

  • Korean grammatical error correction
  • balanced data
  • noise injection
  • synthetic data

ASJC Scopus subject areas

  • General Computer Science
  • General Materials Science
  • General Engineering

Fingerprint

Dive into the research topics of 'Uncovering the Risks and Drawbacks Associated with the Use of Synthetic Data for Grammatical Error Correction'. Together they form a unique fingerprint.

Cite this