Abstract
In the data-centric AI paradigm, model performance is improved without altering the model architecture, as demonstrated on real-world and benchmark datasets. With the advancement of large language models (LLMs), generating high-quality synthetic data has become increasingly feasible, and fully synthetic datasets are an attractive substitute for real-world data that contains large amounts of personal information. However, the setting in which models are trained solely on synthetic data has not yet been validated in depth, even though such models are increasingly likely to appear in the future. We therefore examined the question, 'Do data quality control techniques, which are known to benefit data-centric AI, consistently help models trained exclusively on synthetic datasets?' To explore this question, we conducted detailed analyses on synthetic datasets generated for speech recognition postprocessing with the BackTranScription (BTS) approach. Our study focused on the potential adverse effects of data quality control measures (e.g., noise injection and data balancing) and of training strategies in synthetic-only experiments. Our experiments show that, in the fully synthetic data setting, these data-centric methods can instead be harmful, degrading performance by up to 44.03 points.
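The abstract names the BackTranScription (BTS) approach without detailing it. As a minimal illustrative sketch, assuming BTS builds pseudo-parallel data by synthesizing speech from clean text and re-transcribing it with an ASR system, the snippet below produces (noisy transcript, clean reference) pairs for training a postprocessing model. The `synthesize_speech` and `transcribe` helpers are hypothetical placeholders, not the authors' implementation.

```python
# Illustrative sketch of BTS-style synthetic data generation (assumed pipeline,
# not the authors' code). TTS and ASR backends are passed in as callables.
from typing import Callable, List, Tuple


def back_transcribe(
    clean_sentences: List[str],
    synthesize_speech: Callable[[str], bytes],  # text -> audio (TTS), hypothetical
    transcribe: Callable[[bytes], str],         # audio -> text (ASR), hypothetical
) -> List[Tuple[str, str]]:
    """Build (noisy ASR transcript, clean reference) pairs for training
    a speech recognition postprocessing / error-correction model."""
    pairs = []
    for sentence in clean_sentences:
        audio = synthesize_speech(sentence)  # clean text -> synthetic speech
        noisy = transcribe(audio)            # synthetic speech -> transcript with ASR errors
        pairs.append((noisy, sentence))      # source = noisy text, target = clean text
    return pairs


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without real TTS/ASR systems.
    fake_tts = lambda text: text.encode("utf-8")
    fake_asr = lambda audio: audio.decode("utf-8").replace(" ", "")  # simulate spacing errors
    print(back_transcribe(["데이터 중심 인공지능"], fake_tts, fake_asr))
```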
| Original language | English |
|---|---|
| Pages (from-to) | 95747-95756 |
| Number of pages | 10 |
| Journal | IEEE Access |
| Volume | 11 |
| DOIs | |
| Publication status | Published - 2023 |
Bibliographical note
Publisher Copyright: © 2013 IEEE.
Keywords
- Korean grammatical error correction
- balanced data
- noise injection
- synthetic data
ASJC Scopus subject areas
- General Computer Science
- General Materials Science
- General Engineering