Empirical Analysis of Noising Scheme based Synthetic Data Generation for Automatic Post-editing

Hyeonseok Moon, Chanjun Park, Seolhwa Lee, Jaehyung Seo, Jungseob Lee, Sugyeong Eo, Heuiseok Lim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Automatic post-editing (APE) is a research field that aims to automatically correct errors in sentences produced by a machine translation system. The field is constrained by data acquisition, because no official dataset exists for most language pairs. Moreover, the amount of data is limited even for language pairs for which official data has been released, such as those in WMT. To address this problem and promote universal APE research regardless of whether APE data exists, this study proposes a method for automatically generating APE data from a parallel corpus using a noising scheme. In particular, we propose a noising scheme based on human-mimicking errors, which considers the practical correction process performed by humans. We conduct a detailed inspection to attain high performance and derive the optimal noising schemes, which show substantial effectiveness. Through this analysis, we also demonstrate that, depending on the type of noise, noising scheme-based APE data generation may lead to inferior performance. In addition, we propose a dynamic noise injection strategy that enables the model to acquire a robust error correction capability, and we demonstrate its effectiveness through comparative analysis. This study makes it possible to obtain a high-performance APE model without human-generated data and can promote universal APE research for all language pairs targeting English.
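
As a rough illustration of the idea in the abstract, the sketch below builds synthetic APE triplets (source, pseudo machine translation, post-edit) from a parallel corpus by noising the reference translation. The specific noise operations (token deletion, substitution, insertion), the noise rate `p`, and the function names `add_noise`, `make_triplet`, and `dynamic_batches` are assumptions made for illustration; they are not the noising schemes evaluated in the paper. Re-sampling the noise on every epoch is likewise only one possible reading of "dynamic noise injection".

```python
import random


def add_noise(tokens, rng, p=0.15, vocab=None):
    """Return a noised copy of `tokens` using generic token-level edits.

    Each token is deleted, substituted, or followed by an inserted token,
    each with probability p/3. These operations are illustrative only.
    """
    vocab = vocab or tokens  # hypothetical fallback: draw noise from the sentence itself
    noised = []
    for tok in tokens:
        r = rng.random()
        if r < p / 3:
            continue  # deletion: drop the token
        elif r < 2 * p / 3:
            noised.append(rng.choice(vocab))  # substitution
        elif r < p:
            noised.extend([tok, rng.choice(vocab)])  # insertion after the token
        else:
            noised.append(tok)  # keep the token unchanged
    return noised or list(tokens)  # avoid returning an empty sentence


def make_triplet(src, tgt, rng, p=0.15):
    """Build one synthetic APE triplet from a parallel sentence pair."""
    pseudo_mt = " ".join(add_noise(tgt.split(), rng, p))
    return {"src": src, "mt": pseudo_mt, "pe": tgt}


def dynamic_batches(corpus, epochs, seed=0, p=0.15):
    """Yield a freshly noised copy of the corpus for each epoch (assumed reading of 'dynamic')."""
    rng = random.Random(seed)
    for _ in range(epochs):
        yield [make_triplet(s, t, rng, p) for s, t in corpus]


if __name__ == "__main__":
    # Toy parallel corpus (invented example sentences).
    corpus = [("Ich mag Katzen sehr .", "I like cats very much .")]
    for epoch, batch in enumerate(dynamic_batches(corpus, epochs=2, p=0.5)):
        print(epoch, batch[0]["mt"], "->", batch[0]["pe"])
```

In this sketch the clean target sentence serves as the post-edit, so a model trained on such triplets learns to map the noised text back to the reference given the source.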

Original language: English
Title of host publication: 2022 Language Resources and Evaluation Conference, LREC 2022
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Jan Odijk, Stelios Piperidis
Publisher: European Language Resources Association (ELRA)
Pages: 883-891
Number of pages: 9
ISBN (Electronic): 9791095546726
Publication status: Published - 2022
Event: 13th International Conference on Language Resources and Evaluation Conference, LREC 2022 - Marseille, France
Duration: 2022 Jun 20 to 2022 Jun 25

Publication series

Name: 2022 Language Resources and Evaluation Conference, LREC 2022

Conference

Conference: 13th International Conference on Language Resources and Evaluation Conference, LREC 2022
Country/Territory: France
City: Marseille
Period: 22/6/20 to 22/6/25

Bibliographical note

Funding Information:
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01405) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021R1A6A1A03045425).

Publisher Copyright:
© European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0.

Keywords

  • Automatic Post-Editing
  • Data Generation
  • Machine Translation
  • Noise Injection

ASJC Scopus subject areas

  • Language and Linguistics
  • Library and Information Sciences
  • Linguistics and Language
  • Education
