Automatic post-editing (APE) is a research field that aims to automatically correct errors in the translations produced by a machine translation system. Research in this field is constrained by data acquisition, because no official dataset exists for most language pairs. Moreover, the amount of data is limited even for language pairs with officially released data, such as the WMT datasets. To address this problem and promote universal APE research regardless of whether APE data exist, this study proposes a method for automatically generating APE data from a parallel corpus using a noising scheme. In particular, we propose a noising scheme based on human-mimicking errors that reflects the practical correction process performed by humans. We conduct a precise inspection to attain high performance and derive the optimal noising schemes, which show substantial effectiveness. Through this inspection, we also demonstrate that, depending on the type of noise, noising-scheme-based APE data generation may lead to inferior performance. In addition, we propose a dynamic noise injection strategy that enables the model to acquire a robust error correction capability, and we demonstrate its effectiveness through comparative analysis. This study makes it possible to obtain a high-performance APE model without human-generated data and can promote universal APE research for all language pairs targeting English.
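The general idea of generating APE data by noising a parallel corpus can be illustrated with a minimal sketch. The abstract does not specify the noise types or the injection procedure, so everything below is an assumption made for illustration: the four edit operations (delete, insert, substitute, shift), the probability `p`, and the helper names `noise_tokens`, `make_triplet`, and `dynamic_epochs` are hypothetical and are not the authors' scheme. The reference translation is treated as the post-edit (`pe`), and its noised copy stands in for the machine translation output (`mt`).

```python
import random


def noise_tokens(tokens, rng, p=0.15, vocab=None):
    """Return a noised copy of `tokens` (illustrative edit types only)."""
    vocab = vocab or tokens  # assumption: sample spurious tokens from the sentence itself
    out = []
    for tok in tokens:
        if rng.random() >= p:
            out.append(tok)  # keep the token unchanged
            continue
        op = rng.choice(["delete", "insert", "substitute", "shift"])
        if op == "delete":
            continue  # drop the token
        if op == "insert":
            out.extend([tok, rng.choice(vocab)])  # keep it and add a spurious token
        elif op == "substitute":
            out.append(rng.choice(vocab))  # replace it with a sampled token
        else:
            # shift: place the token one position to the left of where it belongs
            out.insert(max(len(out) - 1, 0), tok)
    return out


def make_triplet(src, ref, rng, p=0.15):
    """Build a synthetic (src, mt, pe) triplet from one parallel sentence pair."""
    noisy = noise_tokens(ref.split(), rng, p)
    return {"src": src, "mt": " ".join(noisy), "pe": ref}


def dynamic_epochs(pairs, epochs, seed=0, p=0.15):
    """Re-sample the noise every epoch instead of fixing it once (assumed reading
    of 'dynamic noise injection')."""
    for epoch in range(epochs):
        rng = random.Random(seed + epoch)
        yield [make_triplet(s, r, rng, p) for s, r in pairs]


if __name__ == "__main__":
    pairs = [("Ich mag Katzen .", "I like cats .")]
    for epoch, batch in enumerate(dynamic_epochs(pairs, epochs=3, p=0.5)):
        print(epoch, batch[0]["mt"], "->", batch[0]["pe"])
```

In this sketch a static dataset would call `make_triplet` once per sentence pair, whereas `dynamic_epochs` draws fresh noise at each epoch so that a model sees different corruptions of the same reference.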
|Title of host publication||2022 Language Resources and Evaluation Conference, LREC 2022|
|Editors||Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Jan Odijk, Stelios Piperidis|
|Publisher||European Language Resources Association (ELRA)|
|Number of pages||9|
|Publication status||Published - 2022|
|Event||13th International Conference on Language Resources and Evaluation Conference, LREC 2022 - Marseille, France|
Duration: 2022 Jun 20 → 2022 Jun 25
Bibliographical note
Funding Information:
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01405) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021R1A6A1A03045425).
© European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0.
- Automatic Post-Editing
- Data Generation
- Machine Translation
- Noise Injection
ASJC Scopus subject areas
- Language and Linguistics
- Library and Information Sciences
- Linguistics and Language