Abstract
Parallel corpus filtering (PCF) aims to remove low-quality data from parallel corpora. Recently, deep learning-based methods have been employed to assess the quality of sentence pairs in a parallel corpus, alongside rule-based filtering that discards noisy data according to pre-defined error types. Despite their adoption, to the best of our knowledge, a comprehensive investigation into the practical applicability and interpretability of PCF techniques has remained unexplored. In this study, we raise two questions about deep learning-based PCF: (i) can deep learning-based PCF extract high-quality data? and (ii) are the scoring functions of PCF reliable? To answer these questions, we conduct comparative experiments on various PCF techniques with four datasets covering two language pairs, English–Korean and English–Japanese. Through the experiments, we demonstrate that the performance of deep learning-based PCF depends heavily on the target parallel corpus, and that its adaptability fluctuates with the characteristics of that corpus. In particular, we find that highly scored sentence pairs selected by a PCF technique do not necessarily guarantee high quality; rather, the scoring exhibits unintended preferences.
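To make the scoring step concrete, the sketch below shows one common form of deep learning-based PCF scoring: ranking sentence pairs by the cosine similarity of multilingual sentence embeddings. The choice of LaBSE via the sentence-transformers library, the example sentence pairs, the threshold, and the `score_pairs` helper are illustrative assumptions, not the specific systems evaluated in the paper.

```python
# A minimal sketch of a deep learning-based PCF scoring function, assuming a
# multilingual encoder (LaBSE) and cosine similarity; the techniques evaluated
# in the paper may score pairs differently.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/LaBSE")

def score_pairs(src_sents, tgt_sents):
    """Return one cross-lingual similarity score per (source, target) pair."""
    src_emb = model.encode(src_sents, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sents, normalize_embeddings=True)
    # With L2-normalized embeddings, cosine similarity is just the dot product.
    return np.sum(src_emb * tgt_emb, axis=1)

# English-Korean pairs: the first is a plausible translation, the second noise.
pairs = [
    ("The weather is nice today.", "오늘 날씨가 좋다."),
    ("The weather is nice today.", "주가가 급락했다."),
]
scores = score_pairs([s for s, _ in pairs], [t for _, t in pairs])

# Filtering keeps pairs above a threshold (0.6 here is an arbitrary choice);
# the paper's central finding is that a high score alone does not guarantee
# a high-quality pair.
keep = [p for p, s in zip(pairs, scores) if s > 0.6]
```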
| Original language | English |
| --- | --- |
| Article number | 120962 |
| Journal | Expert Systems With Applications |
| Volume | 233 |
| DOIs | |
| Publication status | Published - 2023 Dec 15 |
Bibliographical note
Funding Information: This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00368, A Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques).
Publisher Copyright:
© 2023
Keywords
- Deep learning
- Machine translation
- Natural language processing
- Parallel corpus filtering
ASJC Scopus subject areas
- General Engineering
- Computer Science Applications
- Artificial Intelligence