Doubts on the reliability of parallel corpus filtering

Hyeonseok Moon, Chanjun Park, Seonmin Koo, Jungseob Lee, Seungjun Lee, Jaehyung Seo, Sugyeong Eo, Yoonna Jang, Hyunjoong Kim, Hyoung gyu Lee, Heuiseok Lim

Research output: Contribution to journal › Article › peer-review


Parallel corpus filtering (PCF) aims to filter out low-quality data residing in parallel corpora. Recently, deep learning-based methods have been employed to assess the quality of sentence pairs in a parallel corpus, alongside rule-based filtering that removes noisy data according to pre-defined error types. Despite their use, to the best of our knowledge, the practical applicability and interpretability of PCF techniques have not been comprehensively investigated. In this study, we raise two doubts about deep learning-based PCF: (i) Can deep learning-based PCF extract high-quality data? and (ii) Are the scoring functions of PCF reliable? To answer these questions, we conduct comparative experiments on various PCF techniques with four datasets on two language pairs, English–Korean and English–Japanese. Through the experiments, we demonstrate that the performance of deep learning-based PCF depends heavily on the target parallel corpus and shows fluctuating adaptability depending on that corpus's characteristics. In particular, we find that sentences scored highly by a PCF technique do not necessarily yield high-quality results; rather, the scoring shows unintended preferences.
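
As a rough illustration of the two filtering styles the abstract contrasts, the sketch below combines a few rule-based checks with a pluggable scoring function and a score threshold. It is a minimal sketch under stated assumptions, not the method evaluated in the article: the specific rules, the `max_len_ratio` and `threshold` values, and the names `passes_rules`, `filter_corpus`, and `toy_score` are all hypothetical, and `toy_score` merely stands in for a deep learning-based scorer.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]


def passes_rules(src: str, tgt: str, max_len_ratio: float = 3.0) -> bool:
    """Rule-based check against a few hypothetical pre-defined error types."""
    src, tgt = src.strip(), tgt.strip()
    # Error type 1: one side is empty.
    if not src or not tgt:
        return False
    # Error type 2: the "translation" is just a copy of the source.
    if src == tgt:
        return False
    # Error type 3: implausible character-length ratio (a crude heuristic;
    # the 3.0 default is an illustrative assumption, not a tuned value).
    longer = max(len(src), len(tgt))
    shorter = min(len(src), len(tgt))
    return longer / shorter <= max_len_ratio


def filter_corpus(
    pairs: Iterable[Pair],
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> List[Pair]:
    """Keep pairs that pass the rules and score at least `threshold`."""
    kept = []
    for src, tgt in pairs:
        if passes_rules(src, tgt) and score_fn(src, tgt) >= threshold:
            kept.append((src, tgt))
    return kept


def toy_score(src: str, tgt: str) -> float:
    """Placeholder scorer: length similarity in (0, 1] for non-empty inputs.

    A real pipeline would plug in a learned quality estimator here.
    """
    return min(len(src), len(tgt)) / max(len(src), len(tgt))


if __name__ == "__main__":
    corpus = [
        ("Hello.", "안녕하세요."),
        ("", "빈 문장"),
        ("Same text", "Same text"),
    ]
    print(filter_corpus(corpus, toy_score, threshold=0.5))
```

Because the scorer is passed in as a function, the same pipeline can be run with different scoring functions on the same corpus, which is the kind of comparison the abstract's second question (whether PCF scoring functions are reliable) calls for.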

Original language: English
Article number: 120962
Journal: Expert Systems With Applications
Publication status: Published - 2023 Dec 15

Bibliographical note

Funding Information:
This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00368, A Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques).

Publisher Copyright:
© 2023

Keywords

  • Deep learning
  • Machine Translation
  • Natural Language Processing
  • Parallel corpus filtering

ASJC Scopus subject areas

  • Engineering(all)
  • Computer Science Applications
  • Artificial Intelligence

