Doubts on the reliability of parallel corpus filtering

Hyeonseok Moon, Chanjun Park, Seonmin Koo, Jungseob Lee, Seungjun Lee, Jaehyung Seo, Sugyeong Eo, Yoonna Jang, Hyunjoong Kim, Hyoung gyu Lee, Heuiseok Lim

    Research output: Contribution to journal › Article › peer-review

    Abstract

    Parallel corpus filtering (PCF) aims to filter out low-quality data residing in parallel corpora. Recently, deep learning-based methods have been employed to assess the quality of sentence pairs in a parallel corpus, alongside rule-based filtering that removes noisy data according to pre-defined error types. Despite their use, to the best of our knowledge, the practical applicability and interpretability of PCF techniques have not been comprehensively investigated. In this study, we raise two doubts about deep learning-based PCF: (i) Can deep learning-based PCF extract high-quality data? and (ii) Are the scoring functions of PCF reliable? To answer these questions, we conduct comparative experiments on various PCF techniques with four datasets on two language pairs, English–Korean and English–Japanese. Through the experiments, we demonstrate that the performance of deep learning-based PCF depends heavily on the target parallel corpus, and that its adaptability fluctuates with the characteristics of that corpus. In particular, we find that sentences scored highly by a PCF technique do not necessarily yield high-quality results; rather, the scoring shows unintended preferences.

    Original language: English
    Article number: 120962
    Journal: Expert Systems With Applications
    Volume: 233
    Publication status: Published - 2023 Dec 15

    Bibliographical note

    Funding Information:
    This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00368, A Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques).

    Publisher Copyright:
    © 2023

    Keywords

    • Deep learning
    • Machine Translation
    • Natural Language Processing
    • Parallel corpus filtering

    ASJC Scopus subject areas

    • General Engineering
    • Computer Science Applications
    • Artificial Intelligence
