An empirical study on web mining of parallel data

Gumwon Hong, Chi Ho Li, Ming Zhou, Hae Chang Rim

    Research output: Contribution to conference › Paper › peer-review

    15 Citations (Scopus)

    Abstract

    This paper presents an empirical approach to mining parallel corpora. Conventional approaches use a readily available collection of comparable, nonparallel corpora to extract parallel sentences. This paper attempts the much more challenging task of searching the Web directly for high-quality sentence pairs. We tackle the problem by formulating good search queries using Learning to Rank and by filtering noisy document pairs using IBM Model 1 alignment. End-to-end evaluation shows that the proposed approach significantly improves the performance of statistical machine translation.

    Original language: English
    Pages: 474-482
    Number of pages: 9
    Publication status: Published - 2010
    Event: 23rd International Conference on Computational Linguistics, Coling 2010 - Beijing, China
    Duration: 2010 Aug 23 to 2010 Aug 27

    ASJC Scopus subject areas

    • Language and Linguistics
    • Computational Theory and Mathematics
    • Linguistics and Language
