An empirical study on web mining of parallel data

Gumwon Hong, Chi Ho Li, Ming Zhou, Hae Chang Rim

Research output: Contribution to conference, Paper, peer-review

15 Citations (Scopus)

Abstract

This paper presents an empirical approach to mining parallel corpora. Conventional approaches use a readily available collection of comparable, nonparallel corpora to extract parallel sentences. This paper attempts the much more challenging task of directly searching for high-quality sentence pairs from the Web. We tackle the problem by formulating good search queries using "Learning to Rank" and by filtering noisy document pairs using IBM Model 1 alignment. End-to-end evaluation shows that the proposed approach significantly improves the performance of statistical machine translation.

Original language: English
Pages: 474-482
Number of pages: 9
Publication status: Published - 2010
Event: 23rd International Conference on Computational Linguistics, Coling 2010 - Beijing, China
Duration: 2010 Aug 23 → 2010 Aug 27

Other

Other: 23rd International Conference on Computational Linguistics, Coling 2010
Country/Territory: China
City: Beijing
Period: 10/8/23 → 10/8/27

ASJC Scopus subject areas

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Linguistics and Language
