Abstract
This paper presents an empirical approach to mining parallel corpora. Conventional approaches use a readily available collection of comparable, nonparallel corpora to extract parallel sentences. This paper attempts the much more challenging task of directly searching for high-quality sentence pairs from the Web. We tackle the problem by formulating good search queries using learning to rank and by filtering noisy document pairs using IBM Model 1 alignment. End-to-end evaluation shows that the proposed approach significantly improves the performance of statistical machine translation.
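The filtering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual scoring function, threshold, or translation table, so everything below is an assumption: the function names `model1_log_score` and `keep_pair`, the toy lexical table `t`, the probability floor, and the threshold of -5.0 are all hypothetical. The sketch only shows the general idea of scoring a candidate pair with the IBM Model 1 likelihood (length-normalised, in log space) and discarding pairs that score poorly.

```python
import math

def model1_log_score(src, tgt, t, floor=1e-9):
    """Length-normalised IBM Model 1 log-likelihood of tgt given src.

    t maps (target_word, source_word) to a lexical translation probability.
    The table and the floor value are illustrative assumptions.
    """
    # Model 1 lets every target word align to any source word or to NULL.
    src = ["<NULL>"] + list(src)
    total = 0.0
    for f in tgt:
        # Uniform alignment prior: average t(f|e) over all source positions.
        p = sum(t.get((f, e), 0.0) for e in src) / len(src)
        # Floor unseen words so that the logarithm stays finite.
        total += math.log(max(p, floor))
    # Normalise by target length so long and short pairs are comparable.
    return total / max(len(tgt), 1)

def keep_pair(src, tgt, t, threshold=-5.0):
    """Keep a candidate pair only if its score reaches the (assumed) threshold."""
    return model1_log_score(src, tgt, t) >= threshold

# Toy lexical table, invented for illustration only.
t = {("la", "the"): 0.7, ("maison", "house"): 0.8, ("bleue", "blue"): 0.9}

english = ["the", "blue", "house"]
good = ["la", "maison", "bleue"]   # plausible translation
bad = ["chat", "noir", "dort"]     # unrelated sentence

print(keep_pair(english, good, t), keep_pair(english, bad, t))
```

In practice the lexical table would be estimated with EM on some seed parallel data, and the threshold would be tuned on held-out pairs; neither detail is stated in the abstract.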
Original language | English |
---|---|
Pages | 474-482 |
Number of pages | 9 |
Publication status | Published - 2010 |
Event | 23rd International Conference on Computational Linguistics, Coling 2010 - Beijing, China. Duration: 2010 Aug 23 → 2010 Aug 27 |
Other
Other | 23rd International Conference on Computational Linguistics, Coling 2010 |
---|---|
Country/Territory | China |
City | Beijing |
Period | 2010 Aug 23 → 2010 Aug 27 |
ASJC Scopus subject areas
- Language and Linguistics
- Computational Theory and Mathematics
- Linguistics and Language