Abstract
Byte Pair Encoding (BPE) is an effective approach in machine translation across several languages. However, our analysis indicates that BPE is prone to over-segmentation in Korean, a morphologically rich language, which can erode word semantics and lead to semantic confusion during training. This semantic confusion ultimately degrades overall translation quality. To address this issue, we introduce Length-aware Subword Vocabulary Construction (LeVoC), a novel approach that strategically incorporates longer words into the vocabulary. Using an external monolingual Korean corpus, LeVoC extracts long words and integrates them into the vocabulary, preserving morphological information and reducing semantic confusion. Our experiments demonstrate that LeVoC not only significantly outperforms BPE, but can also be applied on top of current state-of-the-art morpheme-aware subword tokenization methods and surpass them. We provide evidence that the difficulty of translating Korean sentences with long words is associated with morphological compositionality, and that LeVoC's ability to reduce semantic confusion during training leads to improved translation quality.
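The general idea of adding long words from a monolingual corpus to a subword vocabulary can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the length and frequency thresholds, and the greedy longest-match segmenter (used here as a simple stand-in for BPE merge rules) are all assumptions made for illustration.

```python
from collections import Counter


def extract_long_words(corpus_lines, min_len=4, min_freq=2, max_words=1000):
    """Collect frequent whitespace-delimited words of at least min_len characters.

    min_len, min_freq and max_words are hypothetical thresholds, not values
    taken from the paper.
    """
    counts = Counter(
        word
        for line in corpus_lines
        for word in line.split()
        if len(word) >= min_len
    )
    return [w for w, c in counts.most_common(max_words) if c >= min_freq]


def extend_vocab(base_vocab, long_words):
    """Return a new vocabulary containing the base subwords plus the long words."""
    return set(base_vocab) | set(long_words)


def segment(word, vocab, unk="<unk>"):
    """Greedy longest-match segmentation (a stand-in for real BPE merges)."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts at position i.
            pieces.append(unk)
            i += 1
    return pieces


# Toy monolingual corpus and toy subword vocabulary (both illustrative only).
corpus = ["나는 학교에서 공부했습니다", "그는 학교에서 공부했습니다"]
base_vocab = {"공부", "했", "습니다", "학교", "에서"}

long_words = extract_long_words(corpus)
vocab = extend_vocab(base_vocab, long_words)

print(segment("공부했습니다", base_vocab))  # split into three subwords
print(segment("공부했습니다", vocab))       # kept as a single token
```

In this toy setting the long word is split into three pieces under the base vocabulary and kept whole once the vocabulary is extended, which is the kind of reduced segmentation the abstract describes. How long words are actually selected and integrated in LeVoC is specified in the paper, not here.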
| Original language | English |
|---|---|
| Title of host publication | The 62nd Annual Meeting of the Association for Computational Linguistics |
| Subtitle of host publication | Findings of the Association for Computational Linguistics, ACL 2024 |
| Editors | Lun-Wei Ku, Andre Martins, Vivek Srikumar |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 2287-2303 |
| Number of pages | 17 |
| ISBN (Electronic) | 9798891760998 |
| Publication status | Published - 2024 |
| Event | Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Hybrid, Bangkok, Thailand. Duration: 2024 Aug 11 → 2024 Aug 16 |
Publication series
| Name | Proceedings of the Annual Meeting of the Association for Computational Linguistics |
|---|---|
| ISSN (Print) | 0736-587X |
Conference
| Conference | Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 |
|---|---|
| Country/Territory | Thailand |
| City | Hybrid, Bangkok |
| Period | 2024 Aug 11 → 2024 Aug 16 |
Bibliographical note
Publisher Copyright: © 2024 Association for Computational Linguistics.
ASJC Scopus subject areas
- Language and Linguistics
- Linguistics and Language
- Computer Science Applications
Fingerprint
Dive into the research topics of 'Length-aware Byte Pair Encoding for Mitigating Over-segmentation in Korean Machine Translation'. Together they form a unique fingerprint.