Length-aware Byte Pair Encoding for Mitigating Over-segmentation in Korean Machine Translation

  • Jungseob Lee
  • Hyeonseok Moon
  • Seungjun Lee
  • Chanjun Park*
  • Sugyeong Eo
  • Hyunwoong Ko
  • Jaehyung Seo
  • Seungyoon Lee
  • Heuiseok Lim*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Byte Pair Encoding (BPE) is an effective approach in machine translation across several languages. However, our analysis indicates that BPE is prone to over-segmentation in Korean, a morphologically rich language, which can erode word semantics and lead to semantic confusion during training. This semantic confusion ultimately degrades overall translation quality. To address this issue, we introduce Length-aware Subword Vocabulary Construction (LeVoC), a novel approach that strategically incorporates longer words into the vocabulary. Using an external monolingual Korean corpus, LeVoC extracts and integrates long words, preserving morphological information and reducing semantic confusion. Our experiments demonstrate that LeVoC not only significantly outperforms BPE but can also be applied to, and surpass, current state-of-the-art morpheme-aware subword tokenization methods. We provide evidence that the difficulty of translating Korean sentences with long words is associated with morphological compositionality, and that LeVoC's ability to reduce semantic confusion during training leads to improved translation quality.
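
As a rough illustration only, the idea of extracting frequent long words from a monolingual corpus and adding them to a subword vocabulary might be sketched as follows. The abstract does not state LeVoC's selection criteria, so the function names, the length and frequency thresholds, and the greedy longest-match segmenter (a simple stand-in for BPE merges) are all assumptions made for this example, not the authors' implementation.

```python
from collections import Counter


def extract_long_words(corpus, min_len=4, min_freq=2, top_k=1000):
    """Return frequent whitespace-delimited words of at least min_len characters.

    The thresholds are hypothetical; the paper's actual criteria are not given
    in the abstract.
    """
    counts = Counter(word for line in corpus for word in line.split())
    candidates = [
        (word, freq)
        for word, freq in counts.items()
        if len(word) >= min_len and freq >= min_freq
    ]
    # Most frequent first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates[:top_k]]


def augment_vocab(base_vocab, long_words):
    """Return a new vocabulary that also contains the selected long words."""
    vocab = set(base_vocab)
    vocab.update(long_words)
    return vocab


def segment(word, vocab):
    """Greedy longest-match segmentation, falling back to single characters.

    This is a simplified stand-in for a subword tokenizer, not BPE itself.
    """
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # Accept the longest piece found in the vocabulary, or a single
            # character when nothing longer matches.
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


# Toy monolingual corpus (hypothetical data).
corpus = ["번역했습니다 번역했습니다 책을", "번역했습니다 책을 읽었다"]
base_vocab = {"번역", "했", "습니다"}

long_words = extract_long_words(corpus)
print(segment("번역했습니다", base_vocab))
print(segment("번역했습니다", augment_vocab(base_vocab, long_words)))
```

With the base vocabulary the toy word is split into three pieces, whereas the augmented vocabulary keeps it as a single token, which is the over-segmentation effect the abstract describes.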

Original language: English
Title of host publication: The 62nd Annual Meeting of the Association for Computational Linguistics
Subtitle of host publication: Findings of the Association for Computational Linguistics, ACL 2024
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Publisher: Association for Computational Linguistics (ACL)
Pages: 2287-2303
Number of pages: 17
ISBN (Electronic): 9798891760998
Publication status: Published - 2024
Event: Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Hybrid, Bangkok, Thailand
Duration: 2024 Aug 11 to 2024 Aug 16

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X

Conference

Conference: Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Country/Territory: Thailand
City: Hybrid, Bangkok
Period: 24/8/11 to 24/8/16

Bibliographical note

Publisher Copyright:
© 2024 Association for Computational Linguistics.

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
  • Computer Science Applications
