Break it Down into BTS: Basic, Tiniest Subword Units for Korean

Nayeon Kim, Jun Hyung Park, Joon Young Choi, Eojin Jeon, Youjin Kang, Sang Keun Lee

Research output: Contribution to conference › Paper › peer-review

Abstract

We introduce Basic, Tiniest Subword (BTS) units for the Korean language, which are inspired by the invention principle of Hangeul, the Korean writing system. Instead of relying on 51 Korean consonant and vowel letters, we form the letters from BTS units by adding strokes or combining them. To examine the impact of BTS units on Korean language processing, we develop a novel BTS-based word embedding framework that is readily applicable to various models. Our experiments reveal that BTS units significantly improve the performance of Korean word embedding on all intrinsic and extrinsic tasks in our evaluation. In particular, BTS-based word embedding outperforms the state-of-the-art Korean word embedding by 11.8% in word analogy. We further investigate the unique advantages provided by BTS units through in-depth analysis. Our code is available at https://github.com/irishev/BTS.
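To make the idea of sub-letter units concrete, the following is a minimal sketch, not the authors' implementation. It first splits a precomposed Hangeul syllable into its letters using the standard Unicode index arithmetic, then expands a few letters into smaller pieces. The table `ILLUSTRATIVE_SPLIT`, the `<stroke>` marker, and the function names are hypothetical and cover only three letters for illustration; the actual BTS inventory and composition rules are defined in the paper and the linked repository.

```python
# Letter tables in Unicode order for precomposed Hangeul syllables (U+AC00..U+D7A3).
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 28 finals, "" = none

# HYPOTHETICAL, partial table: how a letter might be rebuilt from a smaller
# letter by adding a stroke or by combining letters. Not the paper's inventory.
ILLUSTRATIVE_SPLIT = {
    "ㅋ": ["ㄱ", "<stroke>"],  # stroke added to ㄱ
    "ㅐ": ["ㅏ", "ㅣ"],        # two vowels combined
    "ㄲ": ["ㄱ", "ㄱ"],        # doubled consonant
}


def to_letters(syllable: str) -> list[str]:
    """Split one precomposed syllable into its consonant and vowel letters."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        return [syllable]  # not a precomposed Hangeul syllable; keep as-is
    initial, rest = divmod(index, 21 * 28)
    vowel, final = divmod(rest, 28)
    letters = [CHOSEONG[initial], JUNGSEONG[vowel]]
    if JONGSEONG[final]:
        letters.append(JONGSEONG[final])
    return letters


def to_units(word: str) -> list[str]:
    """Expand every letter of a word with the illustrative sub-letter table."""
    units = []
    for syllable in word:
        for letter in to_letters(syllable):
            units.extend(ILLUSTRATIVE_SPLIT.get(letter, [letter]))
    return units


print(to_letters("한"))  # ['ㅎ', 'ㅏ', 'ㄴ']
print(to_units("개"))    # ['ㄱ', 'ㅏ', 'ㅣ']
```

A unit sequence of this kind could then be fed to any subword-based embedding model in place of a letter sequence; how the paper does this is described in its framework section, not here.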

Original language: English
Pages: 7007-7024
Number of pages: 18
Publication status: Published - 2022
Event: 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 - Abu Dhabi, United Arab Emirates
Duration: 2022 Dec 7 to 2022 Dec 11

Conference

Conference: 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 22/12/7 to 22/12/11

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems
