Abstract
Most of the recent natural language processing (NLP) studies are based on the pretrain-finetuning approach (PFA). However for small and medium-sized industries with insufficient hardware, there are many limitations in servicing latest PFA based NLP application software, due to slow speed and insufficient memory. Since these approaches generally require large amounts of data, it is much more difficult to service with PFA especially for low-resource languages. We propose a new tokenization method, ONE-Piece, to address this limitation. ONE-Piece combines morphologically-aware subword tokenization and vocabulary communicating method, which has not been carefully considered before. Our proposed method can also be utilized without modifying the model structure. We experiment by applying ONE-Piece to Korean, a morphologically-rich and low-resource language. We revealed that ONE-Piece with vanilla transformer model can achieve comparable performance to the current Korean-English machine translation state-of-the-art model.
Original language | English |
---|---|
Title of host publication | NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics |
Subtitle of host publication | Human Language Technologies, Industry Papers |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 97-104 |
Number of pages | 8 |
ISBN (Electronic) | 9781954085473 |
DOIs | |
Publication status | Published - 2021 |
Event | 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021 - Virtual, Online Duration: 2021 Jun 6 → 2021 Jun 11 |
Publication series
Name | NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Industry Papers |
---|
Conference
Conference | 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021 |
---|---|
City | Virtual, Online |
Period | 21/6/6 → 21/6/11 |
Bibliographical note
Publisher Copyright:© 2021 Association for Computational Linguistics.
ASJC Scopus subject areas
- Computer Networks and Communications
- Hardware and Architecture
- Information Systems
- Software