Acquiring Korean lexical entry from a raw corpus

Wonhee Yu, Kinam Park, Soonyoung Jung, Heuiseok Lim

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

This paper proposes a computational model of lexical entry acquisition based on a representation model of the mental lexicon. The proposed model acquires lexical entries from a raw corpus by unsupervised learning, as humans do. The model is composed of a full-form acquisition module and a morpheme acquisition module. In the full-form acquisition module, core full-forms are acquired automatically according to frequency and recency thresholds. In the morpheme acquisition module, a substring that occurs repeatedly in different full-forms is chosen as a candidate morpheme. The candidate is then confirmed as a morpheme using an entropy measure over the syllables in the string. Experimental results on a Korean corpus of about 16 million full-forms show that the model successfully acquires major full-forms and morphemes with precisions of 100% and 99.04%, respectively.
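
The two modules described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact thresholds, the recency definition, or the entropy formula, so everything below is an assumption made for illustration. Recency is taken as "last seen within `max_gap` tokens of the end of the corpus", candidates are restricted to prefixes shared by at least two full-forms, and the entropy is the Shannon entropy of the syllable that follows the candidate. The function names, parameters, and toy data are hypothetical. Each precomposed Hangul character is treated as one syllable.

```python
import math
from collections import Counter, defaultdict


def acquire_full_forms(tokens, min_freq=3, max_gap=1000):
    """Keep full-forms that are frequent and recently seen.

    Assumption: 'recent' means the last occurrence lies within
    max_gap tokens of the end of the corpus.
    """
    counts = Counter()
    last_seen = {}
    for i, tok in enumerate(tokens):
        counts[tok] += 1
        last_seen[tok] = i
    end = len(tokens) - 1
    return {
        tok for tok, c in counts.items()
        if c >= min_freq and end - last_seen[tok] <= max_gap
    }


def candidate_prefixes(full_forms, min_len=1):
    """Map each proper prefix to the full-forms containing it.

    Only prefixes shared by at least two different full-forms are
    kept (a simplification of 'repeatedly occurring substring').
    """
    owners = defaultdict(set)
    for word in full_forms:
        for k in range(min_len, len(word)):  # proper prefixes only
            owners[word[:k]].add(word)
    return {p: ws for p, ws in owners.items() if len(ws) >= 2}


def successor_entropy(prefix, words):
    """Shannon entropy (bits) of the syllable right after prefix."""
    following = Counter(w[len(prefix)] for w in words)
    total = sum(following.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in following.values()
    )


def acquire_morphemes(full_forms, min_entropy=1.0):
    """Accept candidates whose successor entropy reaches the threshold."""
    cands = candidate_prefixes(full_forms)
    return {
        p for p, ws in cands.items()
        if successor_entropy(p, ws) >= min_entropy
    }


# Toy data: three inflected forms of one noun plus a different noun.
toy_forms = {"학교에", "학교는", "학교를", "학생이"}
toy_morphemes = acquire_morphemes(toy_forms)
```

On the toy data, the candidate "학교" is followed by three different syllables and passes the threshold, while "학" is followed mostly by "교" and is rejected. The intuition is that high uncertainty about the next syllable suggests a morpheme boundary; the threshold of 1.0 bit is arbitrary here and would need tuning on real data.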

Original language: English
Title of host publication: 2010 2nd International Conference on Information Technology Convergence and Services, ITCS 2010
Publication status: Published - 2010
Event: 2010 2nd International Conference on Information Technology Convergence and Services, ITCS 2010 - Cebu, Philippines
Duration: 2010 Aug 11 to 2010 Aug 13

Publication series

Name: 2010 2nd International Conference on Information Technology Convergence and Services, ITCS 2010

Other

Other: 2010 2nd International Conference on Information Technology Convergence and Services, ITCS 2010
Country/Territory: Philippines
City: Cebu
Period: 10/8/11 to 10/8/13

Keywords

  • Language learning
  • Lexical acquisition
  • Machine readable dictionary
  • Mental lexicon

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
