Representation learning for unseen words by bridging subwords to semantic networks

Yeachan Kim, Kang Min Kim, Sang Keun Lee

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Pre-trained word embeddings are widely used in various fields. However, their coverage is limited to the words that appear in the corpora on which they were trained. Words that do not appear in the training corpus are therefore ignored in downstream tasks, which can limit the performance of neural models. In this paper, we propose a simple yet effective method to represent out-of-vocabulary (OOV) words. Unlike prior works that rely solely on either subword information or knowledge, our method uses both to represent OOV words. To this end, we propose two stages of representation learning. In the first stage, we learn subword embeddings from the pre-trained word embeddings by using an additive composition function of subwords. In the second stage, we map the learned subwords into semantic networks (e.g., WordNet). We then re-train the subword embeddings by using lexical entries in semantic lexicons, which may include newly observed subwords. This two-stage learning greatly broadens the coverage of words. The experimental results show that our method provides consistent performance improvements over strong baselines that use subwords or lexical resources separately.
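
The first stage described in the abstract (learning subword embeddings so that their sum approximates a pre-trained word vector) can be illustrated with a small sketch. This is not the authors' implementation: the character n-gram segmentation, the squared-error objective, the plain gradient updates, and all names (`char_ngrams`, `compose`, `fit_subwords`) are assumptions made for illustration, and the random vectors stand in for real pre-trained embeddings. The second stage (re-training against a semantic network such as WordNet) is not covered here.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word padded with boundary markers (assumed segmentation)."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def compose(word, subword_vecs, dim):
    """Additive composition: sum the vectors of all known subwords of the word."""
    vec = np.zeros(dim)
    for gram in char_ngrams(word):
        if gram in subword_vecs:
            vec += subword_vecs[gram]
    return vec

def fit_subwords(word_vecs, dim, epochs=500, lr=0.05, seed=0):
    """Fit subword vectors so that their sum approximates each pre-trained word vector."""
    rng = np.random.default_rng(seed)
    subword_vecs = {}
    for word in word_vecs:
        for gram in char_ngrams(word):
            subword_vecs.setdefault(gram, rng.normal(scale=0.1, size=dim))
    for _ in range(epochs):
        for word, target in word_vecs.items():
            grams = char_ngrams(word)
            err = compose(word, subword_vecs, dim) - target
            # Gradient step on 0.5 * ||sum - target||^2, scaled by the n-gram count.
            for gram in grams:
                subword_vecs[gram] -= lr * err / len(grams)
    return subword_vecs

# Toy stand-in for pre-trained embeddings (random vectors, for illustration only).
rng = np.random.default_rng(1)
word_vecs = {w: rng.normal(size=4) for w in ["playing", "played", "walking"]}
subs = fit_subwords(word_vecs, dim=4)

# "walked" is out of vocabulary, but it shares n-grams with "walking" and "played".
oov = compose("walked", subs, dim=4)
print(oov)
```

In this sketch an unseen word receives a vector only from the n-grams it shares with the training vocabulary; subwords never observed in stage one contribute nothing, which is the gap the abstract's second stage is meant to address.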

    Original language: English
    Title of host publication: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
    Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
    Publisher: European Language Resources Association (ELRA)
    Pages: 4774-4780
    Number of pages: 7
    ISBN (Electronic): 9791095546344
    Publication status: Published - 2020
    Event: 12th International Conference on Language Resources and Evaluation, LREC 2020 - Marseille, France
    Duration: 2020 May 11 to 2020 May 16

    Publication series

    Name: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

    Conference

    Conference: 12th International Conference on Language Resources and Evaluation, LREC 2020
    Country/Territory: France
    City: Marseille
    Period: 20/5/11 to 20/5/16

    Bibliographical note

    Publisher Copyright:
    © European Language Resources Association (ELRA), licensed under CC-BY-NC

    Keywords

    • Knowledge Representation
    • Lexicon
    • Semantics

    ASJC Scopus subject areas

    • Language and Linguistics
    • Education
    • Library and Information Sciences
    • Linguistics and Language
