Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations

Hyunjae Kim, Jaehyo Yoo, Seunghyun Yoon, Jaewoo Kang

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Most weakly supervised named entity recognition (NER) models rely on domain-specific dictionaries provided by experts. This approach is infeasible in many domains where dictionaries do not exist. While a phrase retrieval model was used to construct pseudo-dictionaries with entities retrieved from Wikipedia automatically in a recent study, these dictionaries often have limited coverage because the retriever is likely to retrieve popular entities rather than rare ones. In this study, we present a novel framework, HighGEN, that generates NER datasets with high-coverage pseudo-dictionaries. Specifically, we create entity-rich dictionaries with a novel search method, called phrase embedding search, which encourages the retriever to search a space densely populated with various entities. In addition, we use a new verification process based on the embedding distance between candidate entity mentions and entity types to reduce the false-positive noise in weak labels generated by high-coverage dictionaries. We demonstrate that HighGEN outperforms the previous best model by an average F1 score of 4.7 across five NER benchmark datasets.

Original languageEnglish
Title of host publicationLong Papers
PublisherAssociation for Computational Linguistics (ACL)
Pages7148-7163
Number of pages16
ISBN (Electronic)9781959429722
Publication statusPublished - 2023
Event61st Annual Meeting of the Association for Computational Linguistics, ACL 2023 - Toronto, Canada
Duration: 2023 Jul 92023 Jul 14

Publication series

NameProceedings of the Annual Meeting of the Association for Computational Linguistics
Volume1
ISSN (Print)0736-587X

Conference

Conference61st Annual Meeting of the Association for Computational Linguistics, ACL 2023
Country/TerritoryCanada
CityToronto
Period23/7/923/7/14

Bibliographical note

Publisher Copyright:
© 2023 Association for Computational Linguistics.

ASJC Scopus subject areas

  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics

Fingerprint

Dive into the research topics of 'Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations'. Together they form a unique fingerprint.

Cite this