Abstract
Most weakly supervised named entity recognition (NER) models rely on domain-specific dictionaries provided by experts. This approach is infeasible in the many domains where such dictionaries do not exist. A recent study used a phrase retrieval model to automatically construct pseudo-dictionaries from entities retrieved from Wikipedia, but these dictionaries often have limited coverage because the retriever tends to return popular entities rather than rare ones. In this study, we present a novel framework, HighGEN, that generates NER datasets with high-coverage pseudo-dictionaries. Specifically, we create entity-rich dictionaries with a novel search method, called phrase embedding search, which encourages the retriever to search a space densely populated with various entities. In addition, we use a new verification process based on the embedding distance between candidate entity mentions and entity types to reduce the false-positive noise in weak labels generated by high-coverage dictionaries. We demonstrate that HighGEN outperforms the previous best model by an average of 4.7 F1 points across five NER benchmark datasets.
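The verification idea in the abstract, filtering weak labels by the embedding distance between a candidate mention and an entity type, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine distance, the nearest-type rule, and the `max_distance` threshold are all assumptions made for illustration, and the vectors in the usage note are toy values rather than outputs of a real encoder.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0
    return 1.0 - dot / norm


def verify_weak_labels(
    candidates: List[Tuple[str, str, Vector]],
    type_embeddings: Dict[str, Vector],
    max_distance: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[Tuple[str, str]]:
    """Keep a (mention, weak_type) pair only if the mention embedding is
    nearest to its weakly assigned type and within max_distance of it."""
    kept = []
    for mention, weak_type, embedding in candidates:
        distances = {
            type_name: cosine_distance(embedding, type_vector)
            for type_name, type_vector in type_embeddings.items()
        }
        nearest_type = min(distances, key=distances.get)
        if nearest_type == weak_type and distances[weak_type] <= max_distance:
            kept.append((mention, weak_type))
    return kept
```

For example, with toy type embeddings `{"disease": [1.0, 0.0], "location": [0.0, 1.0]}`, a candidate `("Turkey", "disease", [0.2, 0.8])` would be dropped as a likely false positive because its embedding lies closer to the `location` type, while `("influenza", "disease", [0.9, 0.1])` would be kept.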
Original language | English |
---|---|
Title of host publication | Long Papers |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 7148-7163 |
Number of pages | 16 |
ISBN (Electronic) | 9781959429722 |
Publication status | Published - 2023 |
Event | 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023 - Toronto, Canada. Duration: 2023 Jul 9 → 2023 Jul 14 |
Publication series
Name | Proceedings of the Annual Meeting of the Association for Computational Linguistics |
---|---|
Volume | 1 |
ISSN (Print) | 0736-587X |
Conference
Conference | 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023 |
---|---|
Country/Territory | Canada |
City | Toronto |
Period | 2023 Jul 9 → 2023 Jul 14 |
Bibliographical note
Publisher Copyright: © 2023 Association for Computational Linguistics.
ASJC Scopus subject areas
- Computer Science Applications
- Linguistics and Language
- Language and Linguistics