Abstract
Most weakly supervised named entity recognition (NER) models rely on domain-specific dictionaries provided by experts. This approach is infeasible in the many domains where such dictionaries do not exist. Although a recent study used a phrase retrieval model to automatically construct pseudo-dictionaries from entities retrieved from Wikipedia, these dictionaries often have limited coverage because the retriever tends to retrieve popular entities rather than rare ones. In this study, we present a novel framework, HighGEN, that generates NER datasets with high-coverage pseudo-dictionaries. Specifically, we create entity-rich dictionaries with a novel search method, called phrase embedding search, which encourages the retriever to search a space densely populated with various entities. In addition, we use a new verification process, based on the embedding distance between candidate entity mentions and entity types, to reduce the false-positive noise in the weak labels generated by high-coverage dictionaries. We demonstrate that HighGEN outperforms the previous best model by an average of 4.7 F1 points across five NER benchmark datasets.
| Original language | English |
|---|---|
| Title of host publication | Long Papers |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 7148-7163 |
| Number of pages | 16 |
| ISBN (Electronic) | 9781959429722 |
| Publication status | Published - 2023 |
| Event | 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023 - Toronto, Canada. Duration: 2023 Jul 9 → 2023 Jul 14 |
Publication series
| Name | Proceedings of the Annual Meeting of the Association for Computational Linguistics |
|---|---|
| Volume | 1 |
| ISSN (Print) | 0736-587X |
Conference
| Conference | 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023 |
|---|---|
| Country/Territory | Canada |
| City | Toronto |
| Period | 2023 Jul 9 → 2023 Jul 14 |
Bibliographical note
Publisher Copyright: © 2023 Association for Computational Linguistics.
ASJC Scopus subject areas
- Language and Linguistics
- Linguistics and Language
- Computer Science Applications
Title
Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations