Simple Questions Generate Named Entity Recognition Datasets

Hyunjae Kim, Jaehyo Yoo, Seunghyun Yoon, Jinhyuk Lee, Jaewoo Kang

Research output: Contribution to conference, Paper, peer-review

3 Citations (Scopus)

Abstract

Recent named entity recognition (NER) models often rely on human-annotated datasets, requiring the significant engagement of professional knowledge on the target domain and entities. This research introduces an ask-to-generate approach that automatically generates NER datasets by asking questions in simple natural language to an open-domain question answering system (e.g., “Which disease?”). Despite using fewer in-domain resources, our models, solely trained on the generated datasets, largely outperform strong low-resource models by an average F1 score of 19.4 for six popular NER benchmarks. Furthermore, our models provide competitive performance with rich-resource models that additionally leverage in-domain dictionaries provided by domain experts. In few-shot NER, we outperform the previous best model by an F1 score of 5.2 on three benchmarks and achieve new state-of-the-art performance. The code and datasets are available at https://github.com/dmis-lab/GeNER.
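
As an illustration only, the ask-to-generate idea described in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the GeNER implementation: `toy_qa` is a hypothetical stand-in for an open-domain question answering system and returns canned (sentence, answer phrase) pairs, and `bio_tag` and `generate_dataset` are illustrative helper names that do not come from the paper or its repository. The sketch shows how answers to a simple question such as "Which disease?" could be turned into token-level BIO annotations by exact string matching.

```python
from typing import Callable, Dict, List, Tuple

Tagged = List[Tuple[str, str]]  # one sentence as (token, BIO tag) pairs


def toy_qa(question: str) -> List[Tuple[str, str]]:
    """Hypothetical QA stub: returns (evidence sentence, answer phrase) pairs.

    A real open-domain QA system would retrieve these from a large corpus;
    here the results are hard-coded for illustration.
    """
    canned = {
        "Which disease?": [
            ("Patients with cystic fibrosis often need daily therapy .",
             "cystic fibrosis"),
            ("Malaria is transmitted by mosquitoes .", "Malaria"),
        ]
    }
    return canned.get(question, [])


def bio_tag(sentence: str, phrase: str, label: str) -> Tagged:
    """Tag the first exact token-level match of `phrase` with B-/I- labels."""
    tokens = sentence.split()
    target = phrase.split()
    tags = ["O"] * len(tokens)
    if target:  # an empty phrase leaves every token tagged "O"
        for i in range(len(tokens) - len(target) + 1):
            if tokens[i:i + len(target)] == target:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + len(target)):
                    tags[j] = f"I-{label}"
                break
    return list(zip(tokens, tags))


def generate_dataset(
    questions: Dict[str, str],
    qa: Callable[[str], List[Tuple[str, str]]],
) -> List[Tagged]:
    """Ask each question and convert every returned answer into a tagged sentence."""
    dataset: List[Tagged] = []
    for question, label in questions.items():
        for sentence, phrase in qa(question):
            dataset.append(bio_tag(sentence, phrase, label))
    return dataset


# Map each simple question to the entity type it is meant to collect.
examples = generate_dataset({"Which disease?": "DISEASE"}, toy_qa)
for tagged_sentence in examples:
    print(tagged_sentence)
```

Exact matching on whitespace tokens is chosen here only to keep the sketch short; the actual system's retrieval and labeling steps are documented in the linked repository.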

Original language: English
Pages: 6220-6236
Number of pages: 17
Publication status: Published - 2022
Event: 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 - Abu Dhabi, United Arab Emirates
Duration: 2022 Dec 7 to 2022 Dec 11

Conference

Conference: 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 22/12/7 to 22/12/11

Bibliographical note

Funding Information:
We thank Jungsoo Park, Gyuwan Kim, Mujeen Sung, Sungdong Kim, Yonghwa Choi, Won-jin Yoon, and Gangwoo Kim for the helpful feedback. This research was supported by (1) National Research Foundation of Korea (NRF-2020R1A2C3010638), (2) the MSIT (Ministry of Science and ICT), Korea, under the ICT Creative Consilience program (IITP-2021-2020-0-01819) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation), and (3) a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HR20C0021).

Publisher Copyright:
© 2022 Association for Computational Linguistics.

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems
