TY - CONF
T1 - Simple Questions Generate Named Entity Recognition Datasets
AU - Kim, Hyunjae
AU - Yoo, Jaehyo
AU - Yoon, Seunghyun
AU - Lee, Jinhyuk
AU - Kang, Jaewoo
N1 - Funding Information:
We thank Jungsoo Park, Gyuwan Kim, Mujeen Sung, Sungdong Kim, Yonghwa Choi, Won-jin Yoon, and Gangwoo Kim for their helpful feedback. This research was supported by (1) the National Research Foundation of Korea (NRF-2020R1A2C3010638), (2) the MSIT (Ministry of Science and ICT), Korea, under the ICT Creative Consilience program (IITP-2021-2020-0-01819) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and (3) a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HR20C0021).
Publisher Copyright:
© 2022 Association for Computational Linguistics.
PY - 2022
Y1 - 2022
N2 - Recent named entity recognition (NER) models often rely on human-annotated datasets, requiring the significant engagement of professional knowledge on the target domain and entities. This research introduces an ask-to-generate approach that automatically generates NER datasets by asking questions in simple natural language to an open-domain question answering system (e.g., “Which disease?”). Despite using fewer in-domain resources, our models, solely trained on the generated datasets, largely outperform strong low-resource models by an average F1 score of 19.4 for six popular NER benchmarks. Furthermore, our models provide competitive performance with rich-resource models that additionally leverage in-domain dictionaries provided by domain experts. In few-shot NER, we outperform the previous best model by an F1 score of 5.2 on three benchmarks and achieve new state-of-the-art performance. The code and datasets are available at https://github.com/dmis-lab/GeNER.
UR - http://www.scopus.com/inward/record.url?scp=85149440268&partnerID=8YFLogxK
M3 - Paper
AN - SCOPUS:85149440268
SP - 6220
EP - 6236
T2 - 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
Y2 - 7 December 2022 through 11 December 2022
ER -