Can we Use GPT-4 as a Mathematics Evaluator in Education? Exploring the Efficacy and Limitation of LLM-based Automatic Assessment System for Open-ended Mathematics Question

  • Unggi Lee
  • Youngin Kim
  • Sangyun Lee
  • Jaehyeon Park
  • Jin Mun
  • Eunseo Lee
  • Hyeoncheol Kim
  • Cheolil Lim
  • Yun Joo Yoo*
  • *Corresponding author for this work

Research output: Contribution to journal › Letter › peer-review

5 Citations (Scopus)

Abstract

This paper explores the potential of Large Language Models (LLMs), specifically GPT-4, to enhance the precision and effectiveness of Automated Assessment Systems (AAS) for open-ended mathematics problems. While LLMs have demonstrated transformative capabilities across various disciplines, their application in AAS, particularly for mathematical logic and open-ended problem-solving, remains underexplored. Our research addresses this gap by developing and critically evaluating a GPT-4-based AAS. We analyzed 4,180 responses to open-ended mathematics questions from 380 sixth-grade primary school students. Three human experts and the GPT-4 model independently assessed these responses using a pre-established rubric. Our findings reveal high consistency between human and GPT-4 assessments in most instances, highlighting the potential of integrating GPT-4 into AAS. We categorized scoring discrepancies between GPT-4 and human raters by error type and identified specific mathematical content areas where automated assessment faced limitations. We evaluated two strategies to enhance GPT-4’s assessment capabilities: (1) using elaborate prompts and (2) implementing advanced prompt engineering techniques such as Chain-of-thought, Self-consistency, and Tree-of-thought. While comprehensive prompts significantly improved assessment quality, directly applying advanced prompt engineering techniques produced suboptimal results, indicating a need for further refinement. This study contributes to the emerging body of research evaluating GPT-4 in the context of AAS for open-ended mathematics problems, shedding light on both the strengths and limitations of this approach. Our findings provide valuable insights and a foundation for future research to refine the integration of LLMs in AAS, particularly in mathematics education.
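
As an illustration of what comparing human and GPT-4 rubric scores can look like in practice, the sketch below computes exact agreement and Cohen's kappa between two score lists, and aggregates repeated model scores by majority vote in the spirit of Self-consistency. The abstract does not say which agreement statistic the authors used, so the metrics, the function names, and the toy scores here are all assumptions made for illustration, not the paper's method or data.

```python
from collections import Counter


def exact_agreement(a, b):
    """Fraction of responses on which two raters gave the same score."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and equally long")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    n = len(a)
    p_observed = exact_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label
    # if each rated independently according to their own label frequencies.
    labels = counts_a.keys() | counts_b.keys()
    p_expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


def majority_vote(sampled_scores):
    """Self-consistency style aggregation: keep the most frequent score."""
    return Counter(sampled_scores).most_common(1)[0][0]


# Toy rubric scores (0-2) for eight responses; invented for illustration.
human_scores = [2, 1, 0, 2, 1, 0, 2, 2]
model_scores = [2, 1, 0, 1, 1, 0, 2, 0]

agreement = exact_agreement(human_scores, model_scores)
kappa = cohens_kappa(human_scores, model_scores)
voted = majority_vote([2, 2, 1])  # three hypothetical samples for one response
```

Kappa is shown alongside raw agreement because a rubric with few score levels can yield high raw agreement by chance alone; with three human raters, a multi-rater statistic would be the natural extension.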

Original language: English
Pages (from-to): 1560-1596
Number of pages: 37
Journal: International Journal of Artificial Intelligence in Education
Volume: 35
Issue number: 3
Publication status: Published - 2025 Sept

Bibliographical note

Publisher Copyright:
© International Artificial Intelligence in Education Society 2024.

ASJC Scopus subject areas

  • Education
  • Computational Theory and Mathematics
