Grading can be a time-consuming task for many educators. Artificial intelligence tools may help lighten that load, according to recent research from the University of Georgia.
Many states have adopted the Next Generation Science Standards, which emphasize the importance of argumentation, inquiry, and data interpretation. But teachers following this curriculum face challenges when grading student work.
“Asking students to develop a model, write an explanation or argue with their peers involves quite complex tasks,” said Xiaoming Zhai, lead author of the study, associate professor and director of the AI4STEM Education Center at UGA’s Mary Frances Early College of Education. “Teachers often don’t have enough time to evaluate every student response, so students miss out on timely feedback.”
AI grades quickly but relies on shortcuts
The study examined how large language models (LLMs) evaluate student work compared with human graders. LLMs are a type of AI trained on vast amounts of data, mostly drawn from the internet, that allows them to “understand” and generate human language.
For this study, the LLM Mixtral was given written responses from middle school students. One question asked students to create a model showing how particles behave when heat energy is applied. A correct answer would show that molecules move more slowly when cold and faster when hot.
Mixtral then created rubrics to evaluate the responses and assign final scores.
“We still have a long way to go in our use of AI, and we need to figure out which direction to take.” —Xiaoming Zhai, College of Education
The findings showed that while LLMs can grade answers quickly, they often rely on shortcuts, such as spotting specific keywords and assuming understanding of a topic. That made them less reliable judges of whether students actually understood the material.
The study suggests LLMs could become more accurate if they were given rubrics that reflect the deep, analytical reasoning humans use when grading. Those rubrics should spell out exactly what the grader expects to see in a student’s response, so the LLM can score the answer against criteria set by the human evaluator.
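As a rough illustration of what rubric-guided grading could look like in practice, the sketch below pairs a human-written rubric with a student response in a single prompt. The rubric wording, score levels, and the query_llm helper are hypothetical placeholders, not the prompts or code used in the study.

```python
# Minimal sketch of rubric-guided LLM grading. The rubric text, score
# levels, and query_llm() stub are illustrative assumptions, not the
# study's actual prompts or implementation.

RUBRIC = (
    "Score 2: Response shows particles moving faster when heated and "
    "slower when cooled.\n"
    "Score 1: Response mentions a temperature change but not the change "
    "in particle motion.\n"
    "Score 0: Response does not connect heat energy to particle motion."
)

def build_grading_prompt(student_response: str) -> str:
    """Combine the human-written rubric and a student response into one prompt."""
    return (
        "You are grading a middle school science response.\n\n"
        f"Rubric:\n{RUBRIC}\n\n"
        f"Student response:\n{student_response}\n\n"
        "Assign a score from 0 to 2 and give a one-sentence justification "
        "that cites evidence from the response rather than assumed knowledge."
    )

def query_llm(prompt: str) -> str:
    """Stand-in for a call to any LLM API or locally hosted model."""
    raise NotImplementedError("Plug in your preferred LLM client here.")

def grade(student_response: str) -> str:
    """Score one response against the human-made rubric."""
    return query_llm(build_grading_prompt(student_response))
```

The point of the explicit criteria is to push the model to cite evidence from the student’s own response rather than infer understanding from a keyword.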
“The train has left the station, but it has only just left,” Zhai said. “That means we still have a long way to go in our use of AI, and we need to figure out which direction to take.”
LLMs and human graders assess student work differently
Typically, LLMs are given both students’ responses and the scores assigned by human graders to learn from. In this study, however, the LLMs were asked to create their own rubrics for evaluating student answers.
The researchers found that the LLM-generated rubrics resembled human-made rubrics in some ways. But while LLMs generally grasp the gist of the questions students are asked, they cannot reason the way humans do.
Instead, LLMs lean heavily on shortcuts, a tendency Zhai calls “over-inferring”: the model assumes a student understands something when a human teacher would not draw that conclusion from the response.
For example, an LLM might mark a student’s response as correct because it contains certain keywords, without examining the logic behind the student’s reasoning.
“Students might mention a temperature increase, and the large language model assumes that all students understand the particles are moving faster when temperatures rise,” Zhai said. “But based on the student’s writing, we as humans can’t tell whether the student knows the particles will actually move faster.”
LLMs rely on shortcuts especially when they are given examples of graded responses that don’t explain why those grades were assigned.
Humans remain essential in automated assessments
Although LLMs are fast, the researchers caution against replacing human graders entirely.
Human-created rubrics typically reflect the structure a teacher expects to see in student responses. Without those rubrics, the LLMs’ scoring was only 33.5% accurate. With access to human-made rubrics, accuracy rose to just over 50%.
If LLMs’ accuracy can be improved further, educators may be more willing to adopt the technology to streamline their grading.
“Many teachers have told me, ‘I had to spend my weekend giving feedback, but with automatic scoring I don’t need to do that anymore. Now I can spend more time on meaningful work instead of labor-intensive work,’” Zhai said. “That’s very encouraging to me.”
The research was published in Technology, Knowledge and Learning and co-authored by Xuansheng Wu, Padmaja Pravin Saraf, Gyeonggeon Lee, Eshan Latif, and Ninghao Liu.
The article “AI may expedite the grading process for educators” originally appeared on UGA Today.