Evaluating Large Language Models as AP Essay Scorers

My history students recently tested how well several leading large language models (LLMs) could do the work of Advanced Placement U.S. History essay graders. The models were given College Board scoring guidelines and prompted to score sample U.S. History exam responses. For each essay, the models decided whether to award points for thesis, context, evidence, reasoning, and analysis, and provided a rationale for each decision. These point decisions and rationales were then compared against College Board scoring commentary. Alignment between the LLMs and the College Board commentary was measured in two ways: (1) how often the models matched the College Board's point allocations and (2) how closely the models' rationales matched the meaning of the College Board's rationales. More about the methodology, and how the models performed, is available at the project website, apush.omeka.net/2025.
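
To make the two alignment measures concrete, here is a minimal sketch in Python of how they might be computed. This is an illustration under stated assumptions, not the project's actual pipeline: the dictionary representation of scores, the function names, and the use of sentence embeddings (the all-MiniLM-L6-v2 model here) as a stand-in for rationale-meaning comparison are all hypothetical choices.

```python
# Sketch of the two alignment measures: (1) exact point-allocation agreement
# and (2) semantic similarity of scoring rationales. All names and data
# structures are illustrative assumptions, not the project's actual code.

from sentence_transformers import SentenceTransformer, util

# The five rubric categories named in the write-up.
RUBRIC = ["thesis", "context", "evidence", "reasoning", "analysis"]


def point_agreement(llm_scores: dict, board_scores: dict) -> float:
    """Fraction of rubric categories where the LLM's points match the College Board's."""
    matches = sum(llm_scores[cat] == board_scores[cat] for cat in RUBRIC)
    return matches / len(RUBRIC)


def rationale_similarity(llm_rationale: str, board_rationale: str,
                         model: SentenceTransformer) -> float:
    """Cosine similarity between embeddings of the two rationales (one proxy for
    'conveying similar meaning'; the embedding model choice is an assumption)."""
    embeddings = model.encode([llm_rationale, board_rationale])
    return util.cos_sim(embeddings[0], embeddings[1]).item()


if __name__ == "__main__":
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Hypothetical point decisions for one essay.
    llm = {"thesis": 1, "context": 1, "evidence": 2, "reasoning": 0, "analysis": 1}
    board = {"thesis": 1, "context": 0, "evidence": 2, "reasoning": 0, "analysis": 1}

    print(f"point agreement: {point_agreement(llm, board):.2f}")  # 0.80

    sim = rationale_similarity(
        "Awards the thesis point because the essay makes a defensible claim.",
        "The thesis point is earned for a historically defensible claim.",
        model,
    )
    print(f"rationale similarity: {sim:.2f}")
```

A design note on the sketch: exact point matching is simple to tally per category, while "similar meaning" has no single canonical metric; embedding cosine similarity is just one common proxy, and a study like this could equally use human judgment or another text-similarity measure.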