AI Grading Tool Markr Produced Wildly Inconsistent Scores for Identical Essays

Severity
Medium

AI essay grading tool Markr demonstrated severe inconsistency by giving wildly different scores to identical essays and was easily manipulated through superficial text changes, raising concerns about AI reliability in educational assessment.

Category
bias_unfairness
Industry
Education
Status
Reported
Date Occurred
Date Reported
Sep 6, 2023
Jurisdiction
Australia
AI Provider
Other/Unknown
Application Type
embedded
Harm Type
educational
Human Review in Place
Unknown
Litigation Filed
No
Tags
education, grading, assessment, consistency, bias, gaming, validation

Full Description

In September 2023, investigations into AI-powered essay grading systems revealed significant reliability problems with automated assessment tools used in educational settings. The AI grading platform Markr, which had been adopted by educational institutions for essay evaluation, was found to produce dramatically inconsistent scores when the same essay was submitted multiple times: identical essays could receive vastly different grades depending on when and how they were submitted.

More troubling was the discovery that superficial modifications, such as replacing simple words with more sophisticated vocabulary or merely lengthening an essay without improving its content, could dramatically inflate scores regardless of the actual merit of the writing. The inconsistency appeared to stem from the model's inability to maintain stable evaluation criteria across assessment sessions. Researchers found that the system was overly influenced by surface-level features like word complexity and document length rather than the substantive elements that human educators prioritize: argument quality, use of evidence, and logical coherence.

These findings raised serious concerns about the validity of grades assigned by AI systems and the potential for students to game such systems through cosmetic changes to their work. Educational institutions using these tools faced questions about the fairness and accuracy of their assessment processes, particularly for high-stakes evaluations that affect student academic progress and outcomes.

The incident highlighted broader challenges in developing AI systems for educational assessment, where consistency, fairness, and resistance to manipulation are critical requirements. The findings suggested that current AI grading technology may not be sufficiently mature for unsupervised use in educational settings without significant human oversight and validation processes.
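The repeat-submission test described above is straightforward to reproduce against any grading endpoint. The sketch below shows the general shape of such a check; Markr's actual interface is not public, so the grade_essay callable, the trial count, and the tolerance threshold are all hypothetical stand-ins rather than details of the real system.

```python
import statistics

def consistency_check(grade_essay, essay_text, trials=10, tolerance=2.0):
    """Submit the same essay repeatedly and summarize score stability.

    grade_essay: callable mapping essay text to a numeric score; a
                 hypothetical stand-in for the grading system under test.
    tolerance:   maximum acceptable score spread, in grade points.
    """
    scores = [grade_essay(essay_text) for _ in range(trials)]
    spread = max(scores) - min(scores)
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "spread": spread,
        "consistent": spread <= tolerance,
    }
```

A stable grader should keep the spread within a small tolerance across trials; the instability reported here would surface as a wide spread for the very same input.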

Root Cause

The AI grading system lacked stable evaluation criteria and was susceptible to manipulation through superficial text modifications such as vocabulary substitution and length inflation, indicating that the model had learned to reward surface features rather than to assess actual content quality.
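A pre-deployment probe for exactly this failure mode applies meaning-preserving cosmetic edits and verifies that the score barely moves. A minimal sketch follows; the synonym table, filler sentence, and grade_essay callable are illustrative assumptions, not features of the actual system.

```python
# Words swapped purely for surface sophistication; meaning is unchanged.
FANCY_SYNONYMS = {"use": "utilise", "show": "demonstrate", "big": "substantial"}

def inflate_vocabulary(text):
    """Swap plain words for fancier synonyms without changing meaning."""
    return " ".join(FANCY_SYNONYMS.get(w.lower(), w) for w in text.split())

def pad_length(text, filler=" Moreover, this point merits consideration.", copies=5):
    """Append content-free filler sentences to increase length only."""
    return text + filler * copies

def gaming_probe(grade_essay, essay_text, max_drift=1.0):
    """Check that cosmetic, meaning-preserving edits barely move the score."""
    baseline = grade_essay(essay_text)
    drift = {
        "vocabulary": grade_essay(inflate_vocabulary(essay_text)) - baseline,
        "length": grade_essay(pad_length(essay_text)) - baseline,
    }
    return {"baseline": baseline, "drift": drift,
            "robust": all(abs(d) <= max_drift for d in drift.values())}
```

A grader trained on content quality should show near-zero drift under these edits; the reported behavior corresponds to large positive drift on both transforms.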

Mitigation Analysis

Consensus scoring across multiple AI models, mandatory human review for high-stakes assessments, and robust testing protocols using identical essays with known quality benchmarks could have identified these inconsistencies. Regular calibration testing with essays modified only superficially would have revealed the gaming vulnerabilities before deployment.
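One way to combine the first two suggestions is a panel-and-escalation pattern: several independent graders score each essay, the median becomes the provisional grade, and large disagreement routes the essay to a human marker. A minimal sketch, assuming hypothetical grader callables:

```python
import statistics

def consensus_grade(graders, essay_text, disagreement_threshold=3.0):
    """Score an essay with a panel of graders and escalate on disagreement.

    graders: list of callables, each mapping essay text to a numeric score
             (hypothetical stand-ins for independent grading models).
    Returns (provisional_grade, needs_human_review).
    """
    scores = [grade(essay_text) for grade in graders]
    provisional = statistics.median(scores)
    needs_human_review = max(scores) - min(scores) > disagreement_threshold
    return provisional, needs_human_review
```

The disagreement threshold would need to be calibrated against human-marked benchmark essays before the escalation rate could be trusted in a high-stakes setting.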

Lessons Learned

AI grading systems require extensive validation testing for consistency and gaming resistance before deployment in educational settings. The incident demonstrates the critical need for human oversight in high-stakes assessment scenarios and highlights the risk of over-relying on automated evaluation without proper calibration and quality assurance measures.