
AI Essay Grading Systems Systematically Penalize Non-Native English Speakers

Severity
High

Research revealed that AI essay grading systems such as e-rater systematically assigned lower scores to essays by non-native English speakers despite equivalent content quality, affecting standardized test outcomes for international students.

Category
Bias
Industry
Education
Status
Ongoing
Date Occurred
Jan 1, 2019
Date Reported
Apr 15, 2019
Jurisdiction
US
AI Provider
Other/Unknown
Model
e-rater
Application Type
embedded
Harm Type
operational
People Affected
200,000
Human Review in Place
Yes
Litigation Filed
No
Tags
educational_bias, automated_scoring, language_discrimination, standardized_testing, ESL_students

Full Description

Multiple AI-powered essay grading systems used in high-stakes standardized testing, including Educational Testing Service's e-rater and Vantage Learning's IntelliMetric, were found to exhibit systematic bias against non-native English speakers. Academic research published in 2019 demonstrated that these systems consistently assigned lower scores to essays written by non-native speakers, even when the content quality, argumentation, and ideas were equivalent to those of native speakers.

The bias manifested through the algorithms' heavy weighting of linguistic features such as sentence complexity, vocabulary sophistication, and grammatical structures that favor native English writing patterns. Non-native speakers, despite demonstrating strong content knowledge and critical thinking skills, were penalized for using simpler sentence structures, more common vocabulary, or slight grammatical variations that are typical of second-language acquisition patterns.

The impact was particularly significant for international students taking standardized tests such as the GRE, TOEFL, and state assessment exams, where AI grading was increasingly being used to supplement or replace human scorers. Research indicated that approximately 200,000 test-takers annually could be affected by these scoring disparities, with potential consequences for college admissions, scholarship opportunities, and academic placement decisions.

Educational Testing Service and other testing companies initially defended their systems, arguing that language proficiency was a legitimate component of writing assessment. However, researchers demonstrated that the bias persisted even when controlling for overall English proficiency levels, suggesting the algorithms were not accurately measuring writing quality but rather privileging specific linguistic styles associated with native speakers.

The controversy intensified debates about the appropriate role of AI in educational assessment, particularly for diverse student populations. Critics argued that automated scoring systems risked perpetuating educational inequalities by systematically disadvantaging students from non-English speaking backgrounds, while proponents emphasized the need for consistent and scalable assessment methods in an era of increasing test volume.
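To illustrate the kind of controlled analysis the researchers describe (testing whether non-native status predicts scores even after accounting for proficiency), the sketch below regresses scores on both variables. The data is synthetic and purely illustrative; it does not reproduce the published study's data or methodology.

# Minimal sketch: does non-native status predict scores after controlling
# for proficiency? All data below is synthetic and illustrative only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000

# Synthetic cohort: proficiency on a 0-100 scale, half non-native writers.
proficiency = rng.normal(70, 10, n)
is_non_native = rng.integers(0, 2, n)

# Simulated scores with a built-in penalty for non-native writers that is
# independent of proficiency -- the pattern the research reported.
score = 0.05 * proficiency - 0.4 * is_non_native + rng.normal(0, 0.5, n)

X = sm.add_constant(np.column_stack([proficiency, is_non_native]))
model = sm.OLS(score, X).fit()

# A significant negative coefficient on is_non_native (index 2) indicates a
# scoring gap that proficiency alone does not explain.
print(model.params)   # [const, proficiency, is_non_native]
print(model.pvalues)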

Root Cause

AI grading algorithms were trained primarily on writing samples from native English speakers and weighted linguistic features such as sentence complexity and vocabulary sophistication over content quality, creating systematic bias against the linguistic patterns typical of non-native writers.
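The sketch below makes this failure mode concrete with a deliberately simplified, hypothetical surface-feature scorer: it rewards long sentences and uncommon vocabulary and ignores content entirely, so an essay written in the shorter, common-word style typical of second-language writers scores lower even when it makes the same argument. The features, weights, and word list are invented for illustration; e-rater's actual model is proprietary and far more complex.

# Hypothetical surface-feature scorer, illustrating the root cause described
# above. Weights and features are invented; they do not reflect e-rater.
COMMON_WORDS = {"the", "a", "is", "are", "and", "of", "to", "in", "that",
                "this", "it", "was", "good", "very", "important", "because"}

def surface_score(essay: str) -> float:
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = essay.lower().replace(".", " ").split()
    avg_sentence_len = len(words) / max(len(sentences), 1)
    rare_ratio = sum(w not in COMMON_WORDS for w in words) / max(len(words), 1)
    # Heavy weight on complexity and vocabulary, none on content quality:
    # exactly the imbalance the research identified.
    return 0.1 * avg_sentence_len + 5.0 * rare_ratio

# Two essays making the same argument; the second uses the simpler sentence
# structures and more common vocabulary typical of second-language writing.
native_style = ("Standardized assessment, notwithstanding its considerable "
                "administrative convenience, frequently obscures substantive "
                "disparities in pedagogical opportunity.")
learner_style = ("Standardized tests are easy to give. But they can hide "
                 "big differences in what students get to learn.")

print(surface_score(native_style))   # higher, despite equivalent content
print(surface_score(learner_style))  # lower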

Mitigation Analysis

Bias could have been reduced through more diverse training data that included non-native speaker samples, bias testing across demographic groups, and content-focused scoring rubrics that de-emphasize linguistic complexity. Regular algorithmic auditing would help surface systematic bias patterns before deployment, and human review of large machine-human score discrepancies would catch residual bias in operational use.
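A minimal sketch of the kind of pre-deployment audit suggested here: compare machine scores against human reference scores per demographic group and flag any group whose average machine-human gap exceeds a tolerance. The group labels, tolerance value, and sample data are assumptions for the sketch, not established auditing standards.

# Illustrative audit: flag demographic groups whose machine scores diverge
# systematically from human reference scores. Threshold and labels are
# assumptions for this sketch.
from statistics import mean

def audit_score_gaps(records, tolerance=0.25):
    """records: iterable of (group, human_score, machine_score) tuples."""
    by_group = {}
    for group, human, machine in records:
        by_group.setdefault(group, []).append(machine - human)
    flagged = {}
    for group, gaps in by_group.items():
        avg_gap = mean(gaps)
        if abs(avg_gap) > tolerance:
            flagged[group] = round(avg_gap, 3)
    return flagged  # groups whose machine scores systematically diverge

sample = [
    ("native", 4.0, 4.1), ("native", 3.5, 3.4), ("native", 4.5, 4.6),
    ("non_native", 4.0, 3.4), ("non_native", 3.5, 3.0), ("non_native", 4.5, 4.0),
]
print(audit_score_gaps(sample))  # {'non_native': -0.533}

Run against a representative validation set for each deployment context, a check like this would have surfaced the non-native scoring gap before scores reached admissions decisions.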

Lessons Learned

The incident highlights the critical importance of bias testing across diverse demographic groups before deploying AI systems in high-stakes educational contexts, and demonstrates how seemingly objective automated scoring can perpetuate systemic inequalities.