ETS e-rater Automated Essay Scoring System Exhibited Gaming Vulnerabilities and Dialect Bias

High

MIT researcher Les Perelman demonstrated that ETS's e-rater automated essay scoring system, used for the GRE and GMAT, could be gamed with verbose nonsense and showed bias against non-standard English dialects.

Category
Bias
Industry
Education
Status
Resolved
Date Occurred
Jan 1, 2012
Date Reported
May 1, 2012
Jurisdiction
US
AI Provider
Other/Unknown
Model
e-rater
Application Type
embedded
Harm Type
operational
People Affected
700,000
Human Review in Place
Yes
Litigation Filed
No
educational_assessment, algorithmic_bias, test_fairness, NLP, gaming_vulnerability, linguistic_bias, automated_scoring

Full Description

In 2012, MIT researcher Les Perelman published research exposing critical flaws in Educational Testing Service's (ETS) e-rater automated essay scoring system, which was used to evaluate essays on the Graduate Record Examination (GRE) and the Graduate Management Admission Test (GMAT). Developed by ETS and deployed alongside human graders, e-rater was designed to score written responses from hundreds of thousands of test takers each year consistently and efficiently.

Perelman showed that e-rater could be systematically gamed with verbose but semantically meaningless writing. Essays padded with complex vocabulary, long sentences, and sophisticated-sounding but nonsensical content consistently outscored coherent, well-reasoned shorter essays. In one notable example, an essay that opened with factually incorrect statements and contained logical contradictions throughout received a 5 out of 6 simply because it was long and used advanced vocabulary. The system prioritized surface-level linguistic features such as word count, sentence complexity, and vocabulary sophistication over content quality, logical reasoning, and factual accuracy.

The research also uncovered systematic bias against writers of non-standard English dialects, particularly African American Vernacular English (AAVE) and other varieties common among minority populations. Essays written in these dialects consistently received lower scores even when their content was substantively equivalent to essays written in Standard Academic English. This bias had profound implications for educational equity, as it potentially disadvantaged qualified candidates from diverse linguistic backgrounds in graduate admissions.

The technical root cause lay in e-rater's natural language processing pipeline, which weighted superficial textual features heavily while lacking semantic understanding or checks for logical coherence. Its statistical models were trained primarily on essays written in Standard Academic English, building in a bias toward that linguistic variety. ETS initially defended the system by noting that it was used in conjunction with human graders, but Perelman's work demonstrated that the algorithmic bias could still influence overall scoring outcomes and perpetuate educational inequities. Following this research and subsequent criticism, ETS modified the e-rater system and increased transparency about its limitations, though automated essay scoring remains a subject of ongoing debate in educational assessment.
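
To illustrate the training-data mechanism described above, here is a minimal Python sketch that assumes nothing about ETS's actual models, which are proprietary. The "model" is bare vocabulary coverage over a toy corpus of Standard Academic English, so an equivalent sentence in another dialect loses points solely for out-of-vocabulary forms. All names (build_vocab, coverage_score) and the corpus are hypothetical.

```python
import re

# Hypothetical illustration only: ETS's training data and models are
# proprietary. Here the "model" is just vocabulary coverage, so any
# dialect absent from training is penalized regardless of meaning.

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def build_vocab(training_essays):
    return {word for essay in training_essays for word in tokenize(essay)}

def coverage_score(essay, vocab):
    words = tokenize(essay)
    return sum(word in vocab for word in words) / len(words)

sae_training = ["The results are not finished yet.",
                "She is going to the laboratory today."]
vocab = build_vocab(sae_training)

sae_essay = "The results are not finished yet."
aave_essay = "The results ain't finished yet."  # same meaning, different dialect

print(coverage_score(sae_essay, vocab))   # 1.0
print(coverage_score(aave_essay, vocab))  # 0.8 -- docked only for "ain't"
```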

Root Cause

The e-rater system relied heavily on surface-level features such as essay length, vocabulary complexity, and syntactic patterns rather than semantic understanding, leaving it vulnerable to gaming via verbose nonsense and biased against non-standard English dialects.
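
As a deliberately simplistic illustration of this failure mode, the Python sketch below scores essays on surface features alone. It is not e-rater, whose feature set and weights are proprietary; the features and weights here are invented, but they reproduce the observed behavior: long, ornate nonsense outscores short, coherent prose.

```python
import re

def toy_surface_scorer(essay: str) -> float:
    """Hypothetical scorer rewarding only surface features: word count,
    average word length, and sentence length. Not ETS's e-rater."""
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    if not words or not sentences:
        return 0.0
    avg_word_len = sum(len(w) for w in words) / len(words)  # "vocabulary" proxy
    avg_sent_len = len(words) / len(sentences)              # "complexity" proxy
    # Weights are arbitrary, chosen so long, ornate text scores high.
    raw = 0.002 * len(words) + 0.3 * avg_word_len + 0.05 * avg_sent_len
    return min(6.0, raw)  # capped at the essay's 6-point scale

coherent = "Dogs bark. Cats meow. Both are pets people love."
nonsense = ("The multifarious epistemological ramifications of "
            "quintessentially obfuscatory paradigms necessitate "
            "interminable recontextualization. ") * 20

print(round(toy_surface_scorer(coherent), 1))  # ~1.4: low, despite coherence
print(round(toy_surface_scorer(nonsense), 1))  # ~4.3: high, despite meaninglessness
```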

Mitigation Analysis

The incident could have been prevented through more diverse training data including non-standard English dialects, semantic coherence validation beyond surface features, and adversarial testing with deliberately crafted nonsensical essays. Regular bias audits across demographic groups and continuous human validation of edge cases would have identified these systematic flaws earlier.
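
Below is a minimal sketch of two of these safeguards, assuming a generic score_essay callable (such as either toy scorer above) rather than any real ETS interface; the function names and thresholds are hypothetical.

```python
# Hypothetical safeguards: names, signatures, and thresholds are assumptions.

def adversarial_probe(score_essay, nonsense_essays, max_acceptable=3.0):
    """Flag the scorer if any deliberately meaningless essay scores high."""
    return [(essay[:40], score) for essay in nonsense_essays
            if (score := score_essay(essay)) > max_acceptable]

def dialect_gap_audit(score_essay, essays_by_group, max_gap=0.5):
    """Compare mean scores on substantively equivalent essays across
    dialect groups; a large gap signals potential linguistic bias."""
    means = {group: sum(map(score_essay, essays)) / len(essays)
             for group, essays in essays_by_group.items()}
    gap = max(means.values()) - min(means.values())
    return means, gap, gap <= max_gap

# e.g. adversarial_probe(toy_surface_scorer, [nonsense]) returns a
# non-empty list, flagging the surface scorer above as gameable.
```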

Lessons Learned

This incident highlighted the risks of deploying NLP systems without adequate bias testing across diverse populations and the danger of optimizing for easily measurable surface features rather than deeper semantic understanding in high-stakes assessment contexts.

Sources

How to game standardized writing tests
Washington Post · May 15, 2012 · news
Gaming the System
Inside Higher Ed · May 16, 2012 · news