Perplexity AI Accused of Plagiarizing Content and Fabricating Source Citations

Medium

Forbes and other publishers accused Perplexity AI of scraping protected content, bypassing robots.txt restrictions, and generating fabricated source citations while giving inadequate attribution to the original creators.

Full Description

In June 2024, Forbes and multiple other major publishers accused Perplexity AI of systematically violating their content protection measures and copyright policies. The controversy centered on Perplexity's AI-powered search engine, which aggregates information from across the web to provide direct answers to user queries. Forbes documented specific instances where Perplexity had scraped and paraphrased their exclusive reporting without providing proper attribution or driving traffic back to the original articles.

The accusations revealed that Perplexity was bypassing robots.txt files, which are industry-standard mechanisms that websites use to communicate crawling restrictions to automated systems. Publishers like Forbes, Wired, and Condé Nast had explicitly blocked AI crawlers in their robots.txt files, yet Perplexity continued to access and utilize their content. This raised serious questions about respect for publisher consent and established web protocols designed to protect content creators' rights.

A particularly concerning aspect of the incident involved Perplexity's generation of fabricated source citations. Investigation revealed that the AI system was creating references to articles that either didn't exist or didn't contain the information being attributed to them. This practice not only misled users about the credibility of information but also potentially damaged the reputations of legitimate news organizations by associating them with false or inaccurate claims.

Perplexity initially defended its practices, arguing that its technology fell within fair use parameters and that it was providing valuable summarization services. However, the company faced mounting pressure from the publishing industry and eventually acknowledged some of the concerns. The incident highlighted broader tensions between AI companies seeking to train and operate their systems on web content and publishers trying to protect their intellectual property and maintain control over how their content is used and attributed.

The controversy extended beyond individual publisher complaints to raise fundamental questions about the sustainability of journalism and content creation in an AI-driven information ecosystem. Publishers argued that services like Perplexity were essentially monetizing their work while providing little to no compensation or traffic attribution, potentially undermining the economic model that supports quality journalism and content creation.

Root Cause

Perplexity AI's system bypassed robots.txt restrictions and scraped content from publishers, then used AI to paraphrase material without proper attribution while generating fabricated source citations to support its answers.

Mitigation Analysis

Three controls would likely have reduced the risk of this incident: checking robots.txt before every fetch and honoring the result, verifying each source citation against the cited page before an answer is shown (with human review where automated checks are inconclusive), and attributing content with backlinks to the original sources. Content provenance tracking and publisher partnership agreements would further reduce the risk of copyright violations.
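As an illustration of the robots.txt compliance check described above, the following is a minimal sketch using Python's standard-library `urllib.robotparser`. It is not Perplexity's implementation. The function name `is_fetch_allowed`, the crawler name `ExampleAIBot`, and the sample rules are hypothetical and chosen only for this example.

```python
from urllib.robotparser import RobotFileParser


def is_fetch_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True only if the given robots.txt text permits user_agent to fetch url.

    The robots.txt content is passed in as a string so the check can be
    exercised offline; a real crawler would download it from the site first.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: one named AI crawler is blocked site-wide,
# every other user agent is allowed.
EXAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    url = "https://example.com/exclusive-report"
    for agent in ("ExampleAIBot", "OtherBot"):
        verdict = "allowed" if is_fetch_allowed(EXAMPLE_ROBOTS, agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

A check like this only helps if the crawler identifies itself with a consistent user agent and treats a "blocked" result as final. It does not cover fetches made through third-party services or unlisted agents.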

Regulatory Framework References

EU AI Act

Art. 53—Obligations for General-Purpose AI Models

ISO/IEC 42001

A.5.4—Legal Compliance

NIST AI RMF

GOVERN 1.2—Legal Compliance

Lessons Learned

This incident demonstrates the critical importance of respecting web standards like robots.txt and implementing proper attribution systems in AI applications that aggregate content. It also highlights the need for clear industry standards around AI content usage and the potential legal risks of fabricating source citations.
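The citation risk noted above can be partly reduced with an automated grounding check before an answer is shown. The sketch below is a deliberately naive illustration, assuming the system already holds the fetched text of the cited page. The names `normalize` and `citation_is_supported` are hypothetical, and verbatim substring matching is a stand-in for whatever verification a production system would use, since it cannot validate paraphrased claims.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase the text for comparison."""
    return re.sub(r"\s+", " ", text).strip().lower()


def citation_is_supported(quoted_snippet: str, source_text: str) -> bool:
    """Return True if the quoted snippet appears verbatim in the source text.

    Empty snippets are rejected so that a missing quote never counts as
    support. Paraphrases are not detected by this check.
    """
    snippet = normalize(quoted_snippet)
    return bool(snippet) and snippet in normalize(source_text)


if __name__ == "__main__":
    source = "The publisher blocked  AI crawlers\nin its robots.txt file."
    print(citation_is_supported("blocked AI crawlers in its robots.txt", source))
    print(citation_is_supported("approved AI crawlers", source))
```

Citations that fail such a check could be dropped or routed to human review rather than displayed as if they were verified.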

Sources

Plagiarism Concerns Mount As Perplexity AI Is Accused Of Scraping Content Without Consent

Forbes · Jun 27, 2024 · news

Perplexity Is a Bullshit Machine

Wired · Jun 28, 2024 · news