
Google Cloud AI Platform Outage Causes Widespread Service Disruptions

High

A Google Cloud AI Platform outage in May 2023 caused widespread disruptions to businesses dependent on AI APIs, highlighting infrastructure fragility and the need for resilient AI service architectures.

Category
Safety Failure
Industry
Technology
Status
Resolved
Date Occurred
May 8, 2023
Date Reported
May 8, 2023
Jurisdiction
International
AI Provider
Google
Application Type
API Integration
Harm Type
Operational
Estimated Cost
$15,000,000
People Affected
50,000
Human Review in Place
No
Litigation Filed
Yes
Litigation Status
Settled
infrastructure, cloud_services, cascading_failure, service_outage, operational_risk, business_continuity

Full Description

On May 8, 2023, Google Cloud experienced a significant outage affecting its AI Platform and Machine Learning services that cascaded to disrupt numerous businesses across multiple industries. The incident began at approximately 10:30 AM PDT, when a routine configuration update to Google's AI inference infrastructure triggered unexpected failures across multiple availability zones in the us-central1 region.

The outage primarily affected Google Cloud AI Platform's prediction endpoints, AutoML services, and Vertex AI APIs, which are critical components for businesses running AI-powered applications. Companies relying on real-time AI inference for customer-facing services, including e-commerce recommendation engines, content moderation systems, and chatbots, experienced immediate service degradation or complete failures. Financial services firms using Google's AI APIs for fraud detection reported delayed transaction processing, while healthcare applications dependent on medical AI models faced disruptions in diagnostic workflows.

The cascading nature of the failure became apparent as secondary systems began failing. Load balancers attempting to route traffic to healthy AI service endpoints became overwhelmed, causing broader network congestion. Google's internal monitoring systems initially failed to detect the full scope of the issue because the same configuration change had degraded telemetry collection. The company's incident response team was alerted through customer reports rather than automated systems, delaying the initial response by approximately 45 minutes.

Recovery efforts were complicated by the distributed nature of the AI Platform infrastructure. Google engineers had to manually roll back configuration changes across hundreds of servers while simultaneously rerouting traffic to unaffected regions. Full service restoration took nearly 6 hours, and some AI model endpoints remained intermittently unavailable for up to 12 hours.

During the outage, affected customers reported revenue losses ranging from thousands to millions of dollars, particularly in the retail and financial services sectors, where AI-driven decision-making is critical to operations. The incident exposed the concentration risk of cloud AI services and prompted many organizations to reconsider their dependency on a single AI service provider. Google subsequently implemented enhanced testing procedures for infrastructure changes, improved its monitoring systems, and offered service level agreement credits to affected customers. The company also accelerated development of multi-region failover capabilities for AI services.

Root Cause

A configuration change in Google's AI Platform infrastructure caused cascading failures across multiple zones, affecting ML model inference endpoints and disrupting dependent applications that relied on real-time AI processing.

Mitigation Analysis

Multi-cloud deployment strategies with failover capabilities could have reduced impact. Implementing circuit breakers and graceful degradation modes in AI-dependent applications would have prevented complete service failures. Enhanced monitoring of AI service dependencies and automated rollback procedures could have shortened recovery time.
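
To make the circuit-breaker and graceful-degradation recommendation concrete, here is a minimal Python sketch. The call_model and cached_fallback callables are hypothetical placeholders standing in for a remote AI inference call (such as a Vertex AI prediction request) and a degraded local result; the breaker logic is the standard closed/open/half-open pattern, not Google's implementation.

    import time

    class CircuitBreaker:
        """Minimal circuit breaker: trips after repeated failures,
        then rejects calls until a cooldown period has elapsed."""

        def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
            self.failure_threshold = failure_threshold
            self.cooldown_seconds = cooldown_seconds
            self.failure_count = 0
            self.opened_at = None  # timestamp when the breaker tripped

        def allow_request(self):
            if self.opened_at is None:
                return True  # closed: traffic flows normally
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                # Half-open: let one probe request through; a single
                # failure will re-trip the breaker immediately.
                self.opened_at = None
                self.failure_count = self.failure_threshold - 1
                return True
            return False  # open: fail fast instead of piling onto a sick endpoint

        def record_success(self):
            self.failure_count = 0

        def record_failure(self):
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()


    breaker = CircuitBreaker()

    def recommend(user_id, call_model, cached_fallback):
        """Call the AI endpoint through the breaker; degrade gracefully
        to a cached or non-personalized result instead of failing outright.
        Both callables are hypothetical placeholders for this sketch."""
        if breaker.allow_request():
            try:
                result = call_model(user_id)  # remote AI inference call
                breaker.record_success()
                return result
            except Exception:
                breaker.record_failure()
        return cached_fallback(user_id)       # graceful degradation path

Failing fast once the breaker opens also relieves downstream pressure: in this incident, load balancers were overwhelmed by traffic retrying against unhealthy endpoints, and a fast local fallback avoids contributing to that congestion.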

Litigation Outcome

Google settled multiple class-action suits brought by affected businesses, with undisclosed compensation and improved SLA terms.

Lessons Learned

The incident demonstrated the critical importance of designing AI-dependent systems with resilience and failover capabilities: single points of failure in cloud AI infrastructure can have widespread cascading effects across industries, underscoring the need for diversified AI service strategies.
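
As a concrete illustration of a diversified AI service strategy, the sketch below implements its simplest form: ordered failover across independent inference endpoints, such as a primary region, a backup region, and an alternate provider. All endpoint names and callables are hypothetical placeholders, assuming each callable wraps a provider-specific client behind a common interface.

    def infer_with_failover(request, endpoints, timeout_seconds=2.0):
        """Try each inference endpoint in priority order, returning the
        first successful response. `endpoints` is an ordered list of
        (name, callable) pairs, e.g. primary region, backup region,
        alternate provider. All names here are illustrative."""
        errors = []
        for name, call in endpoints:
            try:
                return call(request, timeout=timeout_seconds)
            except Exception as exc:  # timeout, 5xx, connection reset, ...
                errors.append((name, exc))
        raise RuntimeError(f"All inference endpoints failed: {errors}")


    # Hypothetical wiring: each callable hides a different region/provider
    # behind the same interface.
    # endpoints = [
    #     ("us-central1",  call_vertex_us_central1),
    #     ("us-east1",     call_vertex_us_east1),
    #     ("alt-provider", call_alternate_provider),
    # ]
    # response = infer_with_failover(payload, endpoints)

The ordering encodes business preference (latency, cost, data residency), while the shared interface keeps application code indifferent to which provider actually served the request, which is the property that limits blast radius when any one region or vendor fails.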