AWS Outage 2025: DNS Failure Disrupts Internet Globally

When a single point of failure brings down much of the internet, it exposes the hidden vulnerabilities in our cloud-dependent digital infrastructure. On October 20, 2025, Amazon Web Services experienced a DNS failure that disrupted over 2,500 companies and affected millions of users, a stark reminder of how concentrated power in cloud computing creates systemic risk.

Early Monday morning, internet users worldwide encountered an unwelcome surprise: their favorite apps, banking services, and streaming platforms simply stopped working. From Snapchat to Fortnite, from Ring doorbell cameras to Venmo payments, a cascade of digital services went dark. The culprit? A domain name system failure in Amazon Web Services’ US-EAST-1 region in Northern Virginia, the internet’s most critical data center hub.

Key Takeaways

The October 2025 Amazon Web Services outage serves as a critical wake-up call about the hidden vulnerabilities in our cloud-dependent digital infrastructure. When a DNS failure in a single data center region can disrupt over 2,500 companies and affect millions of users worldwide, the systemic risks of centralized cloud computing become impossible to ignore.

For businesses, the message is clear: relying solely on a single cloud provider creates unacceptable risk. Implementing multi-cloud strategies, establishing robust disaster recovery plans, and maintaining geographic redundancy aren’t optional extras; they’re essential safeguards for business continuity in an increasingly cloud-dependent world.

As we continue moving critical infrastructure and services to the cloud, the industry must grapple with fundamental questions about resilience, redundancy, and the sustainability of consolidating so much power in so few companies. The next major outage isn’t a matter of if, but when, and organizations that prepare now will be far better positioned to weather the storm.

Understanding the Amazon Web Services Outage

What Happened During the AWS Outage

The disruption began at approximately 12:11 AM PDT (3:11 AM Eastern Time) when Amazon Web Services first reported “increased error rates and latencies for multiple AWS services” in its US-EAST-1 region. Within roughly 90 minutes, AWS engineers identified the root cause: a DNS resolution failure affecting the DynamoDB API endpoint.

This seemingly technical glitch had massive real-world consequences. DynamoDB, Amazon’s high-performance NoSQL database service, functions as what cybersecurity expert Mike Chapple described as “the system of record—the memory of the internet”. When the DNS system couldn’t properly translate DynamoDB’s web address into the IP address needed for connections, applications lost access to their essential data storage.

By 2:24 AM PDT (5:24 AM Eastern Time), AWS had implemented initial mitigations that showed signs of recovery. However, the underlying DNS problem wasn’t fully resolved until around 6:35 AM Eastern Time (3:35 AM PDT). Even then, AWS warned that “some services will have a backlog of work to process, which may require additional time to fully address”.

The Scale of Disruption

The outage’s reach was staggering. DownDetector, a website that tracks online service disruptions, recorded over 11 million user reports affecting more than 2,500 companies worldwide. Services spanning multiple industries ground to a halt:

Social Media and Communication: Snapchat, Signal, Reddit, WhatsApp, and Slack experienced widespread outages, leaving users unable to connect with friends, family, or colleagues.

Gaming Platforms: Popular games including Fortnite, Roblox, and Pokémon GO became inaccessible, disrupting millions of players globally.

Financial Services: Cryptocurrency exchange Coinbase, investment app Robinhood, payment processor Venmo, and UK banks including Lloyds, Halifax, and Bank of Scotland all reported issues. This prevented users from executing trades, transferring money, or accessing their accounts.

E-commerce and Delivery: Amazon.com itself struggled with connectivity issues, while delivery and logistics platforms faced payment processing problems due to banking complications.

Education Technology: The Canvas learning management system went down, preventing students from accessing coursework and teachers from grading assignments.

Amazon’s Own Ecosystem: Even Amazon’s proprietary services weren’t immune. Ring doorbell cameras stopped recording, Alexa-powered smart speakers couldn’t respond to voice commands, and Kindle users couldn’t download books.

Why This Outage Matters: The Hidden Fragility of Cloud Infrastructure

The Single Point of Failure Problem

This incident illuminates a critical vulnerability in modern internet architecture: the concentration of digital infrastructure among just a handful of massive cloud providers. Amazon Web Services alone commands approximately 30-31% of the global cloud infrastructure market as of Q2 2025, generating $30.9 billion in quarterly revenue. When combined with Microsoft Azure (20% market share) and Google Cloud (12-13%), these three companies control more than 60% of the world’s cloud computing capacity.

Patrick Burgess, a cybersecurity expert at BCS, The Chartered Institute for IT, explained the systemic risk: “So much of the world now relies on these three or four big compute companies who provide the underlying infrastructure that when there’s an issue like this, it can be really impactful across a broad range, a broad spectrum of online services”.

This centralization creates what security professionals call a “single point of failure”. When AWS experiences problems, the ripple effects cascade through thousands of dependent services simultaneously. Unlike the internet’s original decentralized design intended to withstand localized failures, today’s cloud-dependent infrastructure can experience widespread collapse from a single regional issue.

The Economic Cost of Cloud Downtime

The financial impact of AWS outages extends far beyond Amazon itself. According to analysis by Tenscope, major websites collectively lose approximately $75 million per hour during widespread cloud disruptions. The breakdown reveals Amazon bearing the largest share at $72.8 million per hour, followed by Snapchat ($612,000/hour), Zoom ($533,000/hour), Roblox ($411,000/hour), and Fortnite ($400,000/hour).

Mehdi Daoudi, CEO of internet performance monitoring company Catchpoint, told Al Jazeera that the economic impact of this October 2025 incident could “reach into hundreds of millions of dollars” due to lost productivity and suspended business activities. For context, the 2024 CrowdStrike outage cost Fortune 500 companies approximately $5.4 billion.

Financial institutions face particularly acute risks. In an industry where success is measured in milliseconds and continuous global connectivity is essential, even brief outages can trigger massive losses. During the AWS disruption, trading platforms like Robinhood and Coinbase couldn’t execute transactions, banks couldn’t process payments, and wealth management systems went offline.

Understanding DNS and DynamoDB: The Technical Root Cause

What Is DNS and Why Does It Matter?

To understand why this outage happened, you need to grasp how the Domain Name System works. DNS functions as the internet’s phonebook, translating human-readable domain names (like “amazon.com”) into the numerical IP addresses (in a format like “93.184.216.34”) that computers use to locate and connect to websites.

Every time you type a web address, visit an app, or stream a video, your device queries a DNS server to find the correct IP address. This translation process typically happens in milliseconds and goes completely unnoticed—until it breaks.

When DNS fails, the translation step breaks down. Your device knows you want to reach “amazon.com,” but it can’t find the address it needs to get there. It’s like knowing the name of a business but having no way to look up its street address.
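
The lookup step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular application handled the outage: the `resolve` helper and its injectable `lookup` parameter are assumptions made for the example, while `socket.getaddrinfo` and `socket.gaierror` are the standard library's real resolver call and its failure exception.

```python
import socket
from typing import Callable, List


def resolve(hostname: str, lookup: Callable = socket.getaddrinfo) -> List[str]:
    """Return the unique IP addresses for hostname, or [] if resolution fails."""
    try:
        # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
        # sockaddr[0] is the IP address string for both IPv4 and IPv6.
        results = lookup(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name is known, but it could not be translated into an address.
        return []
    return sorted({info[4][0] for info in results})
```

On a machine with working DNS, `resolve("example.com")` returns one or more addresses. When the resolver cannot answer, it returns an empty list, and the application has nothing to connect to even though the server behind the name may be running normally.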

DynamoDB’s Critical Role

Amazon DynamoDB is a fully managed, serverless NoSQL database service designed to deliver single-digit millisecond performance at any scale. It’s not consumer-facing software that most people recognize, but it’s fundamental to countless applications and services used daily.

DynamoDB stores and retrieves massive amounts of data with extremely low latency. Major platforms use it for everything from shopping cart functionality to user authentication systems. During events like Amazon Prime Day, DynamoDB APIs handle trillions of calls, supporting Alexa, Amazon.com, and all Amazon fulfillment centers.

Because DynamoDB is what experts call a “foundational service,” many other AWS offerings depend on it to function properly. When the DNS system couldn’t resolve DynamoDB’s API endpoint address, it triggered cascading failures across 64 internal AWS services. Applications couldn’t access their stored data, authentication systems couldn’t verify users, and real-time services lost their memory.

Cybersecurity expert Mike Chapple elaborated on the impact: “While Amazon had the data secured, it was temporarily inaccessible to users for several hours, cutting off applications from their necessary information. It’s like a significant portion of the internet suffering from brief amnesia”.
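
To make the idea of a cascading failure concrete, here is a small sketch that walks a dependency map and lists everything knocked out when one component fails. The service names and the `DEPENDS_ON` map are hypothetical and do not describe AWS's real internal architecture; the point is only that a failure low in the stack reaches every service built on top of it.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency map: each service lists what it needs in order to work.
DEPENDS_ON: Dict[str, List[str]] = {
    "dns": [],
    "database-api": ["dns"],
    "auth": ["database-api"],
    "shopping-cart": ["database-api"],
    "checkout": ["auth", "shopping-cart"],
    "static-site": [],
}


def impacted_by(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents: Dict[str, List[str]] = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first search outward from the failed component.
    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted
```

In this toy map, a failure of `dns` takes down `database-api`, `auth`, `shopping-cart`, and `checkout`, while `static-site` is untouched because it has no dependency on the failed component.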

AWS Outage History: A Pattern of Vulnerability

Previous Major Disruptions

This wasn’t Amazon’s first rodeo with widespread outages, and it likely won’t be the last. AWS has experienced several significant disruptions over the past decade:

2017 (February 28): An S3 storage outage in US-EAST-1 lasted several hours after an employee accidentally deleted more servers than intended while running a routine maintenance playbook. The incident affected numerous websites including Medium, Slack, Imgur, and Trello. Notably, AWS’s own status dashboard initially failed to update properly because it depended on S3.

2020 (November 25): Kinesis Data Streams API became impaired in US-EAST-1 beginning at 9:52 AM PST, preventing customers from reading or writing data.

2021 (December 7): An impairment of several network devices in US-EAST-1 caused widespread errors across all AWS services for more than five hours—AWS’s longest recent outage. The disruption affected everything from airline reservations and auto dealerships to payment apps and video streaming services.

2023 (June 13): AWS Lambda experienced increased error rates and latencies in the Northern Virginia region for several hours.

2025 (October 20): The DNS/DynamoDB failure discussed in this article affected over 2,500 companies and millions of users globally.

Why US-EAST-1 Is Particularly Critical

The US-EAST-1 region in Northern Virginia appears repeatedly in AWS outage histories, and for good reason. As AWS’s original and largest data center location, it serves as a default region for many services and hosts critical infrastructure supporting operations across the United States and Europe.

One Reddit user familiar with AWS architecture noted: “The heavy dependence of AWS on the US-EAST-1 region has been a significant vulnerability almost since its inception, leading to many local problems escalating into global crises”. When something fails in US-EAST-1, the impact often cascades globally because so many core services and global features rely on endpoints in this region.

How to Protect Your Business from Cloud Outages

Implement Multi-Cloud and Hybrid Strategies

The most effective defense against single-provider outages is distributing your workloads across multiple cloud platforms. A multi-cloud strategy means using services from AWS, Microsoft Azure, and Google Cloud simultaneously, ensuring that if one provider experiences downtime, your critical operations can continue on another platform.

According to a 2022 Cisco survey of 2,577 IT decision-makers, 73% of organizations utilize hybrid cloud solutions (combining public and private cloud infrastructure) specifically for backup and disaster recovery. This approach keeps critical workloads operational on private cloud infrastructure even when public cloud services fail.

Key multi-cloud best practices include:

Use a single Infrastructure as Code (IaC) framework like Terraform across all cloud providers rather than maintaining separate templates for each platform
Automate cross-cloud backup policies to ensure data redundancy without manual intervention
Distribute workloads across multiple geographic regions to protect against regional disasters
Implement automated failover mechanisms that detect outages and redirect traffic to healthy systems
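
The last practice in the list, automated failover, can be sketched as a priority-ordered health check. This is a simplified illustration under stated assumptions: the endpoint URLs are hypothetical, `pick_endpoint` and `http_health_check` are names invented for the example, and a production setup would more likely rely on DNS-level or load-balancer failover than on application code alone.

```python
import urllib.request
from typing import Callable, Optional, Sequence

# Hypothetical health-check URLs, listed in priority order.
ENDPOINTS = [
    "https://primary.example.com/health",
    "https://secondary.example.net/health",
]


def http_health_check(url: str, timeout: float = 2.0) -> bool:
    """Treat an HTTP 200 response within the timeout as healthy."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status == 200


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool] = http_health_check,
) -> Optional[str]:
    """Return the first endpoint that passes its health check, in priority order."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A check that raises (timeout, DNS error, HTTP error) counts as unhealthy.
            continue
    return None
```

Traffic goes to the primary while it is healthy and moves to the secondary only when the primary's check fails. A result of `None` means no endpoint is available and the application should degrade gracefully rather than retry indefinitely.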

Establish Robust Disaster Recovery Plans

A comprehensive disaster recovery plan defines exactly how your organization will respond when cloud services fail. Essential components include:

Recovery Time Objective (RTO): How quickly you need to restore operations after an outage

Recovery Point Objective (RPO): How much data loss is acceptable in a disaster scenario

Backup strategies: Automated, regularly scheduled backups stored in multiple geographic locations, preferably across different providers

Redundancy architecture: Duplicate systems in different availability zones or regions that can take over seamlessly

Testing protocols: Regular drills and simulations to ensure your recovery procedures actually work when needed
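
The RTO and RPO targets above become useful once they are checked against what a plan can actually deliver. The sketch below assumes a simple backup-and-restore strategy, where the worst-case data loss equals the time between backups; the `RecoveryPlan` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryPlan:
    backup_interval: timedelta    # time between successive backups
    estimated_restore: timedelta  # time needed to restore service from a backup
    rpo: timedelta                # maximum acceptable data loss
    rto: timedelta                # maximum acceptable downtime

    def worst_case_data_loss(self) -> timedelta:
        # A failure just before the next backup loses one full interval of data.
        return self.backup_interval

    def meets_rpo(self) -> bool:
        return self.worst_case_data_loss() <= self.rpo

    def meets_rto(self) -> bool:
        return self.estimated_restore <= self.rto
```

For example, a plan with nightly backups and a two-hour restore meets a four-hour RTO but misses a one-hour RPO, which signals that backups need to run more often or be replaced with continuous replication.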

Mike Chapple’s observation about the AWS recovery process highlights why planning matters: “As engineers roll out fixes across the cloud computing infrastructure, the process could trigger smaller disruptions. It’s similar to what happens after a large-scale power outage: While a city’s power is coming back online, neighborhoods may see intermittent glitches as crews finish the repairs”.

Monitor Performance and Set Up Alerts

Proactive monitoring can detect warning signs before minor issues escalate into full outages. Deploy monitoring tools that continuously track:

System health metrics including CPU usage, memory, disk space, and network connectivity
Service performance indicators such as response times, error rates, and throughput
Security anomalies that might indicate attacks or unauthorized access

Configure automated alerts that notify your team immediately when problems arise, enabling faster response to prevent or minimize outages.
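
As a rough illustration of an automated alert on error rates, the sketch below tracks the most recent request outcomes in a fixed-size window and flags when the share of errors crosses a threshold. The `ErrorRateMonitor` class, the window size, and the 5% threshold are assumptions chosen for the example; real monitoring tools offer far richer rules.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the last `window` request outcomes and flag a high error rate."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self.outcomes = deque(maxlen=window)  # True = error, False = success
        self.threshold = threshold

    def record(self, is_error: bool) -> None:
        # Once the window is full, the oldest outcome is dropped automatically.
        self.outcomes.append(is_error)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        # Wait for a full window so a single early failure does not page anyone.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.error_rate() > self.threshold
```

Each request calls `record(...)` with its outcome, and a scheduler or request hook checks `should_alert()` to decide whether to notify the on-call team.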

Review and Strengthen Service Level Agreements

When selecting cloud providers, scrutinize their Service Level Agreements (SLAs) carefully. Look for:

Clear uptime guarantees of 99.99% or higher
Penalty clauses that provide financial compensation if the provider fails to meet SLA commitments
24/7 support availability for rapid issue resolution
Transparent incident reporting and post-mortem analysis

An SLA guaranteeing 99.99% uptime allows only about 52.6 minutes of downtime per year, while 99.999% uptime permits just 5.26 minutes annually. Higher availability tiers typically cost more, but they can significantly reduce the financial impact of outages.
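
The downtime figures follow directly from the uptime percentage. A minimal calculation, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per year permitted by an uptime guarantee."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


for sla in (99.9, 99.99, 99.999):
    print(f"{sla}% uptime -> {allowed_downtime_minutes(sla):.2f} minutes/year")
```

Each additional nine cuts the allowance by a factor of ten: roughly 526 minutes at 99.9%, 52.6 minutes at 99.99%, and a little over 5 minutes at 99.999%.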

Train Your Team on Cloud Best Practices

Human error causes many cloud misconfigurations and security vulnerabilities. Regular training ensures your team understands:

Cloud security best practices to safeguard against breaches and attacks
Efficient resource management to prevent performance bottlenecks
Incident response procedures for quickly identifying and fixing issues
Role-based access control to minimize accidental mistakes that could cause outages

The Broader Implications: Rethinking Internet Infrastructure

Is Centralization Sustainable?

The October 2025 AWS outage raises fundamental questions about the sustainability of our increasingly centralized internet infrastructure. When thousands of widely used services can go dark because of a DNS failure in a single data center region, the fragility of the system becomes undeniable.

Daniel Ramirez, product director at Downdetector, observed that large-scale outages “are likely becoming slightly more frequent as organizations are encouraged to rely entirely on cloud services and design their data architectures to optimize a specific cloud platform”. This trend toward consolidation may improve efficiency and reduce costs during normal operations, but it amplifies systemic risk during failures.

The internet was originally designed as a decentralized, resilient network capable of routing around damage. However, economic forces have driven consolidation around a few massive cloud providers who can achieve economies of scale. As Davi Ottenheimer, vice president at data infrastructure firm Inrupt, noted: “The reliance on centralized cloud services from major players such as AWS, Microsoft, and Google Cloud, in many respects, enhanced cybersecurity and stability worldwide by establishing a baseline of standards and best practices. However, this standardization presents significant trade-offs, as it creates a single point of failure for many essential services”.

Regulatory Attention and Future Safeguards

Financial regulators have begun expressing concern about the concentration of critical operations with single cloud providers. The October outage, which disrupted trading platforms, banks, and payment systems globally, highlights the systemic risk this dependency creates for the financial sector.

Luke Kehoe, an analyst at Candriam, emphasized the lesson for organizations: “The takeaway here is resilience. Many organizations centralize workloads in a single region. Distributing essential applications and data across multiple regions and availability zones can significantly diminish the impact of future incidents”.

Frequently Asked Questions About the AWS Outage

What caused the Amazon Web Services outage in October 2025?

The outage was caused by a DNS (Domain Name System) resolution failure affecting the DynamoDB API endpoint in AWS’s US-EAST-1 region. This prevented the DNS system from properly translating DynamoDB’s web address into the IP address needed for connections, cutting off applications from their essential data storage.

How long did the AWS outage last?

The outage began at approximately 12:11 AM PDT (3:11 AM ET) on October 20, 2025. Initial mitigations were implemented by 2:24 AM PDT (5:24 AM ET), and the underlying DNS problem was resolved around 6:35 AM ET (3:35 AM PDT). However, full recovery took several additional hours as services worked through backlogs, with some users experiencing issues for 6-8 hours total.

Which services were affected by the AWS outage?

Over 2,500 companies and services were impacted. Major affected platforms included Snapchat, Fortnite, Roblox, Signal, Zoom, Coinbase, Robinhood, Venmo, Reddit, Canva, the McDonald’s app, the Canvas education platform, and Amazon’s own services including Alexa, Ring, and Kindle. Banking services at Lloyds, Halifax, and Bank of Scotland also experienced disruptions.

How much money was lost due to the AWS outage?

According to Tenscope estimates, major websites collectively lost approximately $75 million per hour during the outage, with Amazon alone accounting for $72.8 million per hour. Experts suggest the total economic impact could reach hundreds of millions of dollars when accounting for lost productivity, suspended transactions, and business disruptions across all affected companies.

How can businesses protect themselves from cloud outages?

Organizations can reduce vulnerability by implementing multi-cloud strategies that distribute workloads across multiple providers, establishing comprehensive disaster recovery plans with clear RTO and RPO targets, deploying continuous monitoring and alerting systems, backing up critical data to multiple geographic locations, and reviewing SLAs to ensure adequate uptime guarantees and compensation clauses.

How common are AWS outages?

While AWS generally maintains high reliability, significant outages occur several times per decade. Major disruptions happened in 2017 (S3 outage), 2020 (Kinesis impairment), 2021 (network device failure lasting 5+ hours), 2023 (Lambda issues), and 2025 (DNS/DynamoDB failure). Daniel Ramirez from Downdetector noted that large-scale outages “where a foundational internet service disrupts many online services, only occurs a few times a year”.

Why is the US-EAST-1 region so important?

US-EAST-1 in Northern Virginia is AWS’s original and largest data center region. It serves as the default region for many services and hosts critical infrastructure supporting operations globally. Many global AWS services and features depend on US-EAST-1 endpoints, which means failures in this region often have cascading effects worldwide.

What is DynamoDB and why did its failure cause such widespread problems?

DynamoDB is Amazon’s fully managed, serverless NoSQL database service that delivers single-digit millisecond performance at massive scale. It’s a “foundational service” that many other AWS offerings depend on for data storage, user authentication, and real-time functionality. When applications couldn’t access DynamoDB due to the DNS failure, cascading problems affected 64 internal AWS services.
