UPDATED WITH ROOT CAUSE: A single Availability Zone in the EU-CENTRAL-1 region of Amazon Web Services has experienced a major incident.
The company's status page says the incident started at 1:24 p.m. PDT (8:24 p.m. UTC) on June 10 and initially caused "connectivity issues for some EC2 instances."
Half an hour later, AWS reported "increased API error rates and latencies for EC2 APIs and connectivity issues for instances… caused by an increase in ambient temperature in a subsection of the affected Availability Zone."
At 2:36 p.m. PDT, AWS said temperatures were dropping, but network connectivity was still down.
But an hour later, the cloud colossus posted the following rather disturbing update:
An update at 4:12 p.m. reported that staff were still unable to enter the site for safety reasons.
At 4:33 p.m. network services were restored, which AWS said should lead to rapid recovery of EC2 instances. A 5:19 p.m. update stated that "environmental conditions in the affected Availability Zone are now back to normal" and informed users that "the vast majority of affected EC2 instances have now fully recovered, but we are continuing to work on certain EBS volumes which continue to experience degraded performance."
Kinesis Data Streams, Kinesis Firehose, Amazon Relational Database Service, and AWS CloudFormation also faltered.
The latest AWS status update concluded, "We will provide more details on the root cause in a later article, but we can confirm that there was no fire at the facility."
Which leaves the question: what made the data center too dangerous to enter?
Although we lack evidence on which to base a claim, The Register has previously reported on UPS blowouts and small puffs of smoke leading to the release of hypoxic gases in data centers.
The goal of releasing hypoxic gas in data centers is to deprive fires of oxygen. And since humans also need oxygen, it may take some time before engineers can return to a data center after a release.
The Register mentions this because it fits the facts of this incident, and AWS's language about "environmental conditions" preventing entry.
We will update this story if new information about this incident reaches us.
UPDATE 2:45 UTC June 11: AWS updated its incident report (and, most importantly, proved that our analysis was correct) by revealing that the incident was caused by "the failure of a control system which disabled multiple air handlers in the affected Availability Zone."
Air handling units cool the data center. So after they stopped working, "the ambient temperatures started to rise" to dangerous levels, and AWS's server and networking kit shut itself down.
“Unfortunately, because this issue impacted multiple redundant network switches, more EC2 instances in this single Availability Zone lost network connectivity,” the update adds.
And now for the more alarming part:
“While our operators could normally have restored pre-impact cooling, a fire suppression system activated in a section of the affected Availability Zone.
“When this system activates, the data center is evacuated and sealed off, and a chemical is dispersed to remove oxygen from the air to extinguish any fires.”
AWS staff had to wait for local firefighters to arrive and certify that the building was safe. Once that approval was obtained, AWS said, "the building must be re-oxygenated before engineers can safely enter the facility and restore the network equipment and affected servers."
Safe working conditions have since been restored, as have most equipment and services.
But it looks like some kit was damaged, as AWS stated: "A very small number of remaining instances and volumes that were affected by rising ambient temperatures and loss of power remain unresolved."
The cloud giant also let customers know that the fire suppression system that activated remains disabled. ®