We sincerely apologise for the loss of service to our users in DE-FRA1 that occurred on Saturday, the 29th of February 2020, at 11:33 UTC. We also apologise for the delay in publishing this postmortem; it took some time to gather and assess all the information related to the incident.
All affected cloud servers have been brought back online and all services are operating normally. Although there was an interruption of service, at no point was there any risk of data loss for our users.
While we realise this postmortem will not excuse the loss of service, we hope it explains what led to the incident and what was done to resolve it.
On the 29th of February at 9:58 UTC, the facility housing our DE-FRA1 data centre experienced a short power disruption in the local power grid. The disruption was caused by a fault at the local grid provider in the eastern part of Frankfurt.
Backup battery power buffered the short fluctuation as intended until the grid power stabilised and the data centre switched back to the main supply. However, the power fluctuation caused a fault in the cooling system, which allowed temperatures inside the data centre to rise.
During a technical review following the power fluctuation, the data centre facility operators found that the cooling control system had not restarted as expected. The cooling was first turned back on manually, which halted the rise in temperatures and began cooling the server rooms back down. The cooling vendor was called on site for immediate emergency repairs, after which the cooling control system was switched back to automatic operation.
Although cooling had been re-enabled, temperatures within the data centre had already exceeded safe levels. The elevated temperature reduced the effectiveness of the server hardware cooling and caused systems to shut themselves down to protect against overheating.
Once temperatures returned to safe operating levels, we immediately began restoring all services. After all cloud servers had been restarted and services had returned to normal operating status, we continued to monitor closely until we were confident enough to declare the incident resolved.
We are truly sorry for the loss of service.
Timeline of events:
10:58: Loss of grid supply was detected at the data centre facility
11:05: First check on the systems by on-site personnel
11:13: Emergency on-call informed and requested to come to the site
11:33: We identified an issue within our services affecting multiple nodes
11:35: Arrival of several site teams
11:40: Technicians attending the site performed checks on all systems
12:00: Rising temperatures detected
12:17: Cooling vendor called for on-site support
12:30: Cooling switched to manual mode by on-site personnel – temperatures decreasing
13:04: Temperatures returned to the normal range
13:16: We finished restoring all services and continued monitoring the results
15:00: Arrival of the cooling vendor on site
15:00: The incident was marked resolved
15:20: Repair of the cooling control system and switchback to automatic operation
15:30: No further deviations detected on-site and all systems back to normal operation