[DE-FRA1]: Compute nodes issue
Incident Report for UpCloud
Postmortem

DE-FRA1 Postmortem 2020-02-29

We sincerely apologise for the loss of service to our users in DE-FRA1 that occurred on Saturday 29th of February 2020 at 11:33 UTC. We also apologise for the delay in publishing this postmortem as it took some time to gather and assess all the information related to the incident.

All affected cloud servers have been brought back online and all services are operating normally. Although there was an interruption of service, at no point was there any risk of data loss for our users.

While we realise this postmortem will not excuse the loss of service, we hope it explains what led to the incident and what was done to resolve it.

On the 29th of February at 10:58 UTC the facility housing our DE-FRA1 data centre experienced a short power disruption in the local power grid. The disruption was caused by a fault at the local grid provider in the eastern part of Frankfurt.

Backup battery power delivery buffered the short fluctuation as intended until the grid power stabilised and the data centre switched back to the main supply. However, the power fluctuation caused a fault in the cooling system which allowed temperatures inside the data centre to rise.

During a technical review following the power fluctuations, the data centre facility operators detected that the cooling control system had not restarted as expected. The cooling was first turned back on manually, which halted the rise in temperatures and began cooling the server rooms. The cooling vendor was called on-site for immediate emergency repairs, and the cooling control system was switched back to automatic operation.

Despite the cooling having been re-enabled, the temperatures within the data centre had already exceeded safe levels. The elevated ambient temperature reduced the effectiveness of the server hardware cooling and caused systems to shut down to protect themselves from overheating.

Once the temperatures returned to within safe operating levels, we immediately began restoring all services. After all cloud servers had been restarted and services had returned to normal operating status, we continued to closely monitor the results until we were confident in declaring the incident resolved.

We are truly sorry for the loss of service.

Summary of the timeline

10:58: Loss of grid supply was detected at the data centre facility

11:05: First check on the systems by on-site personnel

11:13: Emergency on-call informed and requested to come to the site

11:33: We identified an issue within our services affecting multiple nodes

11:35: Arrival of several site teams

11:40: Technicians attending the site performed checks on all systems

12:00: Rising temperatures detected

12:17: Cooling vendor called for on-site support

12:30: Cooling switched to manual mode by on-site personnel – temperatures decreasing

13:04: Temperatures returned to the normal range

13:16: We finished restoring all services and continued monitoring the results

15:00: Arrival of the cooling vendor on site

15:00: The incident was marked resolved

15:20: Repair of the cooling control system completed and switchback to automatic operation

15:30: No further deviations detected on-site and all systems back to normal operation

Posted Mar 10, 2020 - 11:03 UTC

Resolved
This incident has been resolved.
Posted Feb 29, 2020 - 15:00 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 29, 2020 - 13:16 UTC
Update
We are continuing to work on a fix for this issue.
Posted Feb 29, 2020 - 11:54 UTC
Identified
We have identified an issue with our Frankfurt data centre which is affecting multiple nodes. The problem may cause affected cloud servers to become slow or unresponsive.
We are working to bring all affected servers back online and resolve the situation as soon as possible.
Posted Feb 29, 2020 - 11:33 UTC
This incident affected: DE-FRA1 (DE-FRA1: Virtualization Hosts).