[FI-HEL1]: Partial outage due to power feed disruption

Incident Report for UpCloud

Postmortem

Summary of the FI-HEL1 incident on November 6th, 2019

We sincerely apologise for the loss of service some of our users in FI-HEL1 had to endure today, November 6th 2019, starting at 10:35 (UTC+0).

All affected cloud servers have been brought back online and all services are operating normally. Although there was an interruption of service, at no point was there any risk of data loss for our users.

While we realise this post-mortem will not excuse the loss of service, we hope it explains our actions and what we will do to avoid such issues in the future.

On Wednesday 6th of November 2019 at 10:35 UTC, we became aware of a partial outage in our FI-HEL1 data centre. The initial signs indicated an issue with networking and part of the storage backend, the latter of which was causing the affected cloud servers to report storage in a read-only state or become unresponsive.

While investigating the incident, we soon identified the root cause as a partial power feed failure at the data centre facility.

Shortly after the incident began, the data centre operator's staff reported that the secondary power feed had lost power during a routine equipment test. We were informed that all power loads had been transferred to the main feed and the situation was being stabilised.

The power outage was caused by human error on the part of our data centre provider's staff during power feed testing. It lasted approximately 9 minutes and affected all data centre customers on the same power feed, including part of our storage infrastructure.

Once the data centre operator had restored power, we immediately began the process of bringing everything back into operation.

During the power outage, we identified a specific storage network and a small number of compute hosts that had faulty power redundancy and were therefore powered down during the outage. This unfortunately also affected the cloud servers of a small portion of our FI-HEL1 users.

As such, affected cloud servers could have experienced operating system level issues caused by the momentary loss of connectivity to the storage backend.

We recommend that anyone still experiencing problems, such as a server not responding or showing operating system or storage errors, issue a forced shutdown command via the UpCloud control panel or API and then start the server again.
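For users who prefer the API, the forced shutdown and restart could be sketched roughly as follows. This is an illustration only, not an official client: the base URL, the API version prefix, the `/server/{uuid}/stop` and `/server/{uuid}/start` paths, and the `"hard"` stop type are assumptions recalled from the public UpCloud API and should be checked against the current API documentation. The helper names are hypothetical. The code only builds the requests and does not send anything.

```python
import base64
import json
import urllib.request

# Assumption: base URL and API version prefix; verify against the API docs.
API_BASE = "https://api.upcloud.com/1.3"


def _request(path, username, password, body=None):
    """Build (but do not send) an authenticated POST request."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    data = json.dumps(body).encode() if body is not None else None
    return urllib.request.Request(
        f"{API_BASE}{path}",
        data=data,
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def hard_stop_request(server_uuid, username, password):
    """Request for a forced shutdown of one cloud server."""
    # Assumption: the "hard" stop type corresponds to a forced shutdown.
    body = {"stop_server": {"stop_type": "hard"}}
    return _request(f"/server/{server_uuid}/stop", username, password, body)


def start_request(server_uuid, username, password):
    """Request to start the same cloud server again."""
    return _request(f"/server/{server_uuid}/start", username, password)
```

Each request could then be sent with `urllib.request.urlopen(...)`, waiting until the server reports a stopped state before sending the start request.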

To avoid a similar situation in the future, we have begun to evaluate the power feed distribution infrastructure in all of our data centres.

Summary of the timeline (UTC+0):

10:35: We become aware of a partial outage affecting FI-HEL1 users.

10:41: The first public status update is sent to our users. The whole operations team is alerted to attend and investigate the outage.

10:52: A second status update is sent to our users, acknowledging the outage and attributing it to a faulty power feed.

11:23: A third status update is sent to our users, informing them that we are in the process of getting the affected cloud servers back online.

11:47: A fourth status update is sent to our users, informing them that all affected cloud servers have been brought back into operation. Further guidance for users on how to restart the cloud servers themselves is also posted.

Again, we are truly sorry for the loss of service.

Team UpCloud

Posted Nov 06, 2019 - 19:03 UTC

Resolved

This incident has been resolved.

Posted Nov 06, 2019 - 14:14 UTC

Monitoring

Virtual servers might have guest OS level issues with their storage due to storage network convergence after the power distribution fault. If your server is not responding or shows operating system or storage errors, please issue a forced shutdown via the UpCloud control panel or API and start it again.

Posted Nov 06, 2019 - 13:09 UTC

Update

We have identified a small fraction of compute hosts whose power redundancy failed, and we are bringing the affected virtual servers back online.
In addition, some other virtual servers might have guest OS level issues with their storage due to storage network convergence after the power distribution fault. If your server is not responding or shows operating system or storage errors, please issue a forced shutdown via the UpCloud control panel or API and start it again.

Posted Nov 06, 2019 - 11:47 UTC

Update

The DC operator has restored electricity and we are in the process of bringing everything back into operation.

Posted Nov 06, 2019 - 11:23 UTC

Update

We've identified the root cause of our storage issue to be a partial electricity failure.

Posted Nov 06, 2019 - 10:52 UTC

Identified

We have identified issues in our FI-HEL1 data centre. Storage on the affected cloud servers may have entered a read-only state, or the servers may have become unresponsive.
We are working to restore the services and resolve the situation as soon as possible.

Posted Nov 06, 2019 - 10:41 UTC

This incident affected: FI-HEL1 - Finland (FI-HEL1: Network Connections, FI-HEL1: Storage Backends).