We sincerely apologise for the loss of service some of our users in FI-HEL1 had to endure today, November 6th 2019, starting at 10:35 (UTC+0).
All affected cloud servers have been brought back online and all services are operating normally. Although there was an interruption of service, at no point was there any risk of data loss for our users.
While we realise this post mortem will not excuse the loss of service, we hope it explains our actions and what we will do to avoid such issues in the future.
On Wednesday 6th of November 2019 at 10:35 UTC, we became aware of a partial outage in our FI-HEL1 data centre. The initial signs indicated an issue with networking and part of the storage backend, the latter of which was causing the affected cloud servers to report storage in a read-only state or become unresponsive.
While investigating the incident, we soon identified the root cause as a partial power feed failure at the data centre facility.
Shortly after the incident began, the data centre operator's staff reported that the secondary power feed had lost power during a routine equipment test. We were informed that all power loads had been transferred to the main feed and the situation was being stabilised.
The power outage was caused by human error on the part of our data centre provider’s staff during power feed testing. It lasted approximately 9 minutes and affected all data centre customers on the same power feed, including part of our storage infrastructure.
Once the data centre operator had restored power, we immediately began bringing everything back into operation.
During the power outage, we identified a specific storage network and a small number of compute hosts that had faulty power redundancy and were therefore powered down during the outage. This unfortunately also affected a small portion of our FI-HEL1 users' cloud servers.
As a result, affected cloud servers may have experienced operating system level issues caused by the momentary loss of connectivity to the storage backend.
We recommend that anyone still experiencing problems, such as a server not responding or showing operating system or storage errors, issue a forced shutdown command via the UpCloud control panel or API and then start the server again.
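For users who prefer the API, the forced shutdown and restart could look roughly like the sketch below. It only builds the two requests and does not send them. The endpoint paths, the `1.2` API version, the `stop_type` value `hard` and HTTP Basic authentication are assumptions that should be checked against the current UpCloud API documentation. The helper names are hypothetical and used only for illustration.

```python
import base64
import json
import urllib.request

# Assumed base URL and API version; verify against the current API docs.
API_BASE = "https://api.upcloud.com/1.2"


def _post_request(path, username, password, payload=None):
    """Build (but do not send) a POST request with HTTP Basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(
        API_BASE + path,
        data=data,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


def build_hard_stop(server_uuid, username, password):
    """Request for a forced (hard) shutdown of one cloud server."""
    # Assumed request body shape for a forced stop.
    payload = {"stop_server": {"stop_type": "hard"}}
    return _post_request(f"/server/{server_uuid}/stop", username, password, payload)


def build_start(server_uuid, username, password):
    """Request that starts the server again once it has stopped."""
    return _post_request(f"/server/{server_uuid}/start", username, password)
```

Each request can be sent with `urllib.request.urlopen(request)`. The start request should only be sent once the server reports that it has stopped.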
To avoid a similar situation in the future, we have begun to evaluate the power feed distribution infrastructure in all of our data centres.
Summary of the timeline (UTC+0):
10:35: We become aware of a partial outage affecting FI-HEL1 users.
10:41: The first public status update is sent to our users. The whole operations team is alerted to attend and investigate the outage.
10:52: A second status update is sent to our users, acknowledging the outage and reporting that it has been isolated to a faulty power feed.
11:23: A third status update is sent to our users, informing them that we are in the process of getting the affected cloud servers back online.
11:47: A fourth status update is sent to our users, informing them that all affected cloud servers have been brought back into operation. Further guidance for users on how to restart the cloud servers themselves is also posted.
Again, we are truly sorry for the loss of service.
Team UpCloud