Significant user impact in FI-HEL1 and US-SJO1.

Incident Report for UpCloud

Postmortem

Summary of the FI-HEL1 and US-SJO1 incident on April 10th, 2019

We sincerely apologise for the loss of service that some of our users in FI-HEL1 and US-SJO1 had to endure yesterday, April 10th, 2019, starting at 16:35 (UTC+0).

All affected cloud servers have been brought back online and all services are operating normally. Although there was an interruption of service, at no point was there any risk of data loss for our users. We are currently helping our users with possible secondary issues caused by the restart of their cloud servers.

While we realise this post mortem will not excuse us for the loss of service, we hope it explains our actions and what we will do to avoid such issues in the future.

The UpCloud Operations team was debugging an issue with a faulty compute node and was operating on the configuration database. At 16:30 (UTC+0), a database query that was intended to affect a single compute node was entered incorrectly and affected multiple compute nodes. The erroneous command was recognised instantly, and our operations team engineer halted its execution manually. By this time, the change in the configuration database had already caused the automation system to perform unwanted operations on our production servers. This detached storages from the affected cloud servers, rendering them non-operational. The overall impact was limited to 11% of cloud servers, even though this affected our largest data centre, FI-HEL1, and our newest data centre, US-SJO1.

Because a significant portion of storages had been detached from the affected cloud servers, recovery required a full restart of the compute nodes. At 18:49 (UTC+0), we initiated the restart process for the compute nodes running the affected cloud servers, which in some cases also affected operational cloud servers.

UpCloud’s systems are designed to support mass operations that modify, add or remove capacity and services to support our users' growth. While this is an operational procedure which we have relied on to operate our systems, we did not have the necessary safety checks in place to limit the impact of human error.
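As an illustration of the kind of safety check described above, the following is a minimal sketch of a guard that commits a configuration change only if it touched the expected number of rows, and rolls it back otherwise. It is not UpCloud's actual tooling: the `compute_nodes` table, its columns, the node names and the `guarded_update` helper are all hypothetical, and SQLite is used only so that the example is self-contained.

```python
import sqlite3


class TooManyRowsAffected(Exception):
    """Raised when a statement touches a different number of rows than expected."""


def guarded_update(conn, sql, params, expected_rows=1):
    """Run one UPDATE/DELETE and commit only if it affected `expected_rows` rows.

    Any other row count is treated as operator error and rolled back.
    """
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        if cur.rowcount != expected_rows:
            raise TooManyRowsAffected(
                f"expected {expected_rows} row(s), statement affected {cur.rowcount}"
            )
        conn.commit()
        return cur.rowcount
    except Exception:
        conn.rollback()  # leave the database exactly as it was
        raise


def make_demo_db():
    """Build a hypothetical in-memory configuration database with three nodes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compute_nodes (name TEXT PRIMARY KEY, zone TEXT, state TEXT)"
    )
    conn.executemany(
        "INSERT INTO compute_nodes VALUES (?, ?, ?)",
        [
            ("node-a", "fi-hel1", "active"),
            ("node-b", "fi-hel1", "active"),
            ("node-c", "us-sjo1", "active"),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    # Intended change: exactly one node is selected by name.
    guarded_update(
        conn,
        "UPDATE compute_nodes SET state = ? WHERE name = ?",
        ("maintenance", "node-a"),
    )
    # Mistyped change: the WHERE clause matches several nodes and is rolled back.
    try:
        guarded_update(
            conn,
            "UPDATE compute_nodes SET state = ? WHERE state = ?",
            ("maintenance", "active"),
        )
    except TooManyRowsAffected as exc:
        print("rejected:", exc)
```

The design choice in this sketch is that the operator states the expected scope up front, so a query that matches more nodes than intended fails before any downstream automation can react to it.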

Immediately following the incident, we are making at least the following changes to our automation system and procedures. We will also analyse this incident further over time and make additional improvements to best avoid similar incidents in the future.

These changes include reducing the impact of human error and speeding up recovery from a severe outage. While we recognise that the possibility of human error is unavoidable, we will develop safeguards to confine the failure domain of such an incident. Continuous work has been done to make cloud server recovery and startup faster. These improvements will now be prioritised to further shorten the time required to bring entire compute nodes back into operation.
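One way to picture confining the failure domain is a pre-flight check in the automation layer that refuses any batch of operations spanning more compute nodes or zones than the operator explicitly allowed. The sketch below is an assumption made for illustration only: `PlannedOperation`, `check_failure_domain`, the limits and the node names are hypothetical and do not describe UpCloud's automation system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedOperation:
    """One action the automation intends to run (hypothetical structure)."""
    node: str
    zone: str
    action: str


class FailureDomainExceeded(Exception):
    """Raised when a batch would touch more nodes or zones than allowed."""


def check_failure_domain(operations, max_nodes=1, max_zones=1):
    """Return the batch unchanged if it stays within the allowed limits.

    Raises FailureDomainExceeded before anything is executed otherwise.
    """
    nodes = {op.node for op in operations}
    zones = {op.zone for op in operations}
    if len(nodes) > max_nodes or len(zones) > max_zones:
        raise FailureDomainExceeded(
            f"batch touches {len(nodes)} node(s) in {len(zones)} zone(s); "
            f"limit is {max_nodes} node(s) in {max_zones} zone(s)"
        )
    return operations


if __name__ == "__main__":
    single = [PlannedOperation("node-a", "fi-hel1", "detach-storage")]
    check_failure_domain(single)  # within limits, passes through

    wide = [
        PlannedOperation("node-a", "fi-hel1", "detach-storage"),
        PlannedOperation("node-c", "us-sjo1", "detach-storage"),
    ]
    try:
        check_failure_domain(wide)
    except FailureDomainExceeded as exc:
        print("rejected:", exc)
```

Legitimate mass operations would raise the limits deliberately for a given run, which keeps wide changes possible while making them an explicit decision.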

Users are eligible for compensation based on the service level agreement. Please reach out to our customer support if you need assistance.

Summary of the timeline (UTC+0):

16:35: An operations team engineer executes an incorrect command, affecting multiple compute nodes in FI-HEL1 and US-SJO1. The operations team engineer recognises the incorrect command immediately and halts the execution manually.

16:42: The first public status update is sent to our users. The whole operations team is alerted to attend the recovery process of affected cloud servers.

16:54: A second status update is sent to users, acknowledging the outage.

17:37: A third status update is sent to users, informing them that the service recovery is underway.

18:28: The decision is made to initiate a restart process for the affected compute nodes, following an investigation of the outage and the affected cloud servers and careful validation of the recovery options for bringing the affected cloud servers online as fast as possible.

18:49: A fourth status update is sent to users, informing them that a root cause has been identified and that we are in the process of getting the affected cloud servers back online.

19:37: A fifth status update is sent to users, informing them that the service recovery is underway.

21:09: A sixth status update is sent to users. The majority of affected cloud servers in US-SJO1 have been recovered into operational status. 40% of all affected cloud servers have been brought into operational status. Further guidance for users on how to restart the cloud servers themselves is also posted.

01:01: A seventh status update is sent to users. 70% of all affected cloud servers have been brought into operational status. The API is also confirmed to be fully functional in all zones.

02:35: An eighth status update is sent to users. 95% of all affected cloud servers have been brought into operational status. Work continues to start the remaining cloud servers.

03:42: The ninth and final update is sent to users. All affected cloud servers have been brought into operational status.

Again, we are truly sorry for the loss of service.

Joel Pihlajamaa

Chief Technology Officer, Founder

Posted Apr 11, 2019 - 09:08 UTC

Resolved

All affected cloud servers have been brought back online. We sincerely apologise for the long recovery period and impact this has had on our users. Please reach out to support if you have any further questions.

We will be updating users with a post mortem on the incident.

Posted Apr 11, 2019 - 03:42 UTC

Update

We have brought back 95% of the affected cloud servers in FI-HEL1 and are continuing work on the remaining servers. Do reach out to support if you have any questions.

Posted Apr 11, 2019 - 02:35 UTC

Update

The API is fully operational now.
We are continuing to roll out fixes for FI-HEL1. Do reach out to support if you have any questions.

Posted Apr 11, 2019 - 01:01 UTC

Update

We have brought almost all of the affected cloud servers in US-SJO1 back into operational status. In total, we have recovered almost 40% of affected cloud servers in both locations. We understand the significant impact this has for our users and are doing our utmost to bring back all cloud servers as soon as possible.

If cloud servers which are operational (green status) in the control panel continue to face issues, we advise our users to log in through VNC or console access to resolve any potential OS-level problems caused by the sudden error state.

We will continue to communicate the progress here.

Posted Apr 10, 2019 - 21:09 UTC

Update

We are progressing with the recovery efforts. Many of the affected cloud servers have already been restarted. We continue relentlessly with the work and will keep you updated here. You are always welcome to reach out to our support in case you require assistance.

Posted Apr 10, 2019 - 19:37 UTC

Identified

The root cause has been identified and we are working to get the affected cloud servers online.

To make sure we get all affected cloud servers back online as soon as possible, we are required to restart the affected host machines. To speed up recovery of the infrastructure, this might in some cases also affect operational cloud servers.

Posted Apr 10, 2019 - 18:49 UTC

Update

We are still working on the issue. Affected customer servers in FI-HEL1 and US-SJO1 will be restarted and returned to a running state.

Posted Apr 10, 2019 - 17:37 UTC

Update

The incident has caused some VMs to stop temporarily. We are working to restore these to working condition.

Posted Apr 10, 2019 - 16:54 UTC

Monitoring

We are aware of issues with the API and deployments in all zones and are currently working on fixing them. We will post an update as soon as they are resolved.

Posted Apr 10, 2019 - 16:42 UTC

This incident affected: FI-HEL1 - Finland (FI-HEL1: Virtualization Hosts), US-SJO1 - USA (US-SJO1: Virtualization Hosts), and API.