Summary of the FI-HEL1 and US-SJO1 incident on April 10th, 2019
We sincerely apologise for the loss of service some of our users in FI-HEL1 and US-SJO1 had to endure yesterday, April 10th 2019, starting at 16:35 (UTC+0).
All affected cloud servers have been brought back online and all services are operating normally. Although there was an interruption of service, at no point was there any risk of data loss for our users. We are currently helping our users with possible secondary issues caused by the restart of their cloud servers.
While we realise this post mortem will not excuse us for the loss of service, we hope it explains our actions and what we will do to avoid such issues in the future.
The UpCloud Operations team was debugging an issue with a faulty compute node and was operating on the configuration database. At 16:30 (UTC+0), a database query that was intended to affect a single compute node was entered incorrectly and affected multiple compute nodes. The erroneous command was recognised instantly and its execution was halted manually by our operations team engineer. By this time, the change in the configuration database had already caused the automation system to perform unwanted operations on our production servers. This resulted in storages being detached from the affected cloud servers, rendering them non-operational. The overall impact was limited to 11% of cloud servers, even though the incident affected our largest data centre, FI-HEL1, and our newest data centre, US-SJO1.
Because a significant portion of storages had been detached from the affected cloud servers, recovery required a full restart of the compute nodes. At 18:49 (UTC+0), we initiated the restart process for the compute nodes that were running the affected cloud servers, which in some cases also affected cloud servers that were still operational.
UpCloud’s systems are designed to support mass operations that modify, add or remove capacity and services to support our users' growth. While this is an operational procedure which we have relied on to operate our systems, we did not have the necessary safety checks in place to limit the impact of human error.
Immediately following the incident, we are making at least the following changes to our automation system and procedures. We will also further analyse this incident with time and make additional improvements to best avoid similar incidents in the future.
These changes will include reducing the impact of human error and speeding up the recovery from a severe outage. While we recognise that the possibility of human error is unavoidable, we will develop safeguards to confine the failure domain for such an incident. Continuous work has been done to make cloud server recovery and startup faster. These improvements will now be prioritised to further shorten the time required to bring entire compute nodes back into operation.
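As an illustration only, and not a description of UpCloud's actual tooling, one common safeguard of this kind is to have the operator declare how many rows a change is expected to touch, and to roll the change back if the database reports more. The minimal sketch below uses Python's standard sqlite3 module. The helper guarded_execute, the exception BlastRadiusExceeded and the compute_nodes table used in the usage note are hypothetical names chosen for this example.

```python
import sqlite3


class BlastRadiusExceeded(Exception):
    """Raised when a statement touches more rows than the operator declared."""


def guarded_execute(conn, sql, params=(), max_rows=1):
    """Run a modifying statement and commit only if it touched at most max_rows rows.

    Hypothetical helper: on any failure, including an exceeded limit,
    the open transaction is rolled back and the exception is re-raised.
    """
    cur = conn.cursor()
    try:
        # sqlite3 opens a transaction implicitly before INSERT/UPDATE/DELETE
        # under its default transaction handling.
        cur.execute(sql, params)
        if cur.rowcount > max_rows:
            raise BlastRadiusExceeded(
                f"statement affected {cur.rowcount} rows, limit is {max_rows}"
            )
        conn.commit()
        return cur.rowcount
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

With a hypothetical compute_nodes table, a call such as guarded_execute(conn, "UPDATE compute_nodes SET state = 'maintenance' WHERE id = ?", (1,)) commits, whereas the same update with a WHERE clause that matches several nodes raises BlastRadiusExceeded and leaves the table unchanged. A check like this only confines damage if downstream automation acts on committed data alone.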
Users are eligible for compensation based on the service level agreement. Please reach out to our customer support if you need assistance.
Summary of the timeline (UTC+0):
16:35: An operations team engineer executes an incorrect command, affecting multiple compute nodes in FI-HEL1 and US-SJO1. The operations team engineer recognises the incorrect command immediately and halts the execution manually.
16:42: The first public status update is sent to our users. The whole operations team is alerted to attend the recovery process of affected cloud servers.
16:54: A second status update is sent to users, acknowledging the outage.
17:37: A third status update is sent to users, informing that the service recovery is underway.
18:28: The decision is made to initiate a restart process for the affected compute nodes. This decision follows an investigation of the outage and the affected cloud servers, and careful validation of the recovery options, in order to bring the affected cloud servers online as fast as possible.
18:49: A fourth status update is sent to users, informing that a root cause has been identified, and that we are in the process of getting the affected cloud servers back online.
19:37: A fifth status update is sent to users, informing that the service recovery is underway.
21:09: A sixth status update is sent to users. The majority of affected cloud servers in US-SJO1 have been recovered into operational status. 40% of all affected cloud servers have been brought into operational status. Further guidance for users on how to restart the cloud servers themselves is also posted.
01:01: A seventh status update is sent to users. 70% of all affected cloud servers have been brought into operational status. The API is also confirmed to be fully functional in all zones.
02:35: An eighth status update is sent to users. 95% of all affected cloud servers have been brought into operational status. Work continues to start the remaining cloud servers.
03:42: The ninth and final update is sent to users. All affected cloud servers have been brought into operational status.
Again, we are truly sorry for the loss of service.
Chief Technology Officer, Founder