FI-HEL1: Issue with a HDD backend
Incident Report for UpCloud
Resolved
This incident has been resolved.
Posted over 1 year ago. Feb 26, 2016 - 19:38 UTC
Monitoring
In this write-up, we explain the incident in more detail and describe how we went about restoring service. At the time of publishing, we are still monitoring the situation and making sure all affected customer cloud servers are back online.


Overview of our infrastructure
As is generally known, our infrastructure is built so that customers’ storage resources are separated from the compute resources by an InfiniBand fabric. This setup has numerous benefits, for example the ability to store customer data on two physically separate storage backends for improved redundancy. The fabric also allows storage resources to be accessed from any compute host in the same datacenter.

This setup is extremely valuable because customers can freely scale their resources up or down without the storage resources having to be moved to a new compute host. The storage always remains accessible inside the same datacenter from whichever compute host offers the best performance for the new set of resources.

Overview of the HDD incident
Today at around 7.00 EET we saw an abnormal number of signals pointing to an error on one of our HDD backends. The server itself was fully functional and, at first, just a handful of the storage resources residing on that backend were found to be faulty. These in turn affected resources on a number of compute hosts.

The situation degraded when a kernel module that runs on the compute hosts and manages the InfiniBand fabric traffic failed to cope with the new situation. Again, the great majority of the storage resources on the backend were fully functional, so the issue was not very widespread. The kernel module ended up in a so-called deadlock, which can only be resolved by rebooting the server.

As explained earlier, a single storage backend can hold storage resources that are used by cloud servers on multiple compute hosts. The deadlock, combined with this great flexibility of resources, turned against us: the incident eventually affected about 90% of our compute hosts, but only 3.5% of our customers’ cloud servers in our Helsinki datacenter. Had we resolved the situation in the traditional manner, we would have had to restart all of these physical compute hosts and thus disrupt the operations of thousands of perfectly functional cloud servers.
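To illustrate why a small share of faulty servers can touch most compute hosts, here is a minimal sketch in Python. The fleet size, the placement of servers on hosts and the function name `blast_radius` are all hypothetical and chosen only to reproduce the 90% and 3.5% proportions mentioned above. They are not our actual figures or tooling.

```python
from collections import defaultdict


def blast_radius(placement, faulty):
    """Return (hosts touched, healthy servers sharing those hosts).

    placement: dict mapping server id -> compute host id
    faulty:    set of server ids whose storage is faulty
    """
    servers_by_host = defaultdict(set)
    for server, host in placement.items():
        servers_by_host[host].add(server)

    # A host is touched as soon as one faulty server runs on it.
    touched_hosts = {placement[s] for s in faulty}

    # Rebooting every touched host would also stop these healthy servers.
    collateral = sum(len(servers_by_host[h] - faulty) for h in touched_hosts)
    return len(touched_hosts), collateral


# Hypothetical fleet: 100 compute hosts with 40 cloud servers each.
placement = {server: server % 100 for server in range(4000)}

# Hypothetical fault set: 140 servers (3.5%) spread over 90 hosts.
faulty = set([s for s in range(4000) if s % 100 < 90][:140])

hosts_touched, collateral = blast_radius(placement, faulty)
print(hosts_touched, len(faulty), collateral)  # prints "90 140 3460"
```

In this made-up fleet, rebooting the 90 touched hosts would interrupt 3460 healthy servers in order to fix 140 faulty ones, which is the trade-off described above.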

Building a solution
During the day, our infrastructure and development teams worked on a way to resolve the situation that would let us isolate the fully functional cloud servers of most of our customers while still restoring operations for the affected customers.

We finally had a working piece of automation in place by 14.30 EET and began to clean up the situation. As we send this, some final manual clean-up operations are still under way to return all affected cloud servers to a fully functional state.

Going forward and our apologies
We are terribly sorry and would like to apologise wholeheartedly for the extended loss of service. Choosing not to take the easy way, which would have disrupted the fully operational cloud servers on 90% of the compute hosts, forced us to come up with a more sophisticated solution, and that caused all the extra work. However, we do believe that the unaffected customers value us not disrupting their operations.

We would also like to highlight that data integrity is, and always will be, one of our top priorities. We do not want to rush into quick fixes that endanger data integrity. Today’s fix followed that philosophy: it was the option with the least risk to data integrity, and we are happy to state that the solution we put in place does not cause any data loss.

It is still not completely clear why the deadlock occurred. We will carry out a proper analysis of the hardware and software components involved. Operations are currently served through the redundant pair and thus no further customer impact is expected.

We will also analyse why the kernel module in our Linux version caused the deadlock situation. The module is included in the default Linux kernel and thus is not developed by our staff. We will try to resolve this with the original developer and gain insight into how we can improve the kernel module to handle these situations better.

If you feel you are still affected by the situation and have not already reached out, please contact us immediately at support@upcloud.com.


Joel Pihlajamaa, CTO, Founder &
Antti Vilpponen, CEO
Posted over 1 year ago. Feb 26, 2016 - 17:07 UTC
Update
We have successfully begun to bring affected customer cloud servers back online. We will change the status of this incident when all affected servers are back online. Again, if you run into problems, do not hesitate to contact support@upcloud.com. We will issue a post-mortem to affected customers once the incident has been closed.
Posted over 1 year ago. Feb 26, 2016 - 13:19 UTC
Update
We are making progress on building a solution and are fully focused on resolving the situation as soon as possible. We hope to have the last faulty servers online soon, but unfortunately we cannot give an exact ETA yet.
Posted over 1 year ago. Feb 26, 2016 - 10:33 UTC
Update
The issue now affects only a dozen or so customers and we are in the process of restoring service to all involved. We kindly ask you to send your server UUID to support@upcloud.com if you feel your cloud server is still not fixed. We thank you for your co-operation and apologise once again for the extended loss of service.
Posted over 1 year ago. Feb 26, 2016 - 08:43 UTC
Update
Our engineers are still working on the issue to resolve it as fast as possible. We will update with more information as it arrives.
Posted over 1 year ago. Feb 26, 2016 - 07:47 UTC
Update
We are still working to bring back affected servers. We apologise for the extended loss of service.
Posted over 1 year ago. Feb 26, 2016 - 06:58 UTC
Update
We have identified the issue and are building a solution to restore access to all affected servers.
Posted over 1 year ago. Feb 26, 2016 - 06:27 UTC
Identified
We are investigating an issue with a HDD backend in Helsinki. The issue freezes affected servers completely or slows them down. We will report back as soon as possible.
Posted over 1 year ago. Feb 26, 2016 - 05:52 UTC