In this write-up, we explain the incident in more detail and describe how we went about restoring service. At the time of publishing, we are still monitoring the situation and making sure all affected customer cloud servers are back online.
Overview of our infrastructure
As is generally known, our infrastructure is built so that customers’ storage resources are separated from the compute resources by an Infiniband fabric. This setup has numerous benefits, for example the ability to store customer data on two physically separate storage backends for improved redundancy. The fabric also enables storage resources to be accessed from any compute host in the same datacenter.
This setup is especially valuable because customers can freely scale their resources up or down without the storage resources having to be moved to a new compute host. The storage always remains accessible inside the same datacenter from whichever compute host offers the best performance for the new set of resources.
Overview of the HDD incident
Today at around 7.00 EET we saw an abnormal number of signals pointing to an error on one of our HDD backends. The server itself was fully functional and, at first, only a handful of the storage resources residing on that backend were found to be faulty. These, however, also affected resources on a number of compute hosts.
The situation degraded when a kernel module that runs on the compute hosts and manages the Infiniband fabric traffic did not cope with the new situation. Again, the great majority of the storage resources on the backend were fully functional, so the issue was not very widespread. The kernel module ended up in a so-called deadlock, which can only be resolved by rebooting the server.
As explained earlier, a single storage backend can hold storage resources that are used by cloud servers on multiple compute hosts. The deadlock, combined with the great flexibility of our resources, turned against us: the incident eventually affected about 90% of our compute hosts, but only 3.5% of our customers’ cloud servers in our Helsinki datacenter. Had we resolved the situation in the traditional manner, we would have had to restart all of these physical compute hosts and thus disrupt the operations of thousands of perfectly functional cloud servers.
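To illustrate how so few faulty storage resources can reach so many compute hosts, here is a minimal Python sketch. The host and server names, the counts, and the round-robin placement are all made up for illustration and do not describe our actual topology; only the 90% and 3.5% ratios come from the incident.

```python
# Hypothetical toy datacenter: 20 compute hosts with 30 cloud servers each.
# The server -> host placement is an assumption made for this example.
placement = {f"srv-{i}": f"host-{i % 20}" for i in range(600)}

# Assume 21 servers have storage on the faulty part of the backend:
# one on each of hosts 0..17, plus a second one on hosts 0, 1 and 2.
faulty = {f"srv-{i}" for i in list(range(18)) + [20, 21, 22]}


def affected_hosts(placement, faulty_servers):
    """Hosts that run at least one server whose storage is faulty."""
    return {placement[server] for server in faulty_servers}


def collateral(placement, faulty_servers):
    """Healthy servers that rebooting every affected host would interrupt."""
    hosts = affected_hosts(placement, faulty_servers)
    return [
        server
        for server, host in placement.items()
        if host in hosts and server not in faulty_servers
    ]


hosts_hit = affected_hosts(placement, faulty)
all_hosts = set(placement.values())

print(f"hosts affected:   {len(hosts_hit) / len(all_hosts):.1%}")
print(f"servers affected: {len(faulty) / len(placement):.1%}")
print(f"healthy servers a full reboot would interrupt: "
      f"{len(collateral(placement, faulty))}")
```

With these made-up numbers, 21 faulty servers out of 600 (3.5%) are enough to touch 18 of the 20 hosts (90%), and rebooting every touched host would interrupt 519 servers that have nothing wrong with them.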
Building a solution
During the day, our infrastructure and development teams worked on a way to resolve the situation that would let us isolate the fully functional cloud servers of most of our customers and still restore operations for the affected customers.
We finally had a working piece of automation in place by 14.30 EET and began cleaning up the situation we were in. As we send this, some final manual clean-up operations are still under way to return all affected cloud servers to a fully functional status.
Going forward and our apologies
We are terribly sorry and would like to apologise wholeheartedly for the extended loss of service. Choosing not to take the easy way, which would have disrupted the fully operational cloud servers on 90% of the compute hosts, forced us to come up with a more sophisticated solution, and that caused all the extra work. However, we do believe the non-affected customers also value that we did not disrupt their operations.
We would also like to highlight that data integrity is, and always will be, one of our top priorities. We do not want to rush into quick fixes that endanger data integrity. Today’s fix followed that philosophy: it was the option with the least risk to data integrity, and we are happy to state that the solution we put in place does not cause any data loss.
It is still not completely clear why the deadlock occurred. We will carry out a proper analysis of the hardware and software components involved. Operations are currently served through the redundant pair, and thus no further customer impact is expected.
We will also analyse why the kernel module in our Linux version caused the deadlock situation. The module is included in the default Linux kernel and thus is not developed by our staff. We will try to resolve this with the original developer and gain insight into how we can improve the kernel module to handle these situations better.
If you feel you are still affected by the situation, contact us immediately at firstname.lastname@example.org if you have not reached out previously.
Joel Pihlajamaa, CTO, Founder &
Antti Vilpponen, CEO