Setting up LXC live migration to minimize business downtime
In 2010, Virgin Blue Airlines reported a $20 million dip in profit due to an outage in its online reservation system. Hardware failure was identified as the cause, and the outage severely damaged Virgin Blue’s reputation and customer loyalty as well as its profit.
Every online business is affected by downtime at one time or another, for reasons as varied as software bugs, hardware issues or human error.
Mitigation of downtime due to hardware errors is perhaps the most researched topic in data centers. IaaS providers, VPS providers and Cloud providers use strategies such as redundant hardware, fail-over systems and scheduled downtimes to minimize impact to businesses.
For instance, in VPS hosting, it is common for servers to be equipped with RAID hard disks, dual network cards, etc. as a hedge against failure. But, even in such a system, hardware replacements are sometimes required, which can lead to business downtime.
In these situations, the solution is to transfer all virtual servers on a host to another system, and then power down the host for maintenance. The technology that enables such a transfer without shutting down the virtual servers is called “live migration”.
All leading server virtualization systems support live migration, but each system has its own set of “quirks” or dependencies that need to be satisfied before a successful live migration can be done.
Recently, we helped a VPS provider set up live migration in their LXD/LXC server virtualization system. The VPS provider guaranteed 99.95% uptime, and used redundant hardware such as dual power supplies, dual network cards and RAID disks to mitigate the risk of hardware failure.
However, during scheduled maintenance, the VPSs had to be moved manually (shut down, move, restart) to another server, and that downtime counted against the SLA guarantee.
This is the story of how we set up live migration for an LXD/LXC VPS system.
LXC live migration setup overview
In LXC, there is no “single click” provision to migrate a container live from one server to another. For live migration to work, we fixed a few dependencies in LXC:
- Clustering servers with identical processors – Live migration works in LXC only between servers with identical CPU architecture. So, we clustered servers into 2 clusters – one with AMD processors and the other with Intel processors.
- Enabling CRIU support in containers – LXC uses a Linux feature called CRIU (Checkpoint/Restore In Userspace) for live migration. So, we enabled CRIU support for each container, and tuned its resource settings to make sure the container would run correctly on another server.
- Configuring hosts to listen on TCP ports – By default, LXD servers listen only on a local unix socket, which cannot be reached from other servers. For LXC live migration to work, we configured LXD to listen on a public TCP port.
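The three dependencies above can be sketched as a small Python helper. This is only an illustration under stated assumptions, not the provider's actual tooling: the function names, the CPU model strings and the extra host "VPS-host-06" are hypothetical, and the listen address `[::]:8443` is assumed here because 8443 is LXD's usual port. The helper only builds the `lxc config set core.https_address` command; it does not run it.

```python
import shutil
from collections import defaultdict


def group_hosts_by_cpu(host_cpu_models):
    """Group host names by CPU vendor so containers only move within a group."""
    groups = defaultdict(list)
    for host, model in host_cpu_models.items():
        name = model.lower()
        if "intel" in name:
            vendor = "intel"
        elif "amd" in name:
            vendor = "amd"
        else:
            vendor = "other"
        groups[vendor].append(host)
    return {vendor: sorted(hosts) for vendor, hosts in groups.items()}


def criu_available():
    """Return True if a `criu` binary is found on PATH of the current host."""
    return shutil.which("criu") is not None


def listen_on_tcp_command(address="[::]:8443"):
    """Build the lxc command that makes the local LXD daemon listen on TCP."""
    return ["lxc", "config", "set", "core.https_address", address]


if __name__ == "__main__":
    # Hypothetical inventory: model strings as they might appear in /proc/cpuinfo.
    hosts = {
        "VPS-host-04": "Intel(R) Xeon(R) CPU",
        "VPS-host-05": "Intel(R) Xeon(R) CPU",
        "VPS-host-06": "AMD EPYC",
    }
    print(group_hosts_by_cpu(hosts))
    print("criu found:", criu_available())
    print(" ".join(listen_on_tcp_command()))
```

Grouping by vendor mirrors the two clusters described above. A stricter check, for example comparing CPU feature flags, may be needed where processors of the same vendor differ.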
Once the dependencies were fixed, live migration worked flawlessly between the LXD servers. Here I’ll go through an example of how live migration was implemented to migrate a container named “VPS-silver-2752” from server “VPS-host-04” to “VPS-host-05”.
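As a rough sketch of what that migration could look like, the Python snippet below builds the `lxc remote add` and `lxc move` commands for the container and hosts named above. It assumes the commands are issued from a client machine on which both hosts are registered as LXD remotes. The function name and the IP addresses (taken from the documentation range 192.0.2.0/24) are hypothetical, and the exact options needed for a live move can vary between LXD releases.

```python
import shlex


def migration_commands(container, source, target, remote_addresses):
    """Build lxc commands to move a container between two LXD remotes."""
    # Register each host as a remote on the client (addresses are placeholders).
    commands = [
        ["lxc", "remote", "add", name, address]
        for name, address in remote_addresses.items()
    ]
    # Move the container from the source remote to the target remote.
    commands.append(
        ["lxc", "move", f"{source}:{container}", f"{target}:{container}"]
    )
    return commands


if __name__ == "__main__":
    cmds = migration_commands(
        "VPS-silver-2752",
        "VPS-host-04",
        "VPS-host-05",
        {"VPS-host-04": "192.0.2.4", "VPS-host-05": "192.0.2.5"},
    )
    for cmd in cmds:
        # Print only; pass each list to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Printing the commands first gives a chance to review them before anything touches a production host.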