Setting up OpenVZ live migration with zero downtime
In 2013, Amazon suffered a 40-minute outage that reportedly cost the company around $5M. A website that is down for even a few minutes can cause reputation damage and drive away customers. To maintain business credibility, business owners prefer to purchase their accounts from hosting providers who offer the highest uptime guarantees.
Server downtime is unavoidable in cases such as server maintenance, capacity management, and hardware upgrades. But with proper planning, the business downtime of the accounts on that server can be avoided. In a server virtualization solution, to prevent the accounts on a server from becoming inaccessible during maintenance windows, we live-migrate those accounts from that server to another.
Live migration also enables efficient capacity management among the servers in a virtualization system. It helps relieve server performance overload without buying additional hardware, and it is used to redistribute resources in a server virtualization system so that everyone gets their fair share.
In live migration, we transfer a running virtual machine between physical machines without shutting down the VM. The state of the virtual machine at an instant – memory, storage, and network connectivity settings – is transferred from the host server to the destination. Though it sounds like an easy task, if performed without a proper capacity plan and the required configuration settings, the migration can add overhead or fail outright, defeating its purpose.
Recently we were contacted by a VPS provider who wanted to ensure 99.95% uptime for their VPSs. Hosted in OpenVZ containers, these VPSs faced downtime whenever hardware maintenance was performed on the server. To maintain the VPS uptime, we live-migrated these VPSs to other, less-loaded servers in the same server virtualization solution.
To ensure that the migration happened without any hindrance, we made a capacity plan first. We identified the resource requirement of each VM and the resource availability on the other servers, then mapped each VM to a destination server. This was done in such a way that the overall resource allocation stayed optimal and did not affect the performance of the server virtualization system.
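The mapping step boils down to checking that each destination has headroom for the containers assigned to it. Here is a minimal sketch of that check; the container IDs, memory figures and free-memory value are illustrative assumptions (on a real node this data would come from tools like 'vzlist'):

```shell
# Illustrative capacity check: sum the memory the candidate containers need
# and compare it against the destination server's free memory.
# The figures below are sample values, not from a real 'vzlist' run.
containers="101:2048 102:4096 104:1024"   # CTID:memory-in-MB
dest_free_mb=8192

total=0
for entry in $containers; do
  mem=${entry#*:}              # strip "CTID:" to get the memory figure
  total=$((total + mem))
done

if [ "$total" -le "$dest_free_mb" ]; then
  echo "fits: need ${total}MB, free ${dest_free_mb}MB"
else
  echo "does not fit: need ${total}MB, free ${dest_free_mb}MB"
fi
```

The same loop can be repeated per destination server to build the full VM-to-server mapping.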
We also chose between the two migration methods – 1. ‘vzmigrate’ and 2. ‘vzdump’ – for each container, after analyzing its data volume, traffic and the network speed. For small VMs, we did live migration using the single command ‘vzmigrate’. For large VMs, or when network speed was low, we switched to a more reliable, step-by-step transfer using ‘vzdump’.
Though live migration looks like a simple one-step process, there are some probable failure points – compatibility issues leading to migration failure, data sync failures, and website downtime due to IP conflicts. Here is a walk-through of how we migrated container 101 from the host server and restored it as container 103 on the destination server, avoiding these failure points.
1. Avoiding migration failure due to compatibility issues
To avoid a migration failure due to compatibility issues between the host and destination servers, we made these pre-migration configuration updates on both servers.
a. The OpenVZ migrate tool uses the ssh and rsync tools to copy the container from one server to another. So we first configured the two servers to communicate using ssh keys.
b. Before migrating, we confirmed that the kernel versions were compatible in both servers.
c. The ‘rsync’ tool was installed on both servers to copy over the VPS files.
d. A backup of the VM to be migrated was taken, to be on the safe side.
e. We synchronized the system time on the source and destination servers using NTP to avoid application sync errors.
f. We then performed a preliminary check to identify failure points that may happen in that live migration.
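The kernel compatibility check in step b can be scripted. A minimal sketch is shown below; the version strings are illustrative, and on real servers the two inputs would come from `uname -r` locally and `ssh root@dest uname -r` remotely:

```shell
# check_kernels SRC_VER DST_VER -> prints "ok" when the kernel versions
# match, "mismatch" otherwise.
# On real servers: check_kernels "$(uname -r)" "$(ssh root@dest uname -r)"
check_kernels() {
  if [ "$1" = "$2" ]; then
    echo ok
  else
    echo mismatch
  fi
}

# Sample version strings (assumptions, not from the servers in this post):
check_kernels "2.6.32-042stab120.11" "2.6.32-042stab120.11"   # matching kernels
check_kernels "2.6.32-042stab120.11" "2.6.32-042stab108.8"    # mismatched kernels
```

A "mismatch" result here is the cue to upgrade one side before attempting the migration.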
The host and destination servers were then configured to enable the migration to happen.
a. Configuring host and destination servers
The ssh key was generated in the host server and the public key was copied over to destination server.
[root@host ~]# ssh-keygen -t rsa
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
cf:70:02:f5:84:8c:3d:41:3b:9a:29:60:00:a1:20:2a
The key's randomart image is:
+--[ RSA 2048]----+
|B. =+o.          |
|=. ..=+          |
|E o . o..        |
|.. . = .         |
| . + S .         |
|    . *          |
|     o           |
|                 |
|                 |
+-----------------+
[root@host ~]# scp .ssh/id_rsa.pub root@dest:./id_rsa.pub
On the destination server, the following commands were executed to add the host server to the list of authorized ssh agents. The public key of the host server was appended to the ‘authorized_keys’ list on the destination server.
root@dest:~# mv id_rsa.pub /root/
root@dest:~# cd /root/.ssh/
root@dest:~/.ssh# touch authorized_keys2
root@dest:~/.ssh# chmod 600 authorized_keys2
root@dest:~/.ssh# cat ../id_rsa.pub >> authorized_keys2
After configuring the servers, an ssh session was established from host server to destination server, to confirm the connectivity.
[root@host ~]# ssh -2 -v root@dest
debug1: Entering interactive session.
debug1: client_input_global_request: rtype firstname.lastname@example.org want_reply 0
debug1: Sending environment.
debug1: Sending env LANG = en_IN
Welcome to Ubuntu 15.10 (GNU/Linux 4.2.0-25-generic x86_64)
Once the connection was successfully confirmed, we went ahead to perform the pre-migration check.
b. Pre-migration check
To verify that the VPS can be migrated live using ‘vzmigrate’, we did a preliminary check to identify failure points, if any.
[root@host dump]# vzmigrate --check-only --live dest.cpiv.com 101
Locked CT 101
Checking live migration of CT 101 to dest.cpiv.com
c. Migration process
After confirming that there were no failure points, the VPS was migrated from the host to the destination server using the ‘vzmigrate’ command.
[root@host dump]# vzmigrate --live -r no --keep-dst dest.cpiv.com 101
Locked CT 101
Starting live migration of CT 101 to dest.cpiv.com
Preparing remote node
Initializing remote quota
Syncing private
Syncing 2nd level quota
Turning quota off
Cleanup
By default, once the migration process completes, the container’s private area and configuration file are deleted from the host server. As we did not want the container to be deleted from the host immediately, we used the ‘-r no’ option to retain the VPS files on the host server. The ‘--keep-dst’ option was used to avoid re-syncing the container private area in case an error occurred during the first migration attempt, as that could corrupt the configuration settings.
The VPS started functioning in the destination server without issues.
2. Avoiding data sync failure
In OpenVZ live migration, a number of background stages are involved – freezing the container, copying the container’s state to a dump file, restoring it on the destination, and restarting the container. Due to network issues or huge data sizes, delays or failures can happen at any of these stages, so live migration using the single ‘vzmigrate’ command did not work properly in certain cases. For instance, for VPSs with huge databases, the migration took a long time and the process hung.
In cases where ‘vzmigrate’ failed, we used a more reliable migration method with the ‘vzdump’ utility. As vzdump is not included in the OpenVZ repository by default, the SolusVM repository was configured and the utility was then installed on both the source and destination servers. We then created a dump file on the host server, transferred it to the destination server and restored the VPS there. Here is how we did it.
a. Creating VPS dump
On the source server, the dump file was first generated with the container data, using the ‘vzdump’ command.
[root@host dump]# vzdump --compress 101
INFO: Starting new backup job - vzdump --compress 101
INFO: Starting Backup of VM 101 (openvz)
INFO: status = CTID 101 exist mounted running
WARN: online backup without stop/suspend/snapshot
WARN: this can lead to inconsistent data
INFO: creating archive '/vz/dump/vzdump-101.dat' (/vz/private/101)
INFO: Total bytes written: 810557440 (774MiB, 6.6MiB/s)
INFO: file size 211MB
INFO: Finished Backup of VM 101 (00:01:59)
The OpenVZ container files were compressed and stored in the dump folder in the source server.
[root@host dump]# ls
vzdump-101.log  vzdump-101.tgz
b. Restoring the VPS
The compressed files were then copied to the destination server using rsync. On the destination server, the vzdump utility was used to restore the VPS.
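The copy command itself was not shown above; a sketch of what it would look like, where the dump path follows the defaults seen earlier and the destination host name is an assumption:

```shell
# Build the rsync command used to copy the compressed dump to the
# destination node. The paths and host name are illustrative assumptions.
DUMP=/vz/dump/vzdump-101.tgz
DEST=root@dest.cpiv.com:/vz/dump/
CMD="rsync -avz --progress $DUMP $DEST"

# On the real source server this command would then be executed;
# here we only print it.
echo "$CMD"
```

The `-a` flag preserves permissions and timestamps, `-z` compresses in transit, and `--progress` shows transfer status for large dumps.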
root@dest:~# vzdump --restore vzdump-101.tgz 103
INFO: restore openvz image 'vzdump-101.tgz' using ID 103
INFO: extracting archive 'vzdump-101.tgz'
INFO: Total bytes read: 676341760 (646MiB, 30MiB/s)
INFO: extracting configuration to '/etc/vz/conf/103.conf'
INFO: restore successful
After the new VPS was restored and confirmed to be working fine, we started the VPS in the destination server and shut down the one in the host server.
3. Ensuring zero downtime for website during migration
We identified that there was a chance of an IP conflict if the VPSs on the host and destination servers ran on the same IP. To avoid that, we changed the IP address and the hostname of the VPS on the destination server. The DNS records were updated to reflect the new IP address after reducing their TTL values, to ensure that the changes propagated without much delay.
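The TTL reduction matters because resolvers may cache the old record for up to the previous TTL. A small sketch of the cutover-readiness logic, with illustrative TTL and elapsed-time values:

```shell
# cutover_ready OLD_TTL ELAPSED -> prints "safe" once the old record's TTL
# window (in seconds) has fully elapsed since the TTL was lowered,
# "wait" otherwise. The numbers below are assumptions for illustration.
cutover_ready() {
  if [ "$2" -ge "$1" ]; then
    echo safe
  else
    echo wait
  fi
}

cutover_ready 86400 90000   # a day-long old TTL, 25 hours elapsed
cutover_ready 86400 3600    # only an hour elapsed: resolvers may still cache the old IP
```

Only after the "safe" point can we be reasonably sure that visitors are reaching the new IP rather than a cached copy of the old record.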
The restored VPS was then successfully started on the destination server and the data was cross-checked for completeness.
root@dest:~# vzctl start 103
Starting container...
Opening delta /vz/private/103/root.hdd/root.hdd
Adding delta dev=/dev/ploop53009 img=/vz/private/103/root.hdd/root.hdd (rw)
Mounting /dev/ploop53009p1 at /vz/root/103 fstype=ext4 data='balloon_ino=12,'
Container is mounted
Setting CPU units: 1000
Container start in progress...
After confirming that the new VPS and its websites were functioning properly in the new server with the new IP address, we suspended the container from the host server.
Whenever we performed a planned migration, the transfer was done during off-peak hours to minimize file changes on the websites. But there was still a chance that a few files got updated during the migration. So an additional sync of files was done from the source to the destination server, to ensure that all files were up to date on the destination and that any changes made on the source during the transfer were also copied over.
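That final delta sync can be sketched as a second rsync pass over the container's private area; the container IDs follow the walk-through above, while the destination host name and exact flags are assumptions:

```shell
# Build the command for the final delta sync of the container's private
# area. '--delete' removes files on the destination that were deleted on
# the source during the transfer window. Paths and host are assumptions.
SRC=/vz/private/101/
DST=root@dest.cpiv.com:/vz/private/103/
CMD="rsync -avz --delete $SRC $DST"

# On the real source server this command would be executed after the
# initial transfer; here we only print it.
echo "$CMD"
```

Because rsync only transfers changed files, this second pass is fast and closes the gap between the dump and the cutover.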
In this post we saw how we used live migration to ensure that the business uptime of a VPS was not affected during server maintenance tasks. We prepared the migration plan, worked around the probable failure points, and migrated the VPSs to different servers with zero downtime. Bobcares helps VPS providers, data centers and web hosts deliver industry-standard services through custom configuration and preventive maintenance of server virtualization solutions.