Setting up Percona high availability database cluster
MySQL is by far the most popular database used for web applications. It is a very easy database to use, but it doesn’t have the best uptime record. Many database issues such as corrupted tables, resource limits, or replication errors often lead to website downtime.
Many of these database issues can be avoided by proactive database maintenance. However, issues like hard disk crash, or network errors can still happen, leading to database corruption. An effective hedge against such issues is to use a database cluster.
Database high availability using Percona XtraDB Cluster (PXC)
High availability for databases is achieved through clustering. When a database is replicated into several physical servers, even if one server goes down, the website can continue to function using the other servers.
Why MySQL clustering isn’t reliable
MySQL supports clustering. The most popular method is called “master-master” replication, where a web application can write to any database server in the cluster, and the changes will be immediately replicated in all other servers. However, MySQL clusters are known to produce inconsistent data (data that’s different in each server) when the cluster nodes are unable to talk to each other.
So, even though MySQL clusters deliver high availability, the data couldn’t be relied upon.
PXC uses Galera technology to avoid data inconsistency
It is this problem that was fixed by a MySQL clustering solution known as “Galera”. It performs frequent consistency checks, so that if a database node is unable to talk to others, it won’t accept any new updates, and will remove itself from the cluster.
Percona XtraDB Cluster (PXC) combined Galera technology and MySQL’s InnoDB engine to produce a fast, highly available cluster, which is why we recommend our customers to use PXC for their websites.
Setting up a Percona XtraDB Cluster
The number of nodes in a PXC cluster, the connection limits on each node, etc. depend on the performance requirements of each business. However, there are a few design and configuration decisions that are applicable on a wide variety of web infrastructures. Let’s take a look at those.
Designing a database service architecture
Before a Percona cluster can be setup, it must be decided how the web servers will communicate with the database servers. There are many architectures to evenly distribute database queries to multiple database servers. Some websites connect each web server to each database server (many to many links), while some use a central load balancer to distribute the load.
We manage websites of several online businesses. For these websites, we recommend using a central load balancer where all traffic is channeled through a central load balancer. This design minimizes the load on the database servers (fewer health checks), and enables better monitoring and control.
Configuring PXC for optimal performance and security
Now, coming to the PXC setup in itself, there are several performance, stability and security considerations involved in deciding the cluster configuration.
Here are a few common settings we configure in the PXC servers that we manage for our customers:
1. Using odd number DB nodes to avoid database corruption
To split the database load, we can use an even number of nodes (like 2) or odd number (eg 3). We recommend using odd-number cluster because it gives a degree of protection against “split-brain” condition. For eg., in a 2-node cluster, if 1 node is out of sync, there is no way to decide who has the latest data. If there are 3 nodes, 2 nodes can agree what is the most recent data.
2. Enabling binary logging for easy fault recovery
To make the cluster robust against failures, we setup a system called “binary logging” in the servers we manage. This enables rolling back a change (like dropping a table, deleting important data, etc.) very easy. This is not enabled by default in Galera clusters, but we’ve seen from experience that accidents like this do happen for customized WordPress websites.
3. Custom firewall configuration to disallow external access
An important consideration in setting up the cluster is security configuration. Galera does not work well with SELinux and default IPtables configuration. In the servers we manage for our customers, we custom configure IPtables so that only DB servers and load balancers can talk to each other, and all external IPs are blocked.
4. Setting up auto-commits to fix update conflicts
In a multi-master cluster, there are situations where the same database row is updated by two users at the same time (known as parallel updates). In these cases, XtraDB approves the first query, while the second query is blocked, leading to a “deadlock error”. We custom configure cluster settings so that any query in deadlock status will be re-attempted multiple times. This is done using the parameter “wsrep_retry_autocommit”.
Setting up high availability of load balancer
With a Percona cluster with 3 or more nodes, a website downtime can be avoided even if 2 database nodes go down. But, if the load balancer crashes for some reason, all 3 nodes would be become inaccessible. To negate this single-point-failure, we use a “backup” load balancer to immediately take over if the main load balancer fails for some reason.
Use a backup load balancer to avoid single point failure
To implement the system, first two load balancer servers are setup with the exact same configuration. Configuration settings and other data are synched to keep both servers current, so that at any point of time, either of them can work as the primary load balancer.
Now, to enable auto-switching to backup load balancer, we need a software that can detect failure of the main server, and bring the backup server online.
KeepAliveD uses virtual IPs (aka floating IPs) to bring backup server online
To enable auto-switching, we use a tool called KeepAliveD. The IPs of the load balancers (both “main” and “backup”) are configured to be controlled by KeepAliveD. In both load balancer servers, KeepAliveD program keeps in touch with each other. If the KeepAliveD program in “backup” HAProxy detects that the “main” server is down, it’ll assume the ownership of the IP, and bring it online in the backup server.
This concept is called “Floating IP” – in that the IP can “float” to another server when the main server goes down. By employing this technique, we achieve virtually 100% database uptime for our customers.
Business websites need an infrastructure that is highly available, while being cost effective. In this solution we explained how we use Percona XtraDB Cluster (PXC) and KeepAliveD to achieve high availability for databases. By choosing open source load balancing solutions, we were able to lower the infrastructure costs, while ensuring high uptime.
Bobcares helps business websites of all sizes achieve world-class performance and uptime, using tried and tested website architectures. If you’d like to know how to make your website more reliable, we’d be happy to talk to you.