Reliable, scalable DNS – How DNS clustering and centralized name servers resulted in a fast, scalable and fault-tolerant DNS service
The mood was upbeat. It was our weekly business review with a web host we support. Server improvements had resulted in zero service downtime and zero customer complaints about service reliability. It was time to figure out how to improve the infrastructure even further, and for that, we looked at the support requests.
Support requests are a gold mine of information on how customers perceive the service. Happy customers do not open trouble tickets, so every support request is a potential pointer to a system or process improvement. We started by looking at the top reasons for support tickets.
Sorting by top reasons, we saw that about 17% of new domain registrations led to support requests to change their name servers. This delayed the start of the service. Digging further, we found that this happened for two reasons:
(1) The current infrastructure had different name servers for different shared servers. An account on shared server “A” would have name servers NA1 and NA2, an account on shared server “B” would have name servers NB1 and NB2, and so on.
(2) By default, every new domain registration was assigned the name servers of the latest server in the farm. For example, a new domain would get name servers NH1 and NH2, even if the reseller ordering the domain was on shared server “B” with name servers NB1 and NB2.
As a result, every existing customer who ordered a new domain had to open a support request to change the name servers, and the number of such requests kept rising as the number of accounts grew.
So, as the next major infrastructure improvement, we decided to implement a scalable, fault-tolerant DNS infrastructure that would allow us to provision new accounts on any server smoothly and would support good service uptime.
After looking at various options, we decided a central DNS system with expandable clustering was the ideal solution. All domains in our server farm would have the same set of name servers, which would be two different servers in two different locations. The number of servers could be increased later. Such a DNS configuration would have the following advantages:
- Fault tolerance – Even if one DNS server goes down for some reason, the other server will still respond to DNS queries and keep the domain online.
- Mail service continuity – Even if the shared server hosting the domain goes down, DNS still resolves, so sending mail servers see that only the mail server is unavailable, not the whole domain, and retry delivery later.
- Scalable and flexible service provisioning – By assigning each domain the same set of name servers, internal provisioning algorithms could be updated to automatically allocate new domains to any server in the farm depending on domain ownership, server density, free resources, etc.
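The provisioning idea in the last point can be sketched as follows. This is only a minimal illustration, not the web host's actual algorithm: the `Server` class, the `pick_server` and `provision` functions, the capacity figures, and the `ns1.example.com` / `ns2.example.com` names are all hypothetical. It assumes the placement rule is "prefer a server that already hosts the owner's domains, then the least dense server".

```python
from dataclasses import dataclass, field

# Hypothetical shared name servers; every domain gets the same pair.
CENTRAL_NAME_SERVERS = ("ns1.example.com", "ns2.example.com")


@dataclass
class Server:
    name: str
    capacity: int                                   # max accounts (assumed metric)
    accounts: dict = field(default_factory=dict)    # domain -> owner

    @property
    def density(self):
        # Fraction of the server's capacity already in use.
        return len(self.accounts) / self.capacity


def pick_server(servers, owner):
    """Choose a server with free capacity for a new domain."""
    candidates = [s for s in servers if len(s.accounts) < s.capacity]
    if not candidates:
        raise RuntimeError("no server has free capacity")
    # False sorts before True, so servers already hosting this owner win;
    # ties are broken by the lowest density.
    return min(
        candidates,
        key=lambda s: (owner not in s.accounts.values(), s.density),
    )


def provision(servers, domain, owner):
    """Place the domain and return (server name, name servers)."""
    server = pick_server(servers, owner)
    server.accounts[domain] = owner
    return server.name, CENTRAL_NAME_SERVERS
```

Because the name servers no longer depend on which server is chosen, the placement rule can change freely without ever triggering a name server change request.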
Since this was a Linux server farm, the solution was implemented using a BIND master-slave configuration. On each server, a cron job ran every 5 minutes and looked for any domain zone file that had been added or updated. Any changes were immediately relayed to a central BIND master server, which updated the central configuration file. Every minute, the secondary BIND slave server queried the master for changes and applied them to its own configuration file. The BIND master and BIND slave were located on two different continents, giving them enough geographical separation.
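A rough sketch of the change-detection step might look like the following. It is an illustration under stated assumptions, not the script that was actually deployed: the function names, the `<domain>.db` file naming convention, the `slaves/` path, and the use of file modification times to spot changes are all hypothetical. The `zone` statements it renders follow standard BIND `named.conf` syntax for master and slave zones.

```python
from pathlib import Path


def changed_zone_files(zone_dir, since_epoch):
    """Return zone files modified after since_epoch, sorted by name.

    Assumes each zone lives in a file named "<domain>.db".
    """
    changed = []
    for path in sorted(Path(zone_dir).glob("*.db")):
        if path.stat().st_mtime > since_epoch:
            changed.append(path)
    return changed


def domain_of(zone_file):
    """Derive the domain from a "<domain>.db" file name (assumed convention)."""
    return Path(zone_file).name[: -len(".db")]


def master_stanza(domain, zone_path):
    """Render the named.conf zone statement for the master server."""
    return f'zone "{domain}" {{ type master; file "{zone_path}"; }};'


def slave_stanza(domain, master_ip):
    """Render the named.conf zone statement for the slave server."""
    return (
        f'zone "{domain}" {{ type slave; masters {{ {master_ip}; }}; '
        f'file "slaves/{domain}.db"; }};'
    )
```

A cron job could call `changed_zone_files` with the timestamp of its previous run and send the resulting zone files and stanzas to the master; how they are transferred and how BIND is reloaded is left out here.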
One month after the new system went live, a follow-up check showed that the number of name server change requests had dropped to zero. DNS services showed 100% uptime in the monitoring system, and DNS query speeds had improved due to load balancing and dedicated processing power.
DNS plays a critical role in keeping the services online. A good DNS system should have the following:
1. Centralized DNS
All your domains should use the same set of DNS servers. It allows flexible provisioning, easy migrations, and easy server consolidations.
2. DNS cluster
The name servers should consist of multiple geographically separate nodes, so that performance degradation or a service break in any one node will not cause the domains to go down. Additionally, spreading queries across the nodes leads to faster DNS query times.
3. Name server hardening
Regular audits should be done on name servers to block abuse attempts like DNS cache poisoning, Denial of Service, zone leaks, etc.
4. Performance monitoring
Service response time depends in part on the responsiveness of the DNS servers. Ideally a DNS request should be completed within 120-150 ms. Your monitoring system should be able to track the performance of the name servers, and settings should be continually optimized for fast response.
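As a small illustration of the monitoring point, the sketch below times a single lookup and compares it with the upper end of the 120-150 ms range. The function names and the `target_ms` parameter are hypothetical, and the measurement goes through the operating system's resolver via `socket.gethostbyname`, so it includes any local caching; a real monitoring system would more likely query each name server directly.

```python
import socket
import time


def measure_lookup_ms(hostname, resolve=socket.gethostbyname,
                      clock=time.perf_counter):
    """Return how long one lookup of hostname takes, in milliseconds.

    resolve and clock are injectable so the function can be tested
    without network access.
    """
    start = clock()
    resolve(hostname)
    return (clock() - start) * 1000.0


def classify(elapsed_ms, target_ms=150.0):
    """Label a lookup time against the target (150 ms assumed here)."""
    return "ok" if elapsed_ms <= target_ms else "slow"
```

For example, `classify(measure_lookup_ms("example.com"))` gives a simple pass/fail signal that can be recorded over time to spot name servers that are getting slower.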
Business challenges vary from one service provider to another. Bobcares Linux administrators design and implement custom solutions for managing servers and improving support processes. Are you looking for ways to improve your service?