Web hosting monitoring – What is it and how to do it right
Bobcares.com provides Technical Support Services for web hosts, digital marketers, and other hosting providers.
As part of our work we monitor web hosting infrastructure, and make sure the services remain responsive at all times.
So, what does this involve? How do we make sure end-users are not affected at any time?
This post is about that. Read on.
What is Web hosting monitoring?
Web hosting monitoring is a proactive approach to keeping hosting customers happy.
Instead of reacting to downtime after it happens, proactive web hosting monitoring aims to catch and fix service issues before they affect users.
Here at Bobcares, it primarily involves 2 things:
- 24/7 human monitoring & Emergency response – Hosting experts monitor servers 24/7 for availability, performance or security issues. If any metric we monitor goes outside the expected limits, we quickly log in to the server and fix the affected service.
- Trend analysis and Proactive patching – Experts continually analyze the trend of server alerts (e.g. an increase in high load events, an increase in mail queue size, etc.). If we notice a worsening trend in server health, we immediately optimize, patch or secure the affected service to prevent downtime.
Since each metric is manually verified by an expert, we are able to predict future server issues, and take proactive steps to avoid a service downtime.
This helps hosting providers deliver competitive service SLAs and get an edge in the market.
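To make the idea of trend analysis concrete, here is a minimal sketch of how a worsening trend could be flagged from a series of equally spaced measurements, such as daily mail queue sizes. It fits a least-squares line and looks at the slope. The function names and the `min_slope` threshold are hypothetical, chosen for illustration only, and are not the tooling our experts actually use.

```python
from typing import Sequence


def trend_slope(samples: Sequence[float]) -> float:
    """Least-squares slope of equally spaced samples (units per interval)."""
    n = len(samples)
    if n < 2:
        return 0.0  # not enough data to talk about a trend
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def is_worsening(samples: Sequence[float], min_slope: float) -> bool:
    """True if the metric grows by at least min_slope per interval (assumed threshold)."""
    return trend_slope(samples) >= min_slope


if __name__ == "__main__":
    daily_mail_queue = [120, 135, 160, 210, 280]  # made-up example data
    print(trend_slope(daily_mail_queue))
    print(is_worsening(daily_mail_queue, min_slope=20))
```

A slope check like this only hints that something is drifting. An expert still has to look at the server to find out why.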
What to monitor in your Web hosting infrastructure
There are literally hundreds of service metrics you can monitor in a web hosting infrastructure.
They range from something as common as server CPU load to things as obscure as web application cache size.
Which metrics we prioritize for a specific client depends entirely on the kind of hosting solution they provide.
That said, here are the main categories of service metrics we cover in all servers:
- User experience (availability & performance) – Measure the availability time and responsiveness of your services from a server outside your hosting network. You should get instant alerts when a service goes down or becomes slow to respond.
- Security – Monitor status reports from the security tools installed on your server and from security news channels. You need to take quick action if a security event happens on your server (e.g. spamming or brute forcing), or if a new app vulnerability is disclosed in security channels.
- Hardware health – Establish monitoring methods to detect HDD errors, RAID health, NIC errors, etc. Hardware errors can be hard to detect, but if detected ahead of time, you can prevent a major downtime.
- SLA – As a service provider you’ll need to guarantee a certain quality of service, and you’ll need data to verify whether you’re actually delivering it. Your monitoring system should be outside your network, and should measure uptime, speed and other metrics that are important for your business.
- Disaster recovery readiness – All hosting companies have a fall-back option in case the servers fail. This can include high-availability, failover mirrors, or even simple backups. By monitoring these systems you can quickly fix any error, and make sure your systems can quickly recover from a crash.
Now, let’s take a look at this in a bit more detail:
1. Availability & Performance monitoring
This is perhaps the most common of all kinds of monitoring. Everyone wants to know if their services are online and responsive.
But we’ve seen poorly implemented and operated systems that give false data on uptime and performance.
To ensure accurate alerts, we follow a set of tried-and-tested best practices. Some of them are:
- External monitoring – Many companies set up a monitoring system on a spare server within their network. In such a configuration, the monitoring server will never know if the network is unreachable or slow from the outside (due to network congestion, routing issues, etc.). So, the measurements will not match what end-users experience. To avoid this issue in our customer servers, we use servers outside the network to monitor and collect data about server performance.
- Resource usage monitoring – All server owners know about CPU and Memory. But there are other choke points in a server, such as Disk I/O, Net I/O, Steal CPU (for VPS), Swap, etc. that can slow down applications. So, in our customer servers, we make sure all these choke points are monitored and data logged.
- Service performance monitoring – Some key services such as MySQL and web servers can cause service lag due to misconfiguration or traffic spikes. Depending on the kind of hosting, we monitor specific service metrics such as number of connections, web error rate, response latency, and more. It helps us detect and fix a performance issue well before it is noticed by customers.
- Application monitoring – We’ve seen cases where web servers remain online, but sites show errors due to PHP module issues, DB connection limits, etc. That is why we set up application level monitoring as well that includes URL content check, application error rate check, and more. It has often given us early warning about server wide errors before visitors notice them.
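The URL content check mentioned above can be sketched as follows. This is a minimal illustration using only the Python standard library, assuming it runs from a machine outside the hosting network. The names `CheckResult`, `fetch` and `classify`, the 2-second slowness threshold and the expected text are all hypothetical choices for this example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class CheckResult:
    status: int       # HTTP status code, or 0 if the request failed outright
    body: str         # response body (empty on failure)
    latency_s: float  # wall-clock time taken by the request


def fetch(url: str, timeout_s: float = 10.0) -> CheckResult:
    """Request the URL once and record status, body and latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            status = resp.status
    except urllib.error.HTTPError as exc:
        status, body = exc.code, ""   # server answered with an error code
    except (urllib.error.URLError, OSError):
        status, body = 0, ""          # DNS failure, timeout, refused connection
    return CheckResult(status, body, time.monotonic() - start)


def classify(result: CheckResult, expected_text: str,
             slow_after_s: float = 2.0) -> str:
    """Turn a raw result into 'ok', 'slow', 'content_error' or 'down'."""
    if result.status != 200:
        return "down"
    if expected_text not in result.body:
        return "content_error"  # server is up, but the page is wrong
    if result.latency_s > slow_after_s:
        return "slow"
    return "ok"
```

A call such as `classify(fetch("https://example.com/"), "Example Domain")` would then feed an alerting system. The content check is what catches the case where the web server is online but the site shows a PHP or database error.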
2. Security monitoring
Many server owners think that as long as they have set up anti-virus or anti-spam software, things are going to be OK.
But attackers keep finding new ways to get into a server (new vulnerabilities, exploits, phishing methods, etc.).
So, it is important to monitor abnormal activities in a server and take action if you suspect foul play.
Here are a few metrics we’ve found to be useful in monitoring server security:
- Connection monitoring – Every server has a pattern of connections for each of its services. For example, 10-40 web connections at night and 150-200 during the day. An unusual increase in the number of connections can mean DoS attacks, comment spamming, brute forcing or even automated malware injection attempts. By reacting quickly to such alerts, we’ve been able to prevent several zero-day mass exploits.
- Authentication monitoring – People use weak passwords all the time. That is why brute force attacks such as dictionary attacks (trying common passwords) are still successful. We monitor authentication logs for repeated login failures, logins from unusual IPs, admin login attempts, etc. to detect and prevent account hacks.
- Malware uploads – Thousands of sites are blacklisted by Google every day for malware infection. We employ web application firewalls, process scanning and disk scanning to detect any malware that enters the server. If we detect a successful malware intrusion, we fortify the firewalls with the new signature so that future uploads will be automatically blocked.
- Vulnerability disclosures – Security researchers disclose new vulnerabilities almost every day. We keep an eye on security channels, and patch any new vulnerability that’s still present in servers. If an official patch is not available, we use a hotfix to block execution of exploits until we can implement a full patch.
- Anti-malware database updates – For the server to detect new viruses and malware, the scanner should have the latest signature database. We monitor new database releases, and apply the updates as soon as they are available.
- Security updates – Vendors of operating systems such as Red Hat Enterprise Linux, Ubuntu and Windows periodically release security patches. We monitor these high priority updates, and apply the patches as soon as they are available.
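As an illustration of authentication monitoring, the sketch below counts failed SSH password attempts per source IP. It assumes log lines in the usual OpenSSH style ("Failed password for ... from IP port N"). The function names and the limit of 5 failures are hypothetical, and a production setup would more likely rely on a dedicated tool than on a script like this.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches typical OpenSSH failure lines, with or without "invalid user".
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+"
)


def failed_logins_by_ip(lines: Iterable[str]) -> Counter:
    """Count failed password attempts per source IP."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def suspicious_ips(lines: Iterable[str], max_failures: int = 5) -> List[str]:
    """IPs with more than max_failures failed attempts (assumed limit)."""
    counts = failed_logins_by_ip(lines)
    return sorted(ip for ip, n in counts.items() if n > max_failures)
```

Feeding it the lines of an authentication log (the path differs between distributions) gives a list of IPs that deserve a closer look or a firewall block.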
3. Hardware health monitoring
Many server owners come to know about a hardware issue only after the server has gone down.
In reality, almost all of these issues can be detected well in advance if you listen to the right signals.
For example, if the system logs show I/O errors, the hard disk might die soon.
Here at Bobcares, we make it a point to monitor the health of all hardware components. This includes CPU temperature, fan speed, RAID status, disk errors, NIC errors, and more.
Timely detection of these issues helps us to schedule a hardware change during non-business hours, and thereby minimize the impact of a service downtime.
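As a small example of listening to such signals, the sketch below counts disk I/O errors per device in kernel log lines. It assumes Linux kernel messages containing the text "I/O error, dev NAME", which is a common but not universal format. The function name is hypothetical, and real deployments would normally combine this with SMART and RAID controller checks.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed message shape: "... I/O error, dev sda, sector 123456 ..."
IO_ERROR = re.compile(r"I/O error, dev (\w+)")


def io_errors_by_device(log_lines: Iterable[str]) -> Counter:
    """Count kernel I/O error messages per block device."""
    counts: Counter = Counter()
    for line in log_lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Any device that shows up in the result is a candidate for a closer health check and, if needed, a scheduled replacement.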
4. SLA monitoring
In the web hosting business, the two most important metrics for good service are availability and responsiveness.
Availability (uptime) guarantees typically range from 99.7% to 99.999%.
Similarly, we’ve seen responsiveness guarantees ranging from 5 seconds down to 2 seconds for SaaS apps and managed hosting platforms.
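To see what these uptime percentages mean in practice, it helps to convert them into a downtime allowance. The short sketch below does that arithmetic, assuming a 30-day billing period. The function name and the period length are illustrative assumptions, since each SLA defines its own measurement window.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an uptime guarantee over the period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for guarantee in (99.7, 99.9, 99.999):
        print(guarantee, round(allowed_downtime_minutes(guarantee), 3))
```

Over 30 days, 99.7% allows roughly 129.6 minutes of downtime, while 99.999% allows under half a minute (about 26 seconds). That gap is why the tighter guarantees need far more careful monitoring.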
No matter what the SLA is, every parameter that’s promised to the customer must be thoroughly monitored. Some best practices in this are:
- Distributed monitoring – Like all servers, monitoring servers can also fail. But that shouldn’t break your data. That is why we log the SLA metrics of our customer servers from multiple locations, and use fault-tolerant systems where possible.
- Early warning – We configure the SLA monitoring systems to alert us when a metric reaches about 70% of its threshold limit. This allows us to take preventive action and reverse the trend before an SLA breach occurs.
- Service status display – Displaying service status is a show of confidence in your systems. It shows customers that you stand behind your uptime guarantee, and are capable of handling any issue that may come up. So, on many of our customers’ sites we show the uptime statistics and service status.
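The early warning idea can be sketched as a simple three-state check. This is one possible reading of the 70% rule, in which the measured value (for example downtime minutes used so far, or response time in seconds) is compared against the SLA limit. The function name `sla_status` and the state labels are hypothetical.

```python
def sla_status(measured: float, limit: float, warn_ratio: float = 0.7) -> str:
    """Return 'ok', 'warning' or 'breach' for a metric where lower is better."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = measured / limit
    if ratio >= 1.0:
        return "breach"
    if ratio >= warn_ratio:
        return "warning"  # time to act before the SLA is actually broken
    return "ok"
```

For example, with a 2-second response time guarantee, a measured 1.5 seconds is already 75% of the limit and would raise a warning, leaving room to investigate before customers are affected.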
5. Backups / HA monitoring
Backups are any company’s insurance against failures.
That is why we give high importance to backup and HA monitoring. Some common metrics include:
- Backup completion check – Whether a success message was logged after the backup process completed.
- Backup error rate – The number of errors logged by backup processes.
- Disk usage check – The amount of free space available for backup. We never let it drop below 20%.
- HA sync latency – Whether the database or file replication has hit a bottleneck during data sync, which can cause a sync failure.
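The disk usage check is the easiest of these to automate. Below is a minimal sketch using Python's standard `shutil.disk_usage`, with the 20% floor mentioned above as the default. The function names and the idea of passing the backup mount point as `path` are assumptions for this example.

```python
import shutil


def free_percent(total_bytes: int, free_bytes: int) -> float:
    """Free space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def backup_space_ok(path: str, min_free_percent: float = 20.0) -> bool:
    """True if the filesystem holding `path` has enough free space for backups."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return free_percent(usage.total, usage.free) >= min_free_percent
```

Running `backup_space_ok` against the backup volume on a schedule, and alerting when it returns `False`, gives time to clean up or add storage before a backup fails for lack of space.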
The goal of web hosting monitoring is to prevent unexpected downtime by monitoring the infrastructure 24/7 and proactively patching or optimizing the systems to avoid service errors or failures. Today we’ve seen some of the top best practices we follow to achieve this goal.