The data was conclusive. The servers orion-47, orion-50, and orion-52 needed RAM upgrades. Over the past month their RAM usage had mostly stayed above 85% and was trending upward. Swap usage had grown by more than 20%, which was driving up I/O wait and nudging the load average higher.
These servers were part of a load-balancing cluster serving a SaaS application in a data center managed by our Dedicated Linux Systems Administrators. The occasion was our weekly review of alert trends and the corrective actions needed to head off performance degradation. Regular analysis of alert trends allows us to predict future resource bottlenecks and prevent service deterioration.
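The kind of check described above can be sketched in a few lines. The hostnames are from the story, but the sample figures and the `needs_ram_upgrade` helper are hypothetical; in practice the samples would come from a monitoring system (sar output, for instance):

```python
# A minimal sketch of the trend analysis described above. The sample
# data and thresholds are hypothetical illustrations.

def needs_ram_upgrade(samples, threshold=85.0, min_above=0.8):
    """Flag a host whose RAM usage is mostly above `threshold`
    and shows an overall increasing trend."""
    above = sum(1 for s in samples if s > threshold) / len(samples)
    # Crude trend test: compare the mean of the last third of the
    # samples with the mean of the first third.
    third = max(1, len(samples) // 3)
    rising = (sum(samples[-third:]) / third) > (sum(samples[:third]) / third)
    return above >= min_above and rising

# Hypothetical daily peak RAM usage (%) over a month:
usage = {
    "orion-47": [82, 86, 88, 87, 90, 91, 93, 94, 95, 96],
    "orion-50": [70, 72, 68, 71, 69, 73, 70, 72, 71, 70],
}

flagged = [h for h, s in usage.items() if needs_ram_upgrade(s)]
print(flagged)  # ['orion-47']
```

A real review would use a longer window and a proper regression for the trend, but the idea is the same: sustained high usage plus a rising slope, not a single spike.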
It is unwelcome, it is tedious, but it is inevitable.
Every service provider dreads a hard disk crash and the downtime that can follow, but it is an eventuality that will happen sooner or later.
Today was one such day. A high-priority alert notified our Dedicated Linux Server Administrators about a degraded RAID array in a data center we managed. Hard disk crashes are P0 (highest-priority) alerts in our infrastructure management procedures and trigger an emergency response.
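On Linux software RAID, alerts like this usually come from watching `/proc/mdstat`, where a degraded mirror shows up as an underscore in the member-status string (e.g. `[U_]`). A minimal sketch of such a check, using sample mdstat text rather than a live server:

```python
# Sketch of a degraded-array check. SAMPLE_MDSTAT mimics the layout of
# /proc/mdstat; on a real server the text would be read from that file.
import re

SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      1048512 blocks [2/2] [UU]
md1 : active raid1 sda2[0]
      5242816 blocks [2/1] [U_]
"""

def degraded_arrays(mdstat_text):
    """Return arrays whose status string (e.g. [U_]) contains '_',
    meaning a failed or missing member disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+) :", line)
        if m:
            current = m.group(1)
        elif current and re.search(r"\[U*_+U*\]", line):
            degraded.append(current)
    return degraded

print(degraded_arrays(SAMPLE_MDSTAT))  # ['md1']
```

Hardware RAID controllers need their vendor tools instead, but the principle is the same: poll the array status and page on any member marked failed.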
“I did nothing. It just crashes all the time!”
So began a professional administration request at the help desk of a data center we managed. The customer’s unmanaged Windows 2008 R2 VPS had started crashing one fine day for no apparent reason.
The event logs didn’t show anything out of the ordinary, so the next step was to analyze the crash dump.
“Logs from alpha-p3 is missing!”
We were responding to an issue raised by an onsite technician at a data center we managed. System logs from one server were missing from the central log server. It looked like the Rsyslog service used for central logging had crashed on the source server, losing two hours of log data.
Logs are critical to day-to-day server management, and missing logs are an urgent-priority issue. The Rsyslog service was restarted on the source server, and debugging was enabled to identify what had gone wrong. Looking at the update logs, we noted that the Rsyslog package had recently been updated, which pointed to a possible bug. A quick stop at the Rsyslog GitHub issue tracker confirmed that crashes had been reported and a patch was available. The package was updated on all servers to fix the issue. But that still left a question: what if a future update caused a similar crash? We needed a solution to make central logging resilient to failure.
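Two pieces usually go into that kind of resilience. First, a disk-assisted queue on the forwarding action, so messages are spooled locally instead of dropped when the pipeline stalls. A hypothetical fragment (the target hostname is a placeholder, not from the incident):

```
# rsyslog: forward with a disk-assisted queue so messages survive
# outages of the forwarding pipeline (target is a placeholder).
action(type="omfwd" target="logs.example.com" port="514" protocol="tcp"
       queue.type="LinkedList"
       queue.filename="fwd_queue"        # filename enables disk-assisted mode
       queue.maxDiskSpace="1g"
       queue.saveOnShutdown="on"
       action.resumeRetryCount="-1")     # keep retrying indefinitely
```

Second, on systemd-based servers, an override so a crashed daemon is restarted automatically rather than staying down for hours:

```
# Hypothetical systemd drop-in for rsyslog.service:
[Service]
Restart=on-failure
RestartSec=5
```

Neither fragment is the exact fix from this incident; they are a sketch of the standard knobs rsyslog and systemd offer for this failure mode.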
It was a peaceful night shift at a data center we managed. Just a few routine server provisioning tasks and customer queries were keeping us occupied. Suddenly, all alarm bells started ringing.
25+ managed server instances had gone offline, and the alert was among our highest priority. Each passing minute was eating into our SLA guarantee. An OpenVZ node had gone down with almost no warning at all. The monitors had shown a slight increase in load, but it was well within the normal range.
OK, first order of business: bring the server back online. The OpenVZ kernel booted up, and all instances were back online in less than 15 minutes, but that cut our uptime to 99.96%. We could not afford to let it happen again, and so we started digging.
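The uptime figure is easy to sanity-check, assuming the SLA is measured over a 30-day month (an assumption on our part; the post doesn't state the window):

```python
# Sanity-checking the uptime figure, assuming a 30-day SLA window.
minutes_in_month = 30 * 24 * 60           # 43200 minutes
downtime = 15                             # minutes the node was offline
uptime_pct = 100 * (1 - downtime / minutes_in_month)
print(f"{uptime_pct:.3f}%")               # 99.965%
```

That works out to about 99.965%, consistent with the roughly 99.96% quoted, and it shows how little slack a "three nines" or better guarantee leaves: one 15-minute outage consumes most of a month's error budget.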
“This definitely is a problem with your monitoring system! I never used this bandwidth. I was on holiday!”
The accounts department of the data center we managed referred this customer concern to us. His unmanaged dedicated server had shown a bandwidth spike of 20 times its normal usage, which had resulted in bandwidth overage charges.
The monitoring system was showing consistent stats for all other servers, so the anomaly looked like something that had happened on the customer’s own server.
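The cross-check behind that conclusion can be sketched simply: compare each day's transfer against the server's own baseline and flag the outliers. The figures and the `spike_days` helper below are hypothetical; real numbers would come from the bandwidth accounting system:

```python
# A minimal sketch of baseline-vs-spike detection. Sample data is
# hypothetical; day 5 models the ~20x spike from the story.

def spike_days(daily_gb, factor=5.0):
    """Return (day_index, gb) pairs where usage exceeds `factor`
    times the median of all samples."""
    ordered = sorted(daily_gb)
    median = ordered[len(ordered) // 2]
    return [(i, gb) for i, gb in enumerate(daily_gb) if gb > factor * median]

usage_gb = [4, 5, 4, 6, 5, 100, 5]   # daily transfer in GB
print(spike_days(usage_gb))          # [(5, 100)]
```

Using the median rather than the mean keeps the baseline honest: one huge spike day drags a mean upward and can hide itself, but it barely moves the median.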