Select Page

linux administration


Fault tolerant service logging – How remote logging was made resilient to crashes

Logs from alpha-p3 is missing!

We were responding to an issue raised by an onsite technician for a data center we managed. System logs from one server was missing in the central log server. It looked like the Rsyslog service that was used for central logging had crashed in the source server, leading to 2 hours of lost log information.

Logs are critical to day-to-day server management and missing logs were an urgent priority issue. Rsyslog service was restarted in the source server, and debugging was enabled to identify what had gone wrong. Looking at the update logs, we noted that the Rsyslog package was recently updated, which pointed to a possible bug. A quick stop at the Rsyslog github bug database confirmed that crashes were reported, and a patch was available. An update was done in all servers to fix the issue. But it still left the question, what if a future update causes a similar crash? We needed a solution to ensure the central logging is resilient to failure. (more…)

Cornering an SLA killer – How systematic resolution of an OpenVZ crash protected uptime guarantees

Cornering an SLA killer – How systematic resolution of an OpenVZ crash protected uptime guarantees

It was a peaceful night shift at a data center we managed. Just a few routine server provisioning and customer queries were keeping us occupied. Suddenly all alarm bells started ringing.

25+ managed server instances had gone offline, and the alert priority was among the highest. Each passing minute was eating into our SLA guarantee. An OpenVZ node had gone down with almost no warning at all. The monitors had shown a slight increase in load, but well within normal range.

OK, first order of business, bring the server back online. The OpenVZ kernel booted up, and all instances were back online in less than 15 minutes, but that cut our uptime to 99.96%. We just cannot afford to let it happen again, and so, we started digging.
(more…)