Fault-tolerant service logging – How remote logging was made resilient to crashes
“Logs from alpha-p3 are missing!”
We were responding to an issue raised by an onsite technician at a data center we managed. System logs from one server were missing from the central log server. It looked like the Rsyslog service used for central logging had crashed on the source server, leading to 2 hours of lost log information.
Logs are critical to day-to-day server management, so the missing logs were treated as an urgent, high-priority issue. We restarted the Rsyslog service on the source server and enabled debugging to identify what had gone wrong. The package update logs showed that Rsyslog had recently been updated, which pointed to a possible bug. A quick check of the Rsyslog bug database on GitHub confirmed that crashes had been reported and that a patch was available. We updated all servers to fix the issue. But that still left a question: what if a future update causes a similar crash? We needed a solution to make central logging resilient to failure.
We created a custom event handler in the Nagios monitoring system to restart the Rsyslog service and alert us as soon as a crash is detected. A test run showed that the system was working as intended: it notified us when Rsyslog was stopped and automatically restored the service with a maximum delay of 1.25 minutes.
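The handler itself is not reproduced here, but the idea can be sketched as follows. This is a minimal illustration under several assumptions, not the script we deployed: it assumes Nagios passes the standard `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` macros as arguments, that Rsyslog is managed by systemd, and that the handler runs where it can restart the service directly (a remote host would need the command wrapped in something like NRPE or SSH). The function names and the restart policy are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical Nagios event handler that restarts Rsyslog when its check fails."""
import subprocess
import sys

# Assumption: the service is managed by systemd and is reachable from this host.
RESTART_COMMAND = ["systemctl", "restart", "rsyslog"]


def should_restart(state, state_type, attempt):
    """Decide whether a check result warrants a restart.

    Illustrative policy: restart on the first failed (SOFT) check for a fast
    recovery, and again once Nagios confirms the failure (HARD) as a last resort.
    """
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    return state_type == "SOFT" and attempt == 1


def main(argv):
    """Run the handler with [state, state_type, attempt] and return an exit code."""
    state, state_type, attempt = argv[0], argv[1], int(argv[2])
    if not should_restart(state, state_type, attempt):
        return 0
    # check=False: report the restart command's exit code instead of raising.
    result = subprocess.run(RESTART_COMMAND, check=False)
    return result.returncode


if __name__ == "__main__":
    # Nagios supplies exactly three arguments; do nothing otherwise.
    if len(sys.argv) == 4:
        sys.exit(main(sys.argv[1:]))
```

In a setup like this, the script would be registered as a Nagios command whose command line passes `$SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$`, and referenced from the Rsyslog service definition through the `event_handler` directive. Alerting is left to the normal Nagios notification settings for that service.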
Service failures can happen for a variety of reasons, including application bugs, hardware errors, and resource limits. A combination of systems engineering and prompt expert intervention helps us maintain high service uptime in data centers. In our operations, we ensure the following:
- An issue, once reported, is not considered resolved until we have a solution to prevent its recurrence.
- Systems engineering is used where required to extensively monitor services and auto-restore service availability.
- High priority is assigned to service-down and service-deterioration alerts. An expert engineer promptly responds to an alert, quickly restores the service, conducts a thorough root cause investigation, and implements a solution to prevent the issue from recurring.
Bobcares systems administrators help data centers and web hosts to build and maintain robust services. Are you looking for ways to improve your service quality and uptime?