It was a peaceful night shift at a data center we managed. Just a few routine server provisioning tasks and customer queries were keeping us occupied. Suddenly, all the alarm bells started ringing.
25+ managed server instances had gone offline, and the alert priority was among the highest. Each passing minute was eating into our SLA guarantee. An OpenVZ node had gone down with almost no warning at all. The monitors had shown a slight increase in load, but well within normal range.
OK, first order of business: bring the server back online. The OpenVZ kernel booted up, and all instances were back online in less than 15 minutes, but that cut our uptime to 99.96%. We could not afford to let it happen again, so we started digging.
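To put that number in perspective, here is a rough sketch of the arithmetic. It assumes uptime is measured over a 30-day month, which is our assumption for illustration and not something stated above; the function names are hypothetical as well.

```python
def uptime_percent(downtime_minutes: float, period_days: int = 30) -> float:
    """Uptime as a percentage of the period, given total downtime in minutes."""
    period_minutes = period_days * 24 * 60  # 30 days is an assumed SLA window
    return 100.0 * (1 - downtime_minutes / period_minutes)


def downtime_budget_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime that a given SLA percentage allows in the period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - sla_percent / 100.0)


print(f"{uptime_percent(15):.3f}%")                 # 99.965%
print(f"{downtime_budget_minutes(99.99):.2f} min")  # 4.32 min
```

Under that assumption, a single 15-minute outage uses up more than three times the monthly budget of a 99.99% SLA.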
The server went down at 2:36 am, but the messages log had stopped recording at 2:24 am, with nothing unusual before that. System logs and error logs all seemed normal. Maybe we’ll get some clues from outside the server, we thought, and so we looked at the System Event Log of the IPMI. And there it was: at around 2:24 am, two devices had reported a fatal error causing a “Critical Interrupt”. Cross-checking the device number against the system configuration, we found that it was the Broadcom Ethernet card that had failed, causing server resources to be held indefinitely and the server to freeze.
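A minimal sketch of how such a System Event Log search could be scripted, not necessarily the exact procedure followed during the incident. It assumes the `ipmitool` utility is installed and can reach the BMC; the helper names and the sample log lines are made up for illustration, and real SEL output varies by vendor.

```python
import subprocess
from typing import List


def critical_interrupts(sel_text: str) -> List[str]:
    """Return the SEL lines that mention a Critical Interrupt."""
    return [
        line.strip()
        for line in sel_text.splitlines()
        if "critical interrupt" in line.lower()
    ]


def read_sel() -> str:
    """Fetch the SEL with `ipmitool sel list` (needs ipmitool and BMC access)."""
    result = subprocess.run(
        ["ipmitool", "sel", "list"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Illustrative sample only; these are not the log lines from the incident.
SAMPLE_SEL = """\
1 | 01/01/2020 | 02:24:10 | Critical Interrupt | PCI PERR | Asserted
2 | 01/01/2020 | 02:24:11 | Critical Interrupt | PCI SERR | Asserted
3 | 01/01/2020 | 02:40:02 | System Event | OEM System boot event | Asserted
"""

for entry in critical_interrupts(SAMPLE_SEL):
    print(entry)
```

On a host with BMC access, `critical_interrupts(read_sel())` would apply the same filter to the live log.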
A scheduled maintenance was announced, and a new card was put into the server. A diagnostic check of the old NIC confirmed our initial analysis of a chip failure.
Network cards have one of the lowest failure rates, but failures do happen. Typically it takes just 5 minutes to change the card, which, combined with the time to react to the issue and the boot time, adds up to about 15 minutes of downtime. For mission-critical applications that cannot afford uptime to drop below 99.99%, NIC bonding can be used to keep the system online even if a network card fails. Additionally, custom monitoring system plugins can be created to monitor NIC health, allowing fast replacement in case of a failure.
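As an illustration of such a plugin, here is a minimal sketch that reads the status text the Linux bonding driver exposes under `/proc/net/bonding/` and reports in the Nagios exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names, the sample text and the interface names `eth0`/`eth1` are assumptions for illustration; treat it as a starting point rather than a finished check.

```python
from typing import Dict, Tuple


def slave_status(bonding_text: str) -> Dict[str, str]:
    """Map each slave interface to its MII status from bonding driver output."""
    status: Dict[str, str] = {}
    current = None
    for raw in bonding_text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # Only statuses that follow a "Slave Interface" line belong to a slave.
            status[current] = value
    return status


def check_bond(bonding_text: str) -> Tuple[int, str]:
    """Return a Nagios-style (exit_code, message) pair for a bond."""
    status = slave_status(bonding_text)
    down = sorted(name for name, state in status.items() if state != "up")
    if not status:
        return 3, "UNKNOWN: no slave interfaces found"
    if len(down) == len(status):
        return 2, "CRITICAL: all slaves down: " + ", ".join(down)
    if down:
        return 1, "WARNING: redundancy lost, slave(s) down: " + ", ".join(down)
    return 0, "OK: all %d slave(s) up" % len(status)


# Illustrative sample; on a real host this text would come from
# open("/proc/net/bonding/bond0").read(), assuming the bond is named bond0.
SAMPLE_BOND = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up

Slave Interface: eth0
MII Status: up
Link Failure Count: 0

Slave Interface: eth1
MII Status: down
Link Failure Count: 1
"""

code, message = check_bond(SAMPLE_BOND)
print(code, message)
```

A single failed slave is reported as WARNING rather than CRITICAL because the bond is still carrying traffic; the point of the alert is to get the card replaced before the second one fails.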
Bobcares systems administrators help data centers and web hosts to build and maintain fault tolerant systems. Are you looking for ways to improve your SLA compliance?