It was a peaceful night shift at a data center we managed. Just a few routine server provisioning and customer queries were keeping us occupied. Suddenly all alarm bells started ringing.
25+ managed server instances had gone offline, and the alert priority was among the highest. Each passing minute was eating into our SLA guarantee. An OpenVZ node had gone down with almost no warning at all. The monitors had shown a slight increase in load, but well within normal range.
OK, first order of business, bring the server back online. The OpenVZ kernel booted up, and all instances were back online in less than 15 minutes, but that cut our uptime to 99.96%. We just cannot afford to let it happen again, and so, we started digging.