Bobcares

Cornering an SLA killer – How systematic resolution of an OpenVZ crash protected uptime guarantees

Jan 23, 2015

It was a peaceful night shift at a data center we managed. Just a few routine server provisioning tasks and customer queries were keeping us occupied. Suddenly, all the alarm bells started ringing.

25+ managed server instances had gone offline, and the alert priority was among the highest. Each passing minute was eating into our SLA guarantee. An OpenVZ node had gone down with almost no warning at all. The monitors had shown a slight increase in load, but well within normal range.

OK, first order of business: bring the server back online. The OpenVZ kernel booted up, and all instances were back online in less than 15 minutes, but the outage cut our uptime to 99.96%. We could not afford to let it happen again, so we started digging.
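The uptime arithmetic can be sketched in a few lines of Python. This is only an illustration: the 30-day measurement window and the function names are assumptions, since the post does not say over what period the SLA is measured.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(downtime_minutes, period_days=30):
    """Return uptime as a percentage of the period (30 days is an assumption)."""
    total_minutes = period_days * MINUTES_PER_DAY
    return 100.0 * (1 - downtime_minutes / total_minutes)


def downtime_budget_minutes(sla_percent, period_days=30):
    """Return how many minutes of downtime an SLA target allows in the period."""
    total_minutes = period_days * MINUTES_PER_DAY
    return total_minutes * (1 - sla_percent / 100.0)


if __name__ == "__main__":
    # A single 15-minute outage, as in the incident described above.
    print(f"Uptime after a 15-minute outage: {uptime_percent(15):.3f}%")
    # 99.99% is used here purely as an example target.
    print(f"Downtime budget at 99.99%: {downtime_budget_minutes(99.99):.2f} minutes")
```

Under that 30-day assumption, a 15-minute outage works out to roughly 99.965% uptime, in line with the figure quoted above, while a 99.99% target would leave a budget of only about 4.3 minutes.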

The server went down at 2:36 am, but the messages log had stopped recording at 2:24 am, and nothing before that point looked unusual. System logs and error logs all seemed normal. Hoping to find clues from outside the server, we looked at the System Event Log of the IPMI. And there it was: at around 2:24 am, two devices had reported a fatal error causing a “Critical Interrupt”. Cross-verifying the device number with the system configuration, we found that the Broadcom Ethernet card had failed, causing server resources to be held indefinitely and freezing the server.

A scheduled maintenance window was announced, and a new card was installed in the server. A diagnostic check of the old NIC confirmed our initial analysis of a chip failure.

Network cards have among the lowest failure rates, but failures do happen. Swapping the card typically takes just 5 minutes, which, combined with the time to react to the issue and boot the server, adds up to about 15 minutes of downtime. For mission-critical applications that cannot afford uptime below 99.99%, NIC bonding can keep the system reachable even when one network card fails. Additionally, custom monitoring plugins can be created to watch NIC health, allowing fast replacement in case of a failure.
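As a rough idea of what such a monitoring plugin might look like, here is a minimal Python sketch. It is not the plugin used in this incident. It assumes a Linux host that exposes link state in `/sys/class/net/<iface>/operstate`, a monitoring system that follows the Nagios exit-code convention (0 = OK, 2 = CRITICAL, 3 = UNKNOWN), and an interface called `eth0`. The names `check_nic` and `main` are made up for illustration.

```python
import sys
from pathlib import Path

# Nagios-style plugin exit codes (assumed convention for the monitoring system).
OK, CRITICAL, UNKNOWN = 0, 2, 3


def check_nic(iface, sysfs_root="/sys/class/net"):
    """Return (exit_code, message) based on the interface's operstate file."""
    state_file = Path(sysfs_root) / iface / "operstate"
    try:
        state = state_file.read_text().strip()
    except OSError as exc:
        # A missing or unreadable file means the state cannot be determined.
        return UNKNOWN, f"UNKNOWN - cannot read {state_file}: {exc}"
    if state == "up":
        return OK, f"OK - {iface} operstate is up"
    return CRITICAL, f"CRITICAL - {iface} operstate is {state}"


def main(argv=None):
    """Check the interface named in argv[1] (default: eth0, an assumption)."""
    argv = sys.argv if argv is None else argv
    iface = argv[1] if len(argv) > 1 else "eth0"
    code, message = check_nic(iface)
    print(message)
    return code
```

Saved as a script, the check would end with `sys.exit(main())` so that the exit code reaches the monitoring system. A link-state check like this only catches failures that bring the interface down. A card that hangs while still reporting `up` would need additional signals, such as error counters or the IPMI event log.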

Bobcares systems administrators help data centers and web hosts to build and maintain fault tolerant systems. Are you looking for ways to improve your SLA compliance?

