It is unwelcome, it is tedious, but it is inevitable.
Every service provider dreads a hard disk crash, and the downtime it can lead to, but it is one eventuality that will happen sooner or later.
Today was one such day. A high-priority alert notified our Dedicated Linux Server Administrators about a degraded RAID array in a data center we manage. Hard disk crashes are a P0 (highest priority) alert in our infrastructure management procedures, and they initiate an emergency response.
A status check on the 3ware RAID 5 array showed that only one of the disks had failed, which meant a recovery was possible. The following steps were then taken in preparation for the RAID rebuild:
- Confirm the availability of a compatible spare hard disk drive. This was easily done through the asset management software.
- Arrange for a diagnostic check on the spare hard disk to make sure it is fully fit to be installed.
- Confirm that backups are available for the data on the server, and if not, initiate a backup process.
- Schedule downtime as soon as the backups are complete, because a RAID 5 array with one failed disk has no fault tolerance left.
After confirming that the backups were up to date and the new disk was fully healthy, the server was taken offline. A hot swap was possible, but the rebuild is faster when no new data is being written to the array. A faster rebuild shortens the window in which a second disk could fail, and a degraded RAID 5 array cannot survive a second failure. After the old drive was removed, its sequence number was written on the new drive, which was then inserted into the bay.
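To put a rough number on why a shorter rebuild window matters, here is a minimal sketch. It assumes independent disk failures at a constant rate, which is a simplification: disks of the same age and batch often fail closer together than this model predicts. The 3% annual failure rate and the three surviving disks are illustrative assumptions, not figures from this incident.

```python
import math

HOURS_PER_YEAR = 24 * 365


def second_failure_probability(surviving_disks: int,
                               rebuild_hours: float,
                               annual_failure_rate: float) -> float:
    """Chance that at least one surviving disk fails during the rebuild.

    Assumes independent failures with a constant hazard rate derived
    from the annual failure rate (an assumed, illustrative model).
    """
    # Convert the annual failure probability into a per-hour hazard rate.
    rate_per_hour = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    # Probability that none of the surviving disks fails, then its complement.
    return 1.0 - math.exp(-surviving_disks * rate_per_hour * rebuild_hours)


if __name__ == "__main__":
    # Hypothetical inputs: 3 surviving disks, 3% annual failure rate.
    for hours in (3, 12, 48):
        p = second_failure_probability(3, hours, 0.03)
        print(f"{hours:>2} h rebuild window: {p:.4%}")
```

Under this model the risk grows almost linearly with the length of the rebuild, so keeping the array quiet during the rebuild directly reduces exposure.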
The 3ware Command Line Interface (tw_cli) was then used to rescan the controller and rebuild the array. The rebuild completed in 3 hours.
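As a rough illustration of the rescan-and-rebuild sequence, the sketch below only builds and prints the tw_cli commands; it does not run them. The controller, unit and port numbers (0, 0 and 3) are hypothetical, and depending on the controller's policy a rebuild may start on its own after the rescan, so treat this as an outline to check against the vendor documentation rather than the exact commands we ran.

```python
from __future__ import annotations

import shlex


def rebuild_plan(controller: int, unit: int, port: int) -> list[list[str]]:
    """Return the tw_cli invocations for a rescan followed by a rebuild.

    The IDs are placeholders; read the real ones from `tw_cli /cX show`.
    """
    ctl = f"/c{controller}"
    target_unit = f"{ctl}/u{unit}"
    return [
        ["tw_cli", ctl, "show"],      # inspect units and ports first
        ["tw_cli", ctl, "rescan"],    # let the controller detect the new drive
        ["tw_cli", target_unit, "start", "rebuild", f"disk={port}"],
        ["tw_cli", target_unit, "show"],  # check rebuild status
    ]


if __name__ == "__main__":
    # Hypothetical IDs: controller 0, unit 0, replacement drive on port 3.
    for argv in rebuild_plan(0, 0, 3):
        print(shlex.join(argv))
```

Printing the plan first gives a second administrator a chance to review the unit and port numbers before anything touches the array.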
RAID recovery can fail if not handled correctly. Here are a few tips from our RAID recovery playbook.
- Once a RAID array goes into degraded status, you are on borrowed time, especially with RAID levels such as 1, 2, 3, 4, 5 and 7 that tolerate only a single disk failure. So, schedule a rebuild as soon as possible.
- Before a RAID rebuild starts, confirm that the latest backups are available. If not, take a backup as soon as possible.
- Before a RAID rebuild starts, confirm that the RAID controller and new disks are completely healthy through self-tests.
- Always keep the inventory ready for action with at least two compatible drives. When one is used, always order a replacement.
- Do not run file system repair commands like chkdsk to fix RAID errors. They can corrupt the data on a degraded array.
- Attempt a rebuild only in case of a single drive failure. If more than one drive has failed, restoring from backup is the better option.
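The checks in this list can be folded into a small pre-flight gate. The sketch below is one possible encoding; the function name, its parameters and the messages are hypothetical and are not part of any tool we use.

```python
from __future__ import annotations


def rebuild_blockers(failed_drives: int,
                     backup_is_current: bool,
                     controller_healthy: bool,
                     spare_healthy: bool) -> list[str]:
    """Return the reasons a rebuild should not start yet (empty if clear)."""
    blockers = []
    if failed_drives != 1:
        # Playbook rule: rebuild only on a single drive failure.
        blockers.append("not a single-drive failure: restore from backup instead")
    if not backup_is_current:
        blockers.append("latest backup missing: take a backup first")
    if not controller_healthy:
        blockers.append("RAID controller failed its self-test")
    if not spare_healthy:
        blockers.append("replacement disk failed its self-test")
    return blockers


if __name__ == "__main__":
    # Hypothetical situation: one failed drive, but the backup is stale.
    for reason in rebuild_blockers(1, False, True, True):
        print("BLOCKED:", reason)
```

Returning every blocker at once, rather than stopping at the first, lets the on-call administrator fix all of them in parallel.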
Bobcares systems administrators take care of tech support and infrastructure management for data centers and web hosts. Are you looking for ways to improve your service quality?