Rebuilding failed Linux software RAID
This post explains how I rebuilt a software RAID array after a disk failure.
I recently got a notification from the SMART daemon saying this:
<code>This email was generated by the smartd daemon running on:

   host name: gateway.domain.be
  DNS domain: domain.be
  NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/hdd, 131 Currently unreadable (pending) sectors

For details see host's SYSLOG (default: /var/log/messages).

You can also use the smartctl utility for further investigation.
No additional email messages about this problem will be sent.</code>
SMART values for the drives were these:
<code>[root @ SERVEUR](1076)# smartctl -A /dev/hdd1
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   095   092   021    Pre-fail  Always       -       4308
  4 Start_Stop_Count        0x0032   099   099   040    Old_age   Always       -       1070
  5 Reallocated_Sector_Ct   0x0033   195   195   140    Pre-fail  Always       -       74
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       23159
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1055
196 Reallocated_Event_Count 0x0032   192   192   000    Old_age   Always       -       8
197 Current_Pending_Sector  0x0012   200   187   000    Old_age   Always       -       131
198 Offline_Uncorrectable   0x0012   200   187   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       67
200 Multi_Zone_Error_Rate   0x0009   200   190   051    Pre-fail  Offline      -       0</code>
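The rows worth watching in that table are the sector counters, in particular Current_Pending_Sector (131 here) and Reallocated_Sector_Ct (74). As a minimal sketch, assuming the ten-column layout printed by "smartctl -A" above, the raw values can be pulled out with awk. The function names smart_raw and smart_check are hypothetical, and the list of watched attributes is my own assumption, not something smartd prescribes.

```shell
# smart_raw NAME
# Reads `smartctl -A` output on stdin and prints the RAW_VALUE
# (10th column) of the attribute called NAME.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# smart_check
# Reads `smartctl -A` output on stdin and prints one warning per watched
# attribute whose raw value is above zero. Exit status is 1 if any
# warning was printed, 0 otherwise.
smart_check() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            if ($10 + 0 > 0) {
                printf "WARNING: %s = %s\n", $2, $10
                bad = 1
            }
        }
        END { exit bad + 0 }
    '
}
```

Typical use would be something like "smartctl -A /dev/hdd | smart_check" from a cron job, on top of the mails smartd already sends.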
I removed the drive from the system, leading the RAID array to run degraded.
I then got a new notification, this time from the RAID system:
<code>Subject : DegradedArray event on /dev/md0:gateway.domain.be

This is an automatically generated mail message from mdadm
running on gateway.domain.be

A DegradedArray event had been detected on md device /dev/md0.

Faithfully yours, etc.</code>
I did a low-level format, which cleared the pending sectors (I am not sure how long the drive will keep working, though).
I was then able to put the disk back in and rebuild the RAID array. Here are the steps:
When you look at a "normal" array, you see something like this:
<code># cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hdc1 hdd1
      58613056 blocks [2/2] [UU]

unused devices: &lt;none&gt;</code>
That's the normal state - what you want it to look like. When a drive has failed and been replaced, it looks like this:
<code># cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hdc1
      58613056 blocks [2/1] [_U]

unused devices: &lt;none&gt;</code>
Notice that it no longer lists the failed drive's partition, and that an underscore appears in place of a "U". This means that only one drive is active in this array: we have no mirror.
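The underscore check is easy to script. Here is a rough sketch that scans mdstat-formatted text and prints the name of each array whose status brackets contain an underscore. The function name degraded_arrays is made up for this example, and it assumes the usual layout where the status (for instance [_U]) follows the "mdN :" line.

```shell
# degraded_arrays [FILE]
# Prints the name of every md array in FILE (default: /proc/mdstat)
# whose status brackets, e.g. [_U], contain an underscore.
degraded_arrays() {
    awk '
        # Remember which array the following lines belong to.
        /^md[0-9]+ :/ { name = $1 }

        # The status field is made only of U and _ characters, so this
        # does not match [raid1], [2/1] or hdc1[0].
        name != "" && match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Run without arguments it reads the live /proc/mdstat. An empty result means every array has all of its members.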
Another command that shows the state of the RAID drives is "mdadm":
<code>[root @ SERVEUR](1074)# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Wed Aug 4 20:44:29 2004
     Raid Level : raid1
     Array Size : 58613056 (55.90 GiB 60.02 GB)
    Device Size : 58613056 (55.90 GiB 60.02 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Fri Oct 15 06:25:45 2004
          State : dirty, no-errors
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

    Number   Major   Minor   RaidDevice State
       0      22        1        0      active sync   /dev/hdc1
       1       0        0        0      faulty removed
           UUID : 89e85ea3:9e8f8a62:f38a0ead:f24d72e3
         Events : 0.58886</code>
As this shows, we currently only have one drive working in the array.
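The same conclusion can be reached mechanically by comparing "Raid Devices" with "Active Devices". The sketch below assumes the "Label : value" layout of "mdadm -D" shown above. The helper names md_field and md_missing are hypothetical.

```shell
# md_field LABEL
# Reads `mdadm -D` output on stdin and prints the value that follows
# "LABEL :" (for example: md_field 'Raid Devices').
md_field() {
    awk -F' : ' -v label="$1" '
        {
            key = $1
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim the padded label
        }
        key == label { print $2 }
    '
}

# md_missing
# Reads `mdadm -D` output on stdin and prints how many members are
# missing, i.e. Raid Devices minus Active Devices.
md_missing() {
    report=$(cat)
    want=$(printf '%s\n' "$report" | md_field 'Raid Devices')
    have=$(printf '%s\n' "$report" | md_field 'Active Devices')
    echo $((want - have))
}
```

With the output above, "mdadm -D /dev/md0 | md_missing" would print 1, since the array expects 2 devices and only 1 is active.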
Although I already knew that /dev/hdd was the other part of the raid array, you can look at /etc/raidtab to see how the raid was defined:
<code>raiddev /dev/md0
    raid-level            1
    nr-raid-disks         2
    chunk-size            64k
    persistent-superblock 1
    nr-spare-disks        0
    device                /dev/hdc1
    raid-disk             0
    device                /dev/hdd1
    raid-disk             1</code>
To get the mirrored drives working properly again, we need to run fdisk to see what partition we have on the working drive:
<code>[root @ SERVEUR](1070)# fdisk /dev/hdc

The number of cylinders for this disk is set to 7297.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/hdc: 60.0 GB, 60022480896 bytes
255 heads, 63 sectors/track, 7297 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hdc1   *           1        7297    58613121   fd  Linux raid autodetect</code>
Duplicate that on /dev/hdd.
Run "fdisk /dev/hdd", use "n" to create a primary partition numbered 1, use "t" to change its type to "fd" so that it matches, then "w" to write the table to disk.
There seems to be an easier way to duplicate the partition table (not tested!):
<code>sfdisk -d /dev/hdc | sfdisk /dev/hdd</code>
Once this is done, add the new partition back into the array with "raidhotadd":
<code># raidhotadd /dev/md0 /dev/hdd1</code>
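raidhotadd comes from the old raidtools package. On a system that only ships mdadm, the same step is done with "mdadm --manage ... --add". The sketch below only prints the sequence of commands for review, it does not run anything. The function name rebuild_plan and its argument order are invented for this example, and the sfdisk line is the same untested shortcut mentioned above.

```shell
# rebuild_plan ARRAY GOOD_DISK NEW_DISK PARTNUM
# Prints (does not execute) the commands needed to bring NEW_DISK back
# into ARRAY, using GOOD_DISK as the partition table template.
rebuild_plan() {
    array=$1
    good=$2
    new=$3
    part=$4

    # 1. Copy the partition table from the surviving disk.
    echo "sfdisk -d $good | sfdisk $new"
    # 2. Hot-add the new partition; mdadm counterpart of raidhotadd.
    echo "mdadm --manage $array --add $new$part"
    # 3. Look at the resynchronisation progress.
    echo "cat /proc/mdstat"
}
```

For the machine in this post the call would be "rebuild_plan /dev/md0 /dev/hdc /dev/hdd 1". Printing first is deliberate: swapping the two disk arguments in the sfdisk step would overwrite the good disk's partition table.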
The rebuild progress can be followed in /proc/mdstat (the device names in this sample, hda1 and hdb1, differ from the ones used above):
<code># cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hdb1 hda1
      58613056 blocks [2/1] [_U]
      [>....................] recovery = 0.2% (250108/58613056) finish=28.8min speed=30032K/sec</code>
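If you would rather not stare at the progress bar, the percentage and the estimated finish time can be extracted from that line. This is a small sketch, assuming the "recovery = N% ... finish=..." format shown above. The function name recovery_progress is hypothetical.

```shell
# recovery_progress [FILE]
# Prints "<percent> <time left>" for the first recovery line found in
# FILE (default: /proc/mdstat). Prints nothing if no rebuild is running.
recovery_progress() {
    awk '
        /recovery =/ {
            for (i = 2; i <= NF; i++) {
                # The percentage is the field right after "recovery =".
                if ($i == "=" && $(i - 1) == "recovery") pct = $(i + 1)
                # The estimate is glued to its label: finish=28.8min
                if ($i ~ /^finish=/) {
                    fin = $i
                    sub(/^finish=/, "", fin)
                }
            }
            print pct, fin
            exit
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling it in a loop with a "sleep 60" in between gives a cheap progress log. Once it prints nothing, the recovery line is gone and the array status can be checked again.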
After it finishes, it will show:
<code>[root @ SERVEUR](1074)# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Wed Aug 4 20:44:29 2004
     Raid Level : raid1
     Array Size : 58613056 (55.90 GiB 60.02 GB)
    Device Size : 58613056 (55.90 GiB 60.02 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu May 11 22:29:05 2006
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

    Number   Major   Minor   RaidDevice State
       0      22        1        0      active sync   /dev/hdc1
       1      22       65        1      active sync   /dev/hdd1
           UUID : 89e85ea3:9e8f8a62:f38a0ead:f24d72e3
         Events : 0.58886</code>
You can restart the RAID array (raidstart /dev/md0) and remount the drive if needed (e.g., mount /dev/md0 /home).
New link (02/2007): http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array