RAID recovery on SATA drives

fernando and i simulated a RAID failure on jupiter by marking one member of md2 as faulty:

jupiter:~# mdadm /dev/md2 -f /dev/sdd
jupiter:~# 
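
a rough way to see what the array is doing after a disk has been failed is to read /proc/mdstat. the sketch below is not something we ran on jupiter: the function names md_faulty_members and md_is_recovering are made up for illustration, and the parsing assumes the usual mdstat layout (failed members tagged with "(F)", a "recovery =" progress line, and a blank line after each array).

```shell
# Print the members of an md array that the kernel has flagged faulty "(F)".
# $1 = array name as shown in mdstat (e.g. md2)
# $2 = mdstat file to read (optional, defaults to /proc/mdstat)
md_faulty_members() {
    awk -v md="$1" '
        $1 == md && $2 == ":" {
            for (i = 3; i <= NF; i++)
                if ($i ~ /\(F\)/) {
                    name = $i
                    sub(/\[.*/, "", name)   # strip the "[5](F)" style suffix
                    print name
                }
        }
    ' "${2:-/proc/mdstat}"
}

# Succeed (exit 0) while the given array shows a "recovery =" progress line.
md_is_recovering() {
    awk -v md="$1" '
        $1 == md && $2 == ":"  { in_md = 1; next }
        NF == 0                { in_md = 0 }
        in_md && /recovery *=/ { found = 1 }
        END                    { exit !found }
    ' "${2:-/proc/mdstat}"
}

# Hypothetical usage:
#   md_faulty_members md2
#   while md_is_recovering md2; do sleep 60; done
```

the optional second argument is only there so the functions can be tried against a saved copy of mdstat instead of the live file.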

the array automatically started rebuilding onto the hot spare. once the rebuild had finished, i physically removed the disk and tried to read the first 20 bytes of it:

jupiter:~# hd -n 20 /dev/sdd
jupiter:~# 

this produced a bunch of errors in /var/log/syslog that looked like this:

Apr  6 17:05:17 jupiter kernel: 3w-9xxx: scsi0: ERROR: (0x03:0x0203): ADP level 1 error:port=3.
Apr  6 17:05:17 jupiter kernel: SCSI error : <0 0 3 0> return code = 0x8000004
Apr  6 17:05:17 jupiter kernel: Current sdd: sense key Hardware Error
Apr  6 17:05:17 jupiter kernel: Additional sense: Command phase error
Apr  6 17:05:17 jupiter kernel: end_request: I/O error, dev sdd, sector 120
Apr  6 17:05:17 jupiter kernel: Buffer I/O error on device sdd, logical block 15

i then replaced the disk and ran the same hd command again, which produced only these two messages in /var/log/syslog:

Apr  6 17:09:06 jupiter kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x003A): Drive power on reset detected:port=3.
Apr  6 17:09:06 jupiter kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0025): Error flushing cached write data to array:unit=3.
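
to pick the block-layer I/O errors for one disk out of a busy syslog, a small grep wrapper is enough. this is only a sketch: io_errors_for is a made-up name, and the pattern covers just the two message shapes seen in the excerpts above ("I/O error, dev sdd" and "I/O error on device sdd"); other kernels may word these lines differently.

```shell
# Print kernel I/O error lines for one block device from a syslog file.
# $1 = device name without /dev/ (e.g. sdd)
# $2 = log file (optional, defaults to /var/log/syslog)
io_errors_for() {
    grep -E "I/O error(,| on) dev(ice)? $1([, ]|\$)" "${2:-/var/log/syslog}"
}

# Hypothetical usage:
#   io_errors_for sdd
#   io_errors_for sdd /var/log/syslog.0
```

the controller's own 3w-9xxx messages are deliberately not matched, since they name a port rather than a block device.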

i also noticed that sdd was still showing up in /proc/partitions.

next, i tried explicitly removing the device from the kernel's SCSI layer (thanks to some documentation from IBM):

jupiter:~# echo '1' > /sys/bus/scsi/devices/0\:0\:3\:0/delete 
jupiter:~# 

after this command, sdd no longer showed up in /proc/partitions. note that the same deletion could probably also have been issued as:

jupiter:~# echo '1' > /sys/class/scsi_device/0\:0\:3\:0/device/delete
jupiter:~# 
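
the deletion can be wrapped in a small helper that checks the sysfs node exists before writing to it. this is a sketch only: scsi_delete and its optional second argument (an alternative sysfs root for dry runs) are hypothetical, and it uses the /sys/bus/scsi/devices path from the first command above. quoting the path also means the colons need no backslashes.

```shell
# Ask the kernel to drop one SCSI device, addressed as host:channel:target:lun.
# $1 = address (e.g. 0:0:3:0)
# $2 = sysfs mount point (optional, defaults to /sys; useful for dry runs)
scsi_delete() {
    node="${2:-/sys}/bus/scsi/devices/$1/delete"
    if [ ! -e "$node" ]; then
        echo "scsi_delete: $node not found" >&2
        return 1
    fi
    echo 1 > "$node"
}

# Hypothetical usage (as root):
#   scsi_delete 0:0:3:0
```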

to rescan the device, i wrote its channel, target id and lun ('0 3 0') to the host adapter's scan file:

jupiter:~# echo '0 3 0' > /sys/class/scsi_host/host0/scan

unfortunately, the rescan registered the disk as sdi rather than sdd, and put the following lines in /var/log/syslog:

Apr  6 17:36:19 jupiter kernel:   Vendor: AMCC      Model: 9500S-8    DISK   Rev: 2.06
Apr  6 17:36:19 jupiter kernel:   Type:   Direct-Access                      ANSI SCSI revision: 03
Apr  6 17:36:19 jupiter kernel: SCSI device sdi: 781228032 512-byte hdwr sectors (399989 MB)
Apr  6 17:36:19 jupiter kernel: SCSI device sdi: drive cache: write back, no read (daft)
Apr  6 17:36:19 jupiter kernel:  /dev/scsi/host0/bus0/target3/lun0: unknown partition table
Apr  6 17:36:19 jupiter kernel: Attached scsi disk sdi at scsi0, channel 0, id 3, lun 0
Apr  6 17:36:20 jupiter scsi.agent[10456]:      sd_mod: loaded sucessfully (for disk)
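
the rescan step can be sketched as a helper that spells out what the three numbers mean. scsi_rescan and its optional fifth argument (an alternative sysfs root for dry runs) are hypothetical names, not an existing tool. the scan file takes "channel target lun", and a "-" in any position acts as a wildcard.

```shell
# Ask a SCSI host adapter to probe one channel/target/lun triple.
# $1 = host number (e.g. 0), $2 = channel, $3 = target id, $4 = lun
# $5 = sysfs mount point (optional, defaults to /sys; useful for dry runs)
scsi_rescan() {
    node="${5:-/sys}/class/scsi_host/host$1/scan"
    if [ ! -e "$node" ]; then
        echo "scsi_rescan: $node not found" >&2
        return 1
    fi
    echo "$2 $3 $4" > "$node"
}

# Hypothetical usage (as root), matching the rescan of port 3 on host0:
#   scsi_rescan 0 0 3 0
```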

note that sdi showed up in /proc/partitions, but only at the very end of the list of block devices.
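
checking whether a name is (still) registered does not need eyeballing the whole list. a minimal sketch, assuming the usual four-column /proc/partitions layout (major, minor, #blocks, name); in_partitions is a made-up helper name.

```shell
# Succeed (exit 0) if a block device name is listed in /proc/partitions.
# $1 = device name without /dev/ (e.g. sdd)
# $2 = partitions file (optional, defaults to /proc/partitions)
in_partitions() {
    awk -v dev="$1" '$4 == dev { found = 1 } END { exit !found }' \
        "${2:-/proc/partitions}"
}

# Hypothetical usage:
#   in_partitions sdd && echo "sdd is still registered"
#   in_partitions sdi || echo "sdi has not shown up yet"
```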

i then patched up the RAID array by explicitly removing the stale sdd member and hot-adding the newly registered disk:

jupiter:~# mdadm /dev/md2 -r /dev/sdd
jupiter:~# mdadm /dev/md2 -a /dev/sdi
jupiter:~# 
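
the two steps can be chained so that the hot-add only happens if the removal succeeded. this is a sketch with a made-up function name (md_swap_member); it uses the long forms of the same mdadm options, --remove for -r and --add for -a.

```shell
# Remove a failed member from an md array, then hot-add its replacement.
# The add is skipped if the remove fails.
# $1 = array (e.g. /dev/md2), $2 = old member, $3 = new member
md_swap_member() {
    mdadm "$1" --remove "$2" && mdadm "$1" --add "$3"
}

# Hypothetical usage (as root):
#   md_swap_member /dev/md2 /dev/sdd /dev/sdi
```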

real HW failure

fernando simulated a hard hardware failure by yanking the disk from sedna. sedna was not happy until the device was plugged back in. some notes about this:

http://lxr.linux.no/source/drivers/md/md.c
http://groups.google.com/group/mlist.linux.raid/browse_thread/thread/45fa19f1bd528a2a/63b96c9230222d84%2363b96c9230222d84

(remember to paste in some logs from sedna here)