RAID recovery on SATA drives
Fernando and I simulated a RAID failure on jupiter by marking /dev/sdd as faulty in /dev/md2:
jupiter:~# mdadm /dev/md2 -f /dev/sdd
jupiter:~#
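Once a member has been marked faulty, any rebuild onto a spare can be followed in /proc/mdstat. The following is only a sketch for checking whether an array is still rebuilding; the function name md_rebuilding is made up for this page, and it assumes the usual mdstat layout in which each array's stanza starts at column 0 and its progress line mentions "recovery" or "resync".

```shell
# Hypothetical helper (not part of mdadm): succeeds while the named array
# shows a recovery/resync line in mdstat-style text.
#   $1 = array name, e.g. md2
#   $2 = file to read (defaults to /proc/mdstat)
md_rebuilding() {
    awk -v md="$1" '
        /^[^ \t]/ { in_md = ($1 == md); next }      # unindented line starts a stanza
        in_md && /recovery|resync/ { found = 1 }    # progress line inside our stanza
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/mdstat}"
}
```

It could be used in a loop such as: while md_rebuilding md2; do sleep 60; done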
The array automatically started rebuilding onto the hot spare. When the rebuild finished, I physically removed the disk and tried to read its first 20 bytes:
jupiter:~# hd -n 20 /dev/sdd
jupiter:~#
This produced a series of errors in /var/log/syslog that looked like this:
Apr 6 17:05:17 jupiter kernel: 3w-9xxx: scsi0: ERROR: (0x03:0x0203): ADP level 1 error:port=3.
Apr 6 17:05:17 jupiter kernel: SCSI error : <0 0 3 0> return code = 0x8000004
Apr 6 17:05:17 jupiter kernel: Current sdd: sense key Hardware Error
Apr 6 17:05:17 jupiter kernel: Additional sense: Command phase error
Apr 6 17:05:17 jupiter kernel: end_request: I/O error, dev sdd, sector 120
Apr 6 17:05:17 jupiter kernel: Buffer I/O error on device sdd, logical block 15
I then replaced the disk and ran the same hd command again, which produced only this brief error in /var/log/syslog:
Apr 6 17:09:06 jupiter kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x003A): Drive power on reset detected:port=3.
Apr 6 17:09:06 jupiter kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0025): Error flushing cached write data to array:unit=3.
I also noticed that sdd was still showing up in /proc/partitions.
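A check like this can be scripted. This is a minimal sketch, assuming the standard /proc/partitions layout with the device name in the fourth column; the function name in_partitions is made up for this page.

```shell
# Hypothetical helper: succeeds if a block device name is listed in a
# /proc/partitions-style file.
#   $1 = device name without /dev/, e.g. sdd
#   $2 = file to read (defaults to /proc/partitions)
in_partitions() {
    awk -v dev="$1" '
        $4 == dev { found = 1 }      # exact match on the name column
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/partitions}"
}
```

For example: in_partitions sdd && echo "sdd is still registered"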
Next, I explicitly removed the device's SCSI registration from the kernel (thanks to some documentation from IBM):
jupiter:~# echo '1' > /sys/bus/scsi/devices/0\:0\:3\:0/delete
jupiter:~#
After this command, sdd no longer showed up in /proc/partitions. Note that the same deletion could probably also have been issued through the scsi_device class path:
jupiter:~# echo '1' > /sys/class/scsi_device/0\:0\:3\:0/device/delete
jupiter:~#
To rescan the device, I wrote its channel, target and LUN to the scan file of host0:
echo '0 3 0' > /sys/class/scsi_host/host0/scan
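The delete and scan commands above both derive from the same SCSI address, written host:channel:target:lun (0:0:3:0 here). As a rough sketch, the helper below prints the two commands for a given address so they can be reviewed before being run as root; it only prints and never writes to sysfs. The name scsi_cmds is made up for this page.

```shell
# Hypothetical helper: print the sysfs delete and rescan commands for a
# SCSI address given as host:channel:target:lun. Nothing is executed.
scsi_cmds() {
    old_ifs=$IFS
    IFS=:
    set -- $1            # split the address on ":" into $1..$4
    IFS=$old_ifs
    if [ $# -ne 4 ]; then
        echo "expected host:channel:target:lun" >&2
        return 1
    fi
    # Remove the device from the kernel's SCSI layer.
    echo "echo 1 > /sys/bus/scsi/devices/$1:$2:$3:$4/delete"
    # Ask the host adapter to probe that channel/target/lun again.
    echo "echo '$2 $3 $4' > /sys/class/scsi_host/host$1/scan"
}
```

Running scsi_cmds 0:0:3:0 prints the same delete and scan commands used here, minus the optional backslashes before the colons.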
Unfortunately, the rescan registered the disk as sdi rather than sdd, and put the following lines in /var/log/syslog:
Apr 6 17:36:19 jupiter kernel: Vendor: AMCC Model: 9500S-8 DISK Rev: 2.06
Apr 6 17:36:19 jupiter kernel: Type: Direct-Access ANSI SCSI revision: 03
Apr 6 17:36:19 jupiter kernel: SCSI device sdi: 781228032 512-byte hdwr sectors (399989 MB)
Apr 6 17:36:19 jupiter kernel: SCSI device sdi: drive cache: write back, no read (daft)
Apr 6 17:36:19 jupiter kernel: /dev/scsi/host0/bus0/target3/lun0: unknown partition table
Apr 6 17:36:19 jupiter kernel: Attached scsi disk sdi at scsi0, channel 0, id 3, lun 0
Apr 6 17:36:20 jupiter scsi.agent[10456]: sd_mod: loaded sucessfully (for disk)
Note that sdi showed up in /proc/partitions, but only at the very end of the list of block devices.
I then repaired the RAID array by explicitly removing the stale sdd entry and hot-adding the newly registered disk:
jupiter:~# mdadm /dev/md2 -r /dev/sdd
jupiter:~# mdadm /dev/md2 -a /dev/sdi
jupiter:~#
