Repairing a RAID1 Array

Most of the CAL servers have two disks with the OS on them, configured in a dual-RAID1: one small RAID1 at the start of the disk as a boot partition, and one big RAID1 that is the rest of the disk and acts as a physical volume for LVM.

If a disk fails (as mars did), here's a safe (non-hotswap) way to replace it. Note: I'll use /dev/baddisk, /dev/gooddisk, and /dev/newdisk to refer to the disks, and /dev/baddisk1, etc. to refer to the partitions. A device that actually fails will have a real name, so replace the device names appropriately.

remove all partitions of the bad disk from the existing raid arrays:

mdadm --fail /dev/md0 /dev/baddisk1
mdadm --remove /dev/md0 /dev/baddisk1
mdadm --fail /dev/md1 /dev/baddisk2
mdadm --remove /dev/md1 /dev/baddisk2
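
if you're not sure which array each partition of the bad disk belongs to, /proc/mdstat lists the members of every array. the sketch below is one way to pull that out. md_members_of_disk is a hypothetical helper name (not part of mdadm), and it assumes the usual mdstat layout where each array line looks like "md1 : active raid1 sdb2[1] sda2[0]":

```shell
# md_members_of_disk DISK [MDSTAT_FILE]
# hypothetical helper: prints "ARRAY PARTITION" for each partition of DISK
# (kernel name without /dev/, e.g. "sdb") that is a member of an md array.
md_members_of_disk() {
    awk -v disk="$1" '
        $1 ~ /^md/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i !~ /\[[0-9]+\]/) continue   # not a member entry
                part = $i
                sub(/\[.*/, "", part)              # drop the slot number and any flags
                rest = substr(part, length(disk) + 1)
                # keep only partitions of DISK: the name, then a partition number
                if (index(part, disk) == 1 && rest ~ /^p?[0-9]+$/)
                    print $1, part
            }
        }
    ' "${2:-/proc/mdstat}"
}

# usage (substitute the real kernel name of the bad disk):
#   md_members_of_disk sdb
```

each output line names an array and the partition to pass to mdadm --fail and mdadm --remove.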

use smartctl to record the serial number of the bad disk so that you can confirm you pulled the right one when it comes out:

smartctl -a -d ata /dev/baddisk
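
to keep just the serial number for later comparison, something like the following could work. disk_serial is a hypothetical helper name, and the sketch assumes smartctl prints a line starting with "Serial Number:", which is its usual format for ATA disks:

```shell
# disk_serial
# hypothetical helper: reads smartctl output on stdin and prints the value
# of the first "Serial Number:" line, without the label.
disk_serial() {
    awk '/^Serial Number:/ {
        sub(/^Serial Number:[ \t]*/, "")   # strip the label and padding
        print
        exit
    }'
}

# usage (the output file name is only an example):
#   smartctl -a -d ata /dev/baddisk | disk_serial > /root/baddisk.serial
```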

make sure you've got grub installed on the other disk so you'll be able to reboot:

grub-install /dev/gooddisk

shut down the machine, yank the disk, check its serial number against the one you recorded from smartctl, put in the new one. boot the machine again.

examine /proc/partitions to make sure that the sizes of /dev/newdisk and /dev/gooddisk are the same (the sizes of the partitions won't match yet).
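
if you'd rather not compare the numbers by eye, here is a minimal sketch of the same check. same_disk_size is a hypothetical helper name, and it assumes the standard /proc/partitions columns (major, minor, #blocks, name):

```shell
# same_disk_size DISK_A DISK_B [PARTITIONS_FILE]
# hypothetical helper: succeeds (exit 0) only if both kernel names (without
# /dev/) are listed and report the same number of blocks.
same_disk_size() {
    awk -v a="$1" -v b="$2" '
        $4 == a { sa = $3 }
        $4 == b { sb = $3 }
        END { exit !(sa != "" && sa == sb) }
    ' "${3:-/proc/partitions}"
}

# usage (substitute the real kernel names):
#   same_disk_size sda sdb && echo "sizes match"
```

a non-zero exit status means either the sizes differ or one of the names was not found, so it's worth looking at /proc/partitions directly in that case.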

copy the master boot record (which includes the partition table) from the good disk to the new disk:

dd if=/dev/gooddisk of=/dev/newdisk bs=512 count=1

reread the partition table of the new disk so that the partitions are available:

hdparm -z /dev/newdisk

add the new partitions to the appropriate mirrors:

mdadm --add /dev/md0 /dev/newdisk1
mdadm --add /dev/md1 /dev/newdisk2

make sure grub is installed properly on the new disk so you'll be able to boot from it should you need to:

grub-install /dev/newdisk

and that's it! if you want to monitor the progress of the RAID rebuild, you can do so with:

cat /proc/mdstat
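
while an array is still rebuilding, its status line in /proc/mdstat shows an underscore for the missing member (for example [U_] instead of [UU]). the sketch below lists arrays in that state. degraded_arrays is a hypothetical helper name, and it assumes that status-line format:

```shell
# degraded_arrays [MDSTAT_FILE]
# hypothetical helper: prints the name of every md array whose member
# status (e.g. "[U_]") still contains an underscore.
degraded_arrays() {
    awk '
        $1 ~ /^md/ && $2 == ":" { name = $1 }      # remember the current array
        /\[[U_]*_[U_]*\]/       { print name }     # status with a missing member
    ' "${1:-/proc/mdstat}"
}

# usage: wait until no array is listed any more
#   while [ -n "$(degraded_arrays)" ]; do sleep 60; done
```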