"Fixing" a RAID
Daniel Eriksson
daniel_k_eriksson at telia.com
Thu Jun 19 10:11:59 UTC 2008
Ryan Coleman wrote:
> Jun 4 23:02:28 testserver kernel: ar0: 715425MB <HighPoint v3 RocketRAID RAID5 (stripe 64 KB)> status: READY
> Jun 4 23:02:28 testserver kernel: ar0: disk0 READY using ad13 at ata6-slave
> Jun 4 23:02:28 testserver kernel: ar0: disk1 READY using ad16 at ata8-master
> Jun 4 23:02:28 testserver kernel: ar0: disk2 READY using ad15 at ata7-slave
> Jun 4 23:02:28 testserver kernel: ar0: disk3 READY using ad17 at ata8-slave
> Jun 4 23:05:35 testserver kernel: g_vfs_done():ar0s1c[READ(offset=501963358208, length=16384)]error = 5
> ...
My guess is that the rebuild failure is due to unreadable sectors on one
(or more) of the original three drives.
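A quick way to confirm that without pulling the array apart is to look at
the SMART attributes of each member drive. The members sit behind the
software ataraid, so smartctl can talk to them directly; something along
these lines (smartmontools, device names taken from the dmesg you quoted)
should show a non-zero Current_Pending_Sector count on any drive that has
unreadable sectors:

  # repeat for ad15, ad16 and ad17
  smartctl -A /dev/ad13 | egrep 'Reallocated_Sector|Current_Pending_Sector'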
I recently had this happen to me on an 8 x 1 TB RAID-5 array on a
Highpoint RocketRAID 2340 controller. For some unknown reason two drives
developed unreadable sectors within hours of each other. To make a long
story short, the way I "fixed" this was as follows:
1. Used a tool I got from Highpoint tech-support to re-init the array
information (so the array was no longer marked as broken).
2. Unplugged both drives and hooked them up to another computer using a
regular SATA controller.
3. One of the drives was put through a complete "recondition" cycle (a).
4. The other drive was put through a partial "recondition" cycle (b).
5. I hooked up both drives to the 2340 controller again. The BIOS
immediately marked the array as degraded (because it didn't recognize
the wiped drive as part of the array), and I could re-add the wiped
drive so a rebuild of the array could start.
6. I finally ran a "zpool scrub" on the tank, and restored the few files
that had checksum errors.
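For what it's worth, step 6 is just the standard ZFS routine; assuming the
pool is named "tank", it boils down to something like:

  zpool scrub tank        # re-read and verify every block in the pool
  zpool status -v tank    # after the scrub finishes, lists any files with
                          # unrecoverable checksum errors

The files it lists are the ones I had to restore.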
(a) I tried to run a SMART long selftest, but it failed. I then
completely wiped the drive by writing zeroes to the entire surface,
allowing the firmware to remap the bad sectors. After this procedure the
long selftest succeeded. I finally used a diagnostic program from the
drive vendor (Western Digital) to again verify that the drive was
working properly.
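For the record, roughly the same procedure can be done with smartmontools
and plain dd instead of the vendor utility. Something like the following
(here /dev/ad13 is only an example, use whatever device node the drive gets
on the machine you plug it into, and note that the dd line destroys
everything on the drive):

  smartctl -t long /dev/ad13          # start the long selftest (takes hours)
  smartctl -l selftest /dev/ad13      # check the result once it has finished
  dd if=/dev/zero of=/dev/ad13 bs=1m  # zero the whole surface so the firmware
                                      # can remap the bad sectors
  smartctl -t long /dev/ad13          # selftest again to confirm the drive
                                      # is healthy now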
(b) The SMART long selftest failed the first time, but after running a
surface scan using the diagnostic program from Western Digital the
selftest passed. I'm pretty sure the diagnostic program remapped the bad
sector, replacing it with a blank one. At least the program warned me to
back up all data before starting the surface scan. Alternatively I could
have used dd (with offset) to write to just the failed sector (available
in the SMART selftest log).
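That trick would look roughly like this. The LBA comes from the
LBA_of_first_error column in the selftest log and should be counted in
512-byte sectors, hence bs=512 so seek uses the same unit; the number below
is made up, and writing there destroys that one sector, so be very sure of
both the device and the offset:

  smartctl -l selftest /dev/ad13    # note the LBA_of_first_error value
  dd if=/dev/zero of=/dev/ad13 bs=512 count=1 seek=234567890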
If I were you I would run all three drives through a SMART long
selftest. I'm sure you'll find that at least one of them will fail the
selftest. Use something like SpinRite 6 to recover the drive, or use dd
/ dd_rescue to copy the data to a fresh drive. Once all three of the
original drives pass a long selftest the array should be able to finish
a rebuild using a fourth (blank) drive.
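Concretely, that amounts to something like this (adjust the device names to
whichever three drives are the originals; /dev/ad20 is only a placeholder
for the fresh drive, and dd_rescue takes its arguments as infile then
outfile, so don't swap them):

  # start a long selftest on each original drive, then check the results a
  # few hours later (repeat for the other two drives)
  smartctl -t long /dev/ad13
  smartctl -l selftest /dev/ad13
  # if a drive fails the selftest, copy whatever is readable to a fresh drive
  dd_rescue /dev/ad13 /dev/ad20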
By the way, don't try to use SpinRite 6 on 1 TB drives; it will fail
halfway through with a division-by-zero error. I haven't tried it on any
500 GB drives yet.
/Daniel Eriksson