ZFS (zpool) doesn't detect failed drive
Steve Polyack
korvus at comcast.net
Wed May 5 15:17:14 UTC 2010
On 05/05/10 10:56, Harald Schmalzbauer wrote:
> Harald Schmalzbauer schrieb am 05.05.2010 14:41 (localtime):
>> Hello,
>>
>> one drive of my mirror failed today, but 'zpool status' shows it
>> "online".
>> Every process using a ZFS mount hangs. Also 'zpool offline /dev/ad1'
>> hangs indefinitely.
> ...
> Sorry, I made an error with zpool create. Somehow the little word
> "mirror" must have been lost. So the pool wasn't a mirror but a
> stripe. Then of course I can't take one vdev offline. Sorry for the
> noise.
> But I took the opportunity to do some tests with that failing drive
> and created a _real_ mirror. That works without failures, but using
> the mirror again leads to:
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ata3: port is not ready (timeout 10000ms) tfd = 00000080
> ata3: hardware reset timeout
> ad1: FAILURE - device detached
>
> Now zpool reports the vdev ad1 as still online, although it has been
> detached and 'atacontrol list' doesn't show it anymore:
>
> zpool status
>   pool: URUBAmirrorP1
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are
>         unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: none requested
> config:
>
>         NAME             STATE     READ WRITE CKSUM
>         URUBAmirrorP1    ONLINE       0     0     0
>           mirror         ONLINE       0     0     0
>             ad1          ONLINE       3  302K     0
>             ad2          ONLINE       0     0     0
>
> errors: No known data errors
>
> atacontrol list
> ATA channel 2:
>     Master:  ad0 <TRANSCEND/20090520> SATA revision 1.x
>     Slave:       no device present
> ATA channel 3:
>     Master:      no device present
>     Slave:       no device present
> ATA channel 4:
>     Master:  ad2 <SAMSUNG HD154UI/1AG01118> SATA revision 2.x
>     Slave:       no device present
> ATA channel 5:
>     Master:  ad3 <ST3750640NS/3.AEG> SATA revision 1.x
>     Slave:       no device present
>
> How should such a failure be handled?
> Do I have to manually mark the drive offline for zpool?
>
> Thanks,
>
> -Harry
>
You may want to try newer controller drivers like ahci(4) if possible.
Otherwise, building the kernel with ATA_CAM may accomplish something
similar. I'm not sure, but I suspect that the newer ATA/CAM
subsystem feeds the proper notifications back to ZFS.
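For reference, a minimal sketch of how ahci(4) is typically loaded at
boot, assuming your controller is AHCI-capable and the setting goes in
/boot/loader.conf. The ATA_CAM alternative is a kernel config option
rather than a loader setting, so it appears here only as a comment:

```shell
# /boot/loader.conf (sketch; assumes an AHCI-capable controller)
# Load the CAM-based ahci(4) driver as a module at boot.
ahci_load="YES"

# Alternative, in a custom kernel config file instead of loader.conf:
#   options ATA_CAM
```

Either way, expect the disks to show up as adaX afterwards.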
I use many drives on the siis(4) driver, which is CAM-enabled, and
haven't had any issues. However, I have not had an outright drive
failure. I do recall testing situations where we would yank a working
drive, and I seem to remember it working correctly after the last set of
CAM improvements.
It may not be something you can try on a production system, but if you
can experiment, it's worth a shot. Note that your device names WILL
change to adaX instead of adX. I would definitely recommend that you
label the disks with glabel(8) and create the zpool/vdevs using the
label devices instead, to circumvent any future problems associated
with device numbering.
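A rough sketch of the labeling approach, for a freshly created pool. The
pool name (tank), label names (disk0, disk1) and device names (ada1,
ada2) are made up for illustration. The script only prints the commands
it would run, so you can check them before touching any disk:

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# Pool, label and device names below are hypothetical examples.

run() {
    # Print the command line; change 'echo "$@"' to '"$@"' to run it for real.
    echo "$@"
}

label_and_mirror() {
    pool=$1; shift
    vdevs=""
    i=0
    for dev in "$@"; do
        # glabel(8) makes the disk available as /dev/label/disk$i
        run glabel label "disk$i" "/dev/$dev"
        vdevs="$vdevs label/disk$i"
        i=$((i + 1))
    done
    # $vdevs is left unquoted on purpose so it splits into separate arguments.
    run zpool create "$pool" mirror $vdevs
}

label_and_mirror tank ada1 ada2
```

The pool then refers to label/disk0 and label/disk1, so it should not
matter whether the kernel calls the disks adX or adaX. Don't run the
glabel step against disks that already belong to a pool.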
Steve