ZFS resilver from disk with bad sectors constantly restarts
Joseph Mingrone
jrm at ftfl.ca
Wed Dec 28 15:42:57 UTC 2016
Dmitry Marakasov <amdmi3 at amdmi3.ru> writes:
> I've just got a case where resilvering a new replacement disk in raidz2
> never finished.
> The problem: one disk in the raidz is failing with a large number of
> unreadable sectors. It was replaced with a spare. The resilver, though,
> constantly restarts, and the log is full of read errors from the bad disk.
> It looks like this:
> ---
>   pool: spool
>  state: ONLINE
> status: One or more devices is currently being resilvered. The pool will
>         continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>   scan: resilver in progress since Wed Oct 28 05:26:28 2015
>         369G scanned out of 9,87T at 123M/s, 22h29m to go
>         41,4G resilvered, 3,65% done
> config:
>
>         NAME                    STATE     READ WRITE CKSUM
>         spool                   ONLINE       0     0     0
>           raidz1-0              ONLINE       0     0     0
>             ada0                ONLINE       0     0     0
>             ada1                ONLINE       0     0     0
>             spare-2             ONLINE       0     0   733
>               ada11             ONLINE       0     0     0
>               ada2              ONLINE       0     0     0  (resilvering)
>           raidz1-1              ONLINE       0     0     0
>             ada3                ONLINE       0     0     0
>             ada4                ONLINE       0     0     0
>             ada5                ONLINE       0     0     0
>           raidz1-2              ONLINE       0     0     0
>             ada6                ONLINE       0     0     0
>             ada7                ONLINE       0     0     0
>             ada10               ONLINE       0     0     0
>         spares
>           588540573008830286    INUSE     was /dev/ada2
>
> errors: No known data errors
> ---
> The `resilver in progress since' date is constantly reset, so the resilver
> progress cannot get beyond 5% or so. My guess is that this happens on
> read errors on ada11. I think I've seen (resilvering) on the ada11 line
> a couple of times.
> In the end I had to offline ada11, and after that the resilver completed
> in under 16 hours. However, the situation doesn't seem normal: I'd
> prefer not to lose redundancy by offlining the dying disk and still be
> able to use it for resilvering (imagine there were bad sectors on ada0/1
> as well, but not overlapping with the bad sectors on ada11), or at least
> to get some more verbose indication of why the resilver is constantly
> restarted.
> I should also note that this is an outdated FreeBSD 9.1, so maybe the
> problem has been fixed already.
We have been dealing with what seems to be the same issue on 11.0-RELEASE with
a pool of two raidz1 vdevs. (You said your pool was raidz2, but your zpool
status output shows raidz1.) In our case, the problem disk had checksum
mismatches and SMART was reporting errors, but it was still online. The
resilver would run for many hours and then restart. This loop went on for a
few days. As in your case, once we offlined the problem disk, the replacement
finished.
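
Since zpool status gives no explicit indication that a resilver has restarted,
one way to at least notice the loop is to watch the `resilver in progress
since' timestamp and flag when it moves forward. Below is a minimal sketch of
that idea, not a tested tool. The function names are hypothetical, and the
timestamp format is assumed from the status output quoted above; other ZFS
versions or locales may print it differently.

```python
import re
from datetime import datetime

# Pattern taken from the "scan:" line in the quoted status output (assumption:
# the wording and date format match what your ZFS version prints).
SCAN_RE = re.compile(r"resilver in progress since (.+)$", re.MULTILINE)


def resilver_start(status_text):
    """Return the resilver start time found in `zpool status` text, or None."""
    match = SCAN_RE.search(status_text)
    if match is None:
        return None
    # Example value: "Wed Oct 28 05:26:28 2015"
    return datetime.strptime(match.group(1).strip(), "%a %b %d %H:%M:%S %Y")


def resilver_restarted(previous, current):
    """True when both samples show a resilver and the start time moved forward."""
    return previous is not None and current is not None and current > previous
```

To use it, capture the output of `zpool status spool` periodically (from cron
or a small loop), pass each capture to resilver_start(), and compare
consecutive results with resilver_restarted(). This only detects the restart;
it says nothing about the cause.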