ZFS resilver from disk with bad sectors constantly restarts

Wed Dec 28 15:42:57 UTC 2016

Dmitry Marakasov <amdmi3 at amdmi3.ru> writes:
> I've just got a case where resilvering a new replacement disk in raidz2
> never finished.

> The problem: one disk in the raidz is failing with a large number of
> unreadable sectors. It is being replaced with a spare. The resilver,
> though, is constantly restarted, with the log full of read errors from
> the bad disk.

> It looks like this:

> ---
>   pool: spool
>  state: ONLINE
> status: One or more devices is currently being resilvered.  The pool will
> 	continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>   scan: resilver in progress since Wed Oct 28 05:26:28 2015
>         369G scanned out of 9,87T at 123M/s, 22h29m to go
>         41,4G resilvered, 3,65% done
> config:

> 	NAME                  STATE     READ WRITE CKSUM
> 	spool                 ONLINE       0     0     0
> 	  raidz1-0            ONLINE       0     0     0
> 	    ada0              ONLINE       0     0     0
> 	    ada1              ONLINE       0     0     0
> 	    spare-2           ONLINE       0     0   733
> 	      ada11           ONLINE       0     0     0
> 	      ada2            ONLINE       0     0     0  (resilvering)
> 	  raidz1-1            ONLINE       0     0     0
> 	    ada3              ONLINE       0     0     0
> 	    ada4              ONLINE       0     0     0
> 	    ada5              ONLINE       0     0     0
> 	  raidz1-2            ONLINE       0     0     0
> 	    ada6              ONLINE       0     0     0
> 	    ada7              ONLINE       0     0     0
> 	    ada10             ONLINE       0     0     0
> 	spares
> 	  588540573008830286  INUSE     was /dev/ada2

> errors: No known data errors
> ---

> The `resilver in progress since' date is constantly reset, so resilver
> progress cannot get beyond 5% or so. My guess is that it happens on
> read errors on ada11. I think I've seen (resilvering) on the ada11 line
> a couple of times.

> In the end I had to offline ada11, and after that the resilver completed
> in under 16 hours. However, the situation doesn't seem normal: I'd
> prefer not to lose redundancy by offlining the dying disk and still be
> able to use it for resilvering (imagine there were bad sectors on ada0/1
> as well, but not intersecting with bad sectors on ada11), or at least
> to get some more verbose indication of why the resilver is constantly
> restarted.

> I should also note that this is an outdated FreeBSD 9.1, so maybe the
> problem has been fixed already.

We have been dealing with what seems to be the same issue on 11.0-RELEASE, on a
pool made of two raidz1 vdevs.  You said that your issue was with a raidz2, but
your zpool status output shows raidz1.  In our case the problem disk had
checksum mismatches and SMART was reporting errors, but the disk was still
online.  The resilver would run for many hours and then restart.  This loop
went on for a few days.  As in your case, once we offlined the problem disk,
the replacement finished.
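
Since zpool status gives no explicit indication that a resilver has been
restarted, one way to confirm the loop is to save the status output
periodically and watch whether the `resilver in progress since' date moves
forward.  Below is a minimal sketch of that idea in Python.  The function
names are made up for illustration, and the date format is assumed from the
output quoted above; other releases or locales may print it differently.

```python
import re
from datetime import datetime

# Matches the scan line as shown in the quoted output, e.g.
#   scan: resilver in progress since Wed Oct 28 05:26:28 2015
# The date layout is an assumption taken from that output.
SCAN_RE = re.compile(
    r"scan:\s+resilver in progress since\s+(.+)$", re.MULTILINE
)


def resilver_start(status_text):
    """Return the resilver start time found in status text, or None."""
    match = SCAN_RE.search(status_text)
    if match is None:
        return None
    return datetime.strptime(match.group(1).strip(), "%a %b %d %H:%M:%S %Y")


def count_restarts(samples):
    """Count how many times the start time moves forward between samples.

    `samples` is an iterable of saved status outputs in chronological order.
    Samples without a resilver line are skipped.
    """
    restarts = 0
    previous = None
    for text in samples:
        current = resilver_start(text)
        if current is None:
            continue
        if previous is not None and current > previous:
            restarts += 1
        previous = current
    return restarts
```

The samples could be collected by saving the output of `zpool status spool`
every few minutes (from cron, for example) and passing the saved texts to
count_restarts() in order.  A non-zero result means the start date was reset
at least once between samples.  It does not say why, so the read errors in
the system log still have to be checked separately.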