ZFS errors on the array but not the disk.
Alan Somers
asomers at freebsd.org
Fri Oct 24 15:33:25 UTC 2014
On Thu, Oct 23, 2014 at 11:37 PM, Zaphod Beeblebrox <zbeeble at gmail.com> wrote:
> What does it mean when checksum errors appear on the array (and the vdev)
> but not on any of the disks? See the paste below. One would think there
> isn't some ephemeral data stored anywhere other than on the disks, yet
> "cksum" errors show up only on the vdev and array lines. Help?
>
> [2:17:316]root@virtual:/vr2/torrent/in> zpool status
> pool: vr2
> state: ONLINE
> status: One or more devices is currently being resilvered. The pool will
> continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
> scan: resilver in progress since Thu Oct 23 23:11:29 2014
> 1.53T scanned out of 22.6T at 62.4M/s, 98h23m to go
> 119G resilvered, 6.79% done
> config:
>
> NAME               STATE     READ WRITE CKSUM
> vr2                ONLINE       0     0    36
>   raidz1-0         ONLINE       0     0    72
>     label/vr2-d0   ONLINE       0     0     0
>     label/vr2-d1   ONLINE       0     0     0
>     gpt/vr2-d2c    ONLINE       0     0     0  block size: 512B configured, 4096B native  (resilvering)
>     gpt/vr2-d3b    ONLINE       0     0     0  block size: 512B configured, 4096B native
>     gpt/vr2-d4a    ONLINE       0     0     0  block size: 512B configured, 4096B native
>     ada14          ONLINE       0     0     0
>     label/vr2-d6   ONLINE       0     0     0
>     label/vr2-d7c  ONLINE       0     0     0
>     label/vr2-d8   ONLINE       0     0     0
>   raidz1-1         ONLINE       0     0     0
>     gpt/vr2-e0     ONLINE       0     0     0  block size: 512B configured, 4096B native
>     gpt/vr2-e1     ONLINE       0     0     0  block size: 512B configured, 4096B native
>     gpt/vr2-e2     ONLINE       0     0     0  block size: 512B configured, 4096B native
>     gpt/vr2-e3     ONLINE       0     0     0
>     gpt/vr2-e4     ONLINE       0     0     0  block size: 512B configured, 4096B native
>     gpt/vr2-e5     ONLINE       0     0     0  block size: 512B configured, 4096B native
>     gpt/vr2-e6     ONLINE       0     0     0  block size: 512B configured, 4096B native
>     gpt/vr2-e7     ONLINE       0     0     0  block size: 512B configured, 4096B native
>
> errors: 43 data errors, use '-v' for a list
The checksum errors will appear on the raidz vdev instead of a leaf if
vdev_raidz.c can't determine which leaf vdev was responsible. This
could happen if two or more leaf vdevs return bad data for the same
block, which would also lead to unrecoverable data errors. I see that
you have some unrecoverable data errors, so maybe that's what happened
to you.
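
If you're curious what that attribution logic looks like, here is a toy
C model of it. This is a sketch, not the actual vdev_raidz.c code: one
XOR parity column and a stand-in checksum, and all the names in it are
made up. The shape is the same, though: assume each child in turn
returned bad data, rebuild that child's column from parity, and check
whether the result passes the block checksum. If no single assumption
makes the checksum pass, the error can't be pinned on one leaf, so the
raidz vdev is charged instead. ("zpool status -v vr2" will name the
files whose blocks hit that case.)

/* raidz_blame.c: toy model of raidz1 checksum-error attribution.
 * Not the real vdev_raidz.c logic; single XOR parity and a stand-in
 * checksum, just to show why two bad children defeat attribution. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NDATA  3                /* data children */
#define NCHILD (NDATA + 1)      /* plus one XOR parity child */
#define COLSZ  4                /* bytes each child contributes */

/* Stand-in for ZFS's real block checksum (fletcher4, sha256, ...). */
static uint32_t cksum(const uint8_t *d, size_t n)
{
    uint32_t c = 2166136261u;
    while (n--)
        c = (c ^ *d++) * 16777619u;
    return (c);
}

/* Copy the data columns, rebuilding column "bad" from parity. */
static void reconstruct(uint8_t col[NCHILD][COLSZ], int bad,
    uint8_t out[NDATA][COLSZ])
{
    memcpy(out, col, NDATA * COLSZ);
    if (bad < NDATA) {
        memcpy(out[bad], col[NDATA], COLSZ);
        for (int i = 0; i < NDATA; i++)
            if (i != bad)
                for (int b = 0; b < COLSZ; b++)
                    out[bad][b] ^= col[i][b];
    }
}

/* Return the one child whose replacement fixes the checksum, or -1
 * when no single child does (the raidz vdev gets the blame).  Only
 * called after the block's own checksum has already failed. */
static int attribute(uint8_t col[NCHILD][COLSZ], uint32_t good)
{
    uint8_t trial[NDATA][COLSZ];

    for (int bad = 0; bad < NCHILD; bad++) {
        reconstruct(col, bad, trial);
        if (cksum((uint8_t *)trial, sizeof (trial)) == good)
            return (bad);
    }
    return (-1);
}

int main(void)
{
    uint8_t col[NCHILD][COLSZ] = {
        { 1, 2, 3, 4 }, { 5, 6, 7, 8 }, { 9, 10, 11, 12 }, { 0 }
    };

    /* The parity child is the XOR of the data children. */
    for (int i = 0; i < NDATA; i++)
        for (int b = 0; b < COLSZ; b++)
            col[NDATA][b] ^= col[i][b];

    uint32_t good = cksum((uint8_t *)col, NDATA * COLSZ);

    col[1][0] ^= 0xff;          /* one child returns bad data */
    printf("one bad child:    blame child %d\n", attribute(col, good));

    col[2][0] ^= 0xff;          /* a second child goes bad too */
    printf("two bad children: %d (charge the raidz vdev)\n",
        attribute(col, good));
    return (0);
}

With one corrupted column the model names the guilty child; with a
second corrupted column attribute() returns -1, which is the situation
your CKSUM counts suggest.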
Subtle design bugs in ZFS can also lead to vdev_raidz.c being unable
to determine which child was responsible for a checksum error.
However, I've only seen that happen when a raidz vdev has a mirror
child. That can only happen if the child is a spare or replacing
vdev. Did you activate any spares, or did you manually replace a
vdev?
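
For reference, a spare or replacing vdev shows up in "zpool status" as
an extra mirror-like node under the raidz. Illustrative output (the
device names here are made up, not taken from your pool):

    raidz1-0           ONLINE
      label/vr2-d0     ONLINE
      replacing-1      ONLINE
        old-device     ONLINE
        gpt/vr2-d2c    ONLINE  (resilvering)

An activated hot spare looks the same except the interior node is
named "spare-N". If one of those was present while the errors were
accumulating, that's the case I mean.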
-Alan