Re: ZFS checksum error on 2 disks of mirror

From: <freebsd_at_vanderzwan.org>
Date: Sat, 14 Jan 2023 13:09:14 UTC
Hi Rich,
Looking back at the logs, I noticed the error was logged during the night we had a major power outage (a fire in a substation) very close by.
We had a dip of about 1 s when the outage started, which left the server needing a reboot.
The error was logged at about the time power was restored.
Maybe that caused another event in the grid where I live: not bad enough to make the server hang or reboot, but possibly enough to cause the checksum error.
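
The pattern in the quoted logs (both mirror members reporting a mismatch at the same offset and size) can be checked mechanically. Below is a minimal Python sketch, assuming the syslog line format shown in the quoted messages; the names group_mismatches and blocks_bad_on_all are made up for illustration and are not part of any ZFS tooling.

```python
import re
from collections import defaultdict

# Matches the "checksum mismatch" syslog lines as they appear in this thread.
# Assumption: other ZFS versions or syslog setups may format these differently.
MISMATCH_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+ZFS\[\d+\]:\s+checksum mismatch,\s+"
    r"zpool=(?P<pool>\S+)\s+path=(?P<path>\S+)\s+"
    r"offset=(?P<offset>\d+)\s+size=(?P<size>\d+)"
)

# Sample lines copied from the logs quoted in this thread.
SAMPLE = [
    "Dec 24 00:58:39 freebsd ZFS[40537]: pool I/O failure, zpool=backuppool error=97",
    "Dec 24 00:58:39 freebsd ZFS[40541]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJL4JYGp2 offset=1634427084800 size=53248",
    "Dec 24 00:58:39 freebsd ZFS[40545]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJKNA9Gp2 offset=1634427084800 size=53248",
    "Aug  8 03:17:56 freebsd ZFS[12332]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJL4JYGp2 offset=4056214130688 size=131072",
    "Aug  8 03:17:56 freebsd ZFS[12336]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJKNA9Gp2 offset=4056214130688 size=131072",
]


def group_mismatches(lines):
    """Map (pool, offset, size) to the set of device paths that reported it."""
    groups = defaultdict(set)
    for line in lines:
        m = MISMATCH_RE.match(line.strip())
        if m:  # lines such as "pool I/O failure" are skipped
            key = (m["pool"], int(m["offset"]), int(m["size"]))
            groups[key].add(m["path"])
    return dict(groups)


def blocks_bad_on_all(groups, n_devices=2):
    """Return the blocks reported by at least n_devices distinct devices."""
    return [key for key, paths in groups.items() if len(paths) >= n_devices]


if __name__ == "__main__":
    for pool, offset, size in blocks_bad_on_all(group_mismatches(SAMPLE)):
        print(f"{pool}: offset={offset} size={size} failed on every mirror member")
```

A block that shows up for every member of the mirror suggests the data was already wrong when it was written, or was damaged the same way on both disks, rather than one disk going bad on its own.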

	Paul




> On 14 Jan 2023, at 10:57, Rich <rincebrain@gmail.com> wrote:
> 
> If you haven't rebooted since it transpired, could you share zpool events -v? It can, depending on version, contain information like what checksum bits disagreed, which can help you figure out if it's a random bit flip before it was written, or a more systemic mangling error on both disks, or something stranger yet...
> 
> The lack of r/w/c errors is curious, though; I don't immediately know of any case that would do that outside of decryption problems...
> 
> - Rich
> 
> On Sat, Jan 14, 2023 at 3:30 AM <freebsd@vanderzwan.org> wrote:
> 
> 
> > On 13 Jan 2023, at 20:51, Rich <rincebrain@gmail.com> wrote:
> > 
> > Offhand, the easiest way I know of to get an IO error without a r/w/c error showing up in zpool status is for a decryption failure using ZFS native encryption, so if you are using that on the pool somewhere, I would suspect it tried and failed at decrypting something from both copies and logged it that way.
> > 
> > I wouldn't expect it to log it that way, but that's my best guess.
> 
> Hi
> 
> No native or geli encryption is in use on that system. So that can be ruled out.
> 
> Both disks in the vdev are the same brand/model/firmware, so maybe a firmware bug?
> Model Family:     Western Digital Red
> Device Model:     WDC WD80EFAX-68KNBN0
> Firmware Version: 81.00A81
> 
> BTW, no I/O or SATA errors were logged at the time of the ZFS errors, so it looks like a ZFS-specific error…
> 
>         Paul
> 
> > 
> > On Fri, Jan 13, 2023 at 10:35 AM <freebsd@vanderzwan.org> wrote:
> > Hi,
> > I noticed zpool status gave an error for one of my pools.
> > Looking back in the logs I found this:
> > 
> > Dec 24 00:58:39 freebsd ZFS[40537]: pool I/O failure, zpool=backuppool error=97
> > Dec 24 00:58:39 freebsd ZFS[40541]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJL4JYGp2 offset=1634427084800 size=53248
> > Dec 24 00:58:39 freebsd ZFS[40545]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJKNA9Gp2 offset=1634427084800 size=53248
> > 
> > These are 2 WD Red Plus 8TB drives (same age, same firmware, attached to same controller).
> > 
> > Looking back in the logs I found this occurred earlier without me noticing:
> > 
> > Aug  8 03:17:56 freebsd ZFS[12328]: pool I/O failure, zpool=backuppool error=97
> > Aug  8 03:17:56 freebsd ZFS[12332]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJL4JYGp2 offset=4056214130688 size=131072
> > Aug  8 03:17:56 freebsd ZFS[12336]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJKNA9Gp2 offset=4056214130688 size=131072
> > Aug  8 13:37:26 freebsd ZFS[22317]: pool I/O failure, zpool=backuppool error=97
> > Aug  8 13:37:26 freebsd ZFS[22321]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJKNA9Gp2 offset=4056214130688 size=131072
> > Aug  8 13:37:26 freebsd ZFS[22325]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJL4JYGp2 offset=4056214130688 size=131072
> > Aug  8 15:37:44 freebsd ZFS[24704]: pool I/O failure, zpool=backuppool error=97
> > Aug  8 15:37:44 freebsd ZFS[24708]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJL4JYGp2 offset=4056214130688 size=131072
> > Aug  8 15:37:44 freebsd ZFS[24712]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJKNA9Gp2 offset=4056214130688 size=131072
> > 
> > The output of zpool status -v shows no read/write/cksum errors but lists one file with an error.
> > 
> > After running a scrub on the pool all seems to be well, no more files with errors.
> > 
> > The system is homebuilt, with an ASRock Rack C2550 board and 16 GB of ECC RAM.
> > Any idea how I could get checksum errors on the identical block of 2 disks in a mirror?
> > 
> > Regards,
> > Paul
>