Re: ZFS checksum error on 2 disks of mirror

From: milky india <milkyindia_at_gmail.com>
Date: Sat, 14 Jan 2023 15:42:48 UTC
> Scrub is finding no errors so I think the pool and data should be healthy.


Yes, that's what I assumed as well, only to later discover it wasn't OK.

> Scrubbing all pools roughly every 4 weeks so I’ll notice if that changes.

I would probably do it sooner, and run a couple of scrubs across a couple of
reboots, just to be doubly sure. I hope nothing bad comes of it and you
have your peace of mind later.
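
In case it helps, this is a rough sketch of what I would run to double-check
(using the pool name backuppool from your logs; adjust to taste):

    # Start a scrub and wait for it to finish (-w needs a recent OpenZFS;
    # otherwise just poll zpool status until the scrub completes).
    zpool scrub -w backuppool
    # Error counters plus the list of any affected files.
    zpool status -v backuppool
    # Recent ZFS event history (checksum, I/O, etc.).
    zpool events -v | less
    # And whatever got logged, like the entries you posted.
    grep ZFS /var/log/messages

If you would rather not keep the 4-week schedule by hand, FreeBSD's
periodic(8) can do it: if I remember right, setting
daily_scrub_zfs_enable="YES" in /etc/periodic.conf scrubs each pool once the
configured threshold (35 days by default) has passed.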

PS: Sorry if it feels like I'm insisting, but I had a bad experience with
this bug.
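
PPS: One more thing that might help when searching: unless I'm misreading it,
the error=97 in those log lines is FreeBSD's errno EINTEGRITY ("Integrity
check failed"), which fits a checksum/integrity problem rather than a plain
device I/O error. You can sanity-check the mapping on your own box with:

    # On FreeBSD, errno 97 should be EINTEGRITY ("Integrity check failed").
    grep -w 97 /usr/include/sys/errno.h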

On Sat, Jan 14, 2023, 19:36 <freebsd@vanderzwan.org> wrote:

> Hi
>
>
> On 14 Jan 2023, at 16:29, milky india <milkyindia@gmail.com> wrote:
>
> > No panics on my system, it just kept running. And there is no way that I
> know of to reproduce it.
>
> Yes, not being able to reproduce issues is a huge problem.
> When the scrub was producing the error, do you remember the exact error
> message, or have it recorded?
>
>
> The scrub did not give any errors. zpool status -v showed one file with an
> error, but that was also gone after the scrub.
> So no evidence of any error remains except for what was logged in
> /var/log/messages.
>
> In this case it was a metadata-level corruption error that led to
> https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A/ which seemed like
> a dead end; in your case, at least ensure things are backed up in case
> the issue arises later.
>
>
> Scrub is finding no errors so I think the pool and data should be healthy.
>
> Scrubbing all pools roughly every 4 weeks so I’ll notice if that changes.
>
> Paul
>
> On Sat, Jan 14, 2023, 19:13 <freebsd@vanderzwan.org> wrote:
>
>>
>>
>> On 14 Jan 2023, at 15:57, milky india <milkyindia@gmail.com> wrote:
>>
>> > Output of zpool status -v gives no read/write/cksum errors but lists
>> one file with an error.
>> I had faced a similar issue: when I tried to delete the file, the error
>> still persisted, although I only realised it after a few shutdown cycles.
>>
>>
>> For me, after a scrub there was no more mention of a file with an error, so
>> I assume the error was transient.
>>
>>
>> > After running a scrub on the pool all seems to be well, no more files
>> with errors.
>> Please monitor whether the error shows up again sometime soon. I don't
>> know what the issue is, but ZFS error no. 97 seems like a serious bug.
>>
>> Definitely keeping a close eye on this.
>>
>> Is this a similar issue for which PR is open?
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=268333
>>
>>
>> No panics on my system, it just kept running. And there is no way that I
>> know of to reproduce it.
>>
>> At the moment I suspect it was the power grid issue we had the night
>> that error was logged.
>> A large part of the city where I live had an outage after a fire in a
>> substation.
>> I only had a dip for about 1 second when it happened, but this server did
>> need a reboot as it was unresponsive.
>>
>> The time of the error roughly matches the time they started restoring
>> power to the affected parts of the city.
>> Maybe that created another event on the grid.
>>
>> The server is not behind a UPS, as the power grid is usually very reliable
>> here in the Netherlands.
>>
>> Paul
>>
>>
>>
>> On Fri, Jan 13, 2023, 19:35 <freebsd@vanderzwan.org> wrote:
>>
>>> Hi,
>>> I noticed zpool status gave an error for one of my pools.
>>> Looking back in the logs I found this:
>>>
>>> Dec 24 00:58:39 freebsd ZFS[40537]: pool I/O failure, zpool=backuppool
>>> error=97
>>> Dec 24 00:58:39 freebsd ZFS[40541]: checksum mismatch, zpool=backuppool
>>> path=/dev/gpt/VGJL4JYGp2 offset=1634427084800 size=53248
>>> Dec 24 00:58:39 freebsd ZFS[40545]: checksum mismatch, zpool=backuppool
>>> path=/dev/gpt/VGJKNA9Gp2 offset=1634427084800 size=53248
>>>
>>> These are 2 WD Red Plus 8TB drives (same age, same firmware, attached to
>>> the same controller).
>>>
>>> Looking back in the logs I found this occurred earlier without me
>>> noticing:
>>>
>>> Aug  8 03:17:56 freebsd ZFS[12328]: pool I/O failure, zpool=backuppool
>>> error=97
>>> Aug  8 03:17:56 freebsd ZFS[12332]: checksum mismatch, zpool=backuppool
>>> path=/dev/gpt/VGJL4JYGp2 offset=4056214130688 size=131072
>>> Aug  8 03:17:56 freebsd ZFS[12336]: checksum mismatch, zpool=backuppool
>>> path=/dev/gpt/VGJKNA9Gp2 offset=4056214130688 size=131072
>>> Aug  8 13:37:26 freebsd ZFS[22317]: pool I/O failure, zpool=backuppool
>>> error=97
>>> Aug  8 13:37:26 freebsd ZFS[22321]: checksum mismatch, zpool=backuppool
>>> path=/dev/gpt/VGJKNA9Gp2 offset=4056214130688 size=131072
>>> Aug  8 13:37:26 freebsd ZFS[22325]: checksum mismatch, zpool=backuppool
>>> path=/dev/gpt/VGJL4JYGp2 offset=4056214130688 size=131072
>>> Aug  8 15:37:44 freebsd ZFS[24704]: pool I/O failure, zpool=backuppool
>>> error=97
>>> Aug  8 15:37:44 freebsd ZFS[24708]: checksum mismatch, zpool=backuppool
>>> path=/dev/gpt/VGJL4JYGp2 offset=4056214130688 size=131072
>>> Aug  8 15:37:44 freebsd ZFS[24712]: checksum mismatch, zpool=backuppool
>>> path=/dev/gpt/VGJKNA9Gp2 offset=4056214130688 size=131072
>>>
>>> Output of zpool status -v gives no read/write/cksum errors but lists
>>> one file with an error.
>>>
>>> After running a scrub on the pool all seems to be well, no more files
>>> with errors.
>>>
>>> The system is home-built with an ASRock Rack C2550 board and 16 GB of ECC RAM.
>>> Any idea how I could get checksum errors on the identical block of 2
>>> disks in a mirror?
>>>
>>> Regards,
>>> Paul
>>>
>>
>>
>