From: freebsd@vanderzwan.org
To: milky india
Cc: freebsd-fs
Subject: Re: ZFS checksum error on 2 disks of mirror
Date: Sat, 14 Jan 2023 16:36:22 +0100
Message-Id: <17D411EE-9815-44FC-A135-68EBB53B2D50@vanderzwan.org>
List-Id: Filesystems
List-Archive: https://lists.freebsd.org/archives/freebsd-fs

Hi


> On 14 Jan 2023, at 16:29, milky india <milkyindia@gmail.com> wrote:
>
> > No panics on my system, it just kept running. And there is no way that I know of to reproduce it.
>
> Yes, not being able to reproduce issues is a huge problem.
> When the scrub was producing the error, do you remember the exact error message, or do you have it recorded?


The scrub did not give any errors. zpool status -v had shown one file with an error, but that was also gone after the scrub.
So no evidence of any error remains, except for what was logged in /var/log/messages.
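
For reference, the check above amounts to the standard commands; a minimal sketch, using the pool name from the log messages quoted further down:

    zpool status -v backuppool    # error counters per device, plus any files flagged as damaged
    zpool scrub backuppool        # re-read and verify every allocated block in the pool
    zpool status backuppool       # watch scrub progress and see the final result
    zpool clear backuppool        # optional: reset the error counters once things look clean

On recent OpenZFS, zpool events -v should also show the underlying checksum events with vdev, offset and size, similar to what ended up in /var/log/messages.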

> In this case it was a metadata-level corruption error that led to https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A/, which seemed like a dead end, or in your case at least a reason to ensure things are backed up in case the issue arises later.


The scrub is finding no errors, so I think the pool and its data should be healthy.

I scrub all pools roughly every 4 weeks, so I'll notice if that changes.
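
In case it is useful: on FreeBSD the regular scrub can be driven by periodic(8) rather than by hand. A minimal sketch for /etc/periodic.conf, assuming the stock daily 800.scrub-zfs script, where the threshold is the number of days since the last completed scrub:

    # /etc/periodic.conf
    daily_scrub_zfs_enable="YES"              # let the daily periodic run start scrubs
    daily_scrub_zfs_default_threshold="28"    # scrub each pool roughly every 4 weeks
    #daily_scrub_zfs_pools="backuppool"       # optional: limit scrubbing to specific pools

The script only starts a scrub when the previous one is older than the threshold, so it is safe to leave enabled.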

Paul
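
PS: about the "zfs error no 97" you mention below: if I read it correctly that is just an errno, and on FreeBSD errno 97 is EINTEGRITY ("integrity check failed"), which is how a checksum mismatch gets reported here. So it lines up with the checksum events rather than pointing at a separate bug. A quick way to check on a FreeBSD box:

    grep -w EINTEGRITY /usr/include/sys/errno.h
    # should print something like: #define EINTEGRITY 97 /* Integrity check failed */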

> Ultimately if it's zfs
> On Sat, Jan 14, 2023, 19:13 <freebsd@vanderzwan.org> wrote:


>>> On 14 Jan 2023, at 15:57, milky india <milkyindia@gmail.com> wrote:
>>>
>>> > Output of zpool status -v gives no read/write/cksum errors but lists one file with an error.
>>> I had faced a similar issue: when I tried to delete the file the error still persisted, although I only realised it after a few shutdown cycles.
>>
>> For me, after a scrub there was no more mention of a file with an error, so I assume the error was transient.


>>> > After running a scrub on the pool all seems to be well, no more files with errors.
>>> Please monitor whether the error shows up again sometime soon. While I don't know what the issue is, ZFS error no 97 seems like a serious bug.
>>>
>> Definitely keeping a close eye on this.

>>> Is this a similar issue to the one for which this PR is open? https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=268333


>> No panics on my system, it just kept running. And there is no way that I know of to reproduce it.
>>
>> At the moment I suspect it was the power grid issue we had the night that error was logged.
>> A large part of the city where I live had an outage after a fire in a substation.
>> I only had a dip of about 1 second when it happened, but this server did need a reboot as it was unresponsive.
>>
>> The time of the error roughly matches the time they started restoring power to the affected parts of the city.
>> Maybe that created another event on the grid.
>>
>> The server is not behind a UPS, as the power grid is usually very reliable here in the Netherlands.

>> Paul

 
>>> On Fri, Jan 13, 2023, 19:35 <freebsd@vanderzwan.org> wrote:
>>>> Hi,
>>>> I noticed zpool status gave an error for one of my pools.
>>>> Looking back in the logs I found this:

>>>> Dec 24 00:58:39 freebsd ZFS[40537]: pool I/O failure, zpool=backuppool error=97
>>>> Dec 24 00:58:39 freebsd ZFS[40541]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJL4JYGp2 offset=1634427084800 size=53248
>>>> Dec 24 00:58:39 freebsd ZFS[40545]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJKNA9Gp2 offset=1634427084800 size=53248
>>>>
>>>> These are 2 WD Red Plus 8TB drives (same age, same firmware, attached to the same controller).

>>>> Looking back in the logs I found this occurred earlier without me noticing:
>>>>
>>>> Aug  8 03:17:56 freebsd ZFS[12328]: pool I/O failure, zpool=backuppool error=97
>>>> Aug  8 03:17:56 freebsd ZFS[12332]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJL4JYGp2 offset=4056214130688 size=131072
>>>> Aug  8 03:17:56 freebsd ZFS[12336]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJKNA9Gp2 offset=4056214130688 size=131072
>>>> Aug  8 13:37:26 freebsd ZFS[22317]: pool I/O failure, zpool=backuppool error=97
>>>> Aug  8 13:37:26 freebsd ZFS[22321]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJKNA9Gp2 offset=4056214130688 size=131072
>>>> Aug  8 13:37:26 freebsd ZFS[22325]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJL4JYGp2 offset=4056214130688 size=131072
>>>> Aug  8 15:37:44 freebsd ZFS[24704]: pool I/O failure, zpool=backuppool error=97
>>>> Aug  8 15:37:44 freebsd ZFS[24708]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJL4JYGp2 offset=4056214130688 size=131072
>>>> Aug  8 15:37:44 freebsd ZFS[24712]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJKNA9Gp2 offset=4056214130688 size=131072
>>>>
>>>> Output of zpool status -v gives no read/write/cksum errors but lists one file with an error.
>>>>
>>>> After running a scrub on the pool all seems to be well, no more files with errors.
>>>>
>>>> The system is homebuilt, with an Asrock Rack C2550 board and 16 GB of ECC RAM.
>>>> Any idea how I could get checksum errors on the identical block of 2 disks in a mirror?

>>>> Regards,
>>>> Paul

