From: Ronald Klop
To: Pamela Ballantyne
Cc: freebsd-fs@freebsd.org
Date: Tue, 20 Aug 2024 10:10:13 +0200 (CEST)
Subject: Re: ZFS: Suspended Pool due to allegedly uncorrectable I/O error
Message-ID: <146942683.2571.1724141413478@localhost>
List-Archive: https://lists.freebsd.org/archives/freebsd-fs

Hi,

This happens on my Raspberry Pi when USB senses a disconnect of the disk.
A message about the USB event does show up on the serial console, but the
repetition of other messages pushes it off screen quickly. And because the
disk is unavailable, nothing was written locally. Remote logging could help.

I just did ‘zpool set failmode=panic zroot’ to at least have a workable
system if it happens. Logging in via serial/IPMI doesn’t work either,
because you don’t have any executable available to run.

Regards,
Ronald.

From: Pamela Ballantyne
Date: 19 August 2024 23:20
To: freebsd-fs@freebsd.org
Subject: ZFS: Suspended Pool due to allegedly uncorrectable I/O error

> Hi,
>
> So, this is long, so here's the TL;DR: ZFS suspended a pool for presumably
> good reasons, but on reboot, there didn't seem to be any good reason for it.
>
> As background, I'm an early adopter of ZFS. I have had a remote server
> running ZFS continuously since late 2010, 24x7. I also use ZFS on my home
> machines. While I do not claim to be a ZFS expert, I've managed to handle
> the various issues that have come up over the years and haven't had to ask
> for help from the experts.
>
> But now I am completely baffled and would appreciate any help, advice,
> pointers, links, whatever.
>
> On Sunday morning, 08/11, I upgraded the server from 12.4-RELEASE-p9 to
> 13.3-RELEASE-p5.
> The upgrade went smoothly; there was no problem, and the server worked
> flawlessly post-upgrade.
>
> On Thursday evening, 8/15, the server became unreachable. It would still
> respond to pings via the IP address, but that was it. I used to be able to
> access the server via IPMI, but that ability disappeared several company
> mergers ago. The current NOC staff sent me a screenshot of the server
> output, which showed repeated messages saying:
>
> "Solaris: WARNING: Pool 'zroot' has encountered an uncorrectable I/O
> failure and has been suspended."
>
> There had been no warnings in the log files, nothing. There was no sign
> from the S.M.A.R.T. monitoring system, nothing.
>
> It's a simple mirrored setup with just two drives. So I expected a
> catastrophic hardware failure. Maybe the HBA had failed (this is on a
> SuperMicro Blade server), or both drives had managed to die at the same
> time.
>
> Without any way to log in remotely, I requested a reboot. The server
> rebooted without errors. I could ssh into my account and poke around.
> Everything was normal. There were no log entries related to the crash. I
> realize post-crash there would be no filesystem to write to, but there was
> still nothing leading up to it - no hardware or disk-related messages of
> any kind. The only sign of any problem I could find was 2 checksum errors
> listed on only one of the drives in the mirror when I did zpool status.
>
> I ran a scrub, which completed without any problem or error. About 30
> minutes after the scrub, the two checksum errors disappeared without
> manually clearing them. I've run some drive tests and they both pass with
> flying colors. And it's now been a few days and the system has been
> performing flawlessly.
>
> So, I am completely flummoxed. I am trying to understand why the pool was
> suspended when it looks like something ZFS should have easily handled.
> I've had complete drive failures, and ZFS just kept on going.
> Is there any bug or incompatibility in 13.3-p5? Is this something that
> will recur on each full moon?
>
> So thanks in advance for any advice, shared experiences, or whatever you
> can offer.
>
> Best,
> Pammy
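
For anyone who wants to watch for the kind of per-device checksum errors mentioned above, here is a minimal Python sketch that scans `zpool status` text for devices with non-zero READ/WRITE/CKSUM counters. The sample output and the device names (`ada0p3`, `ada1p3`) are made up for illustration and are not taken from the server in this thread. The function name `devices_with_errors` is hypothetical, and the parser assumes the usual `NAME STATE READ WRITE CKSUM` column layout.

```python
def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter.

    Assumes the usual `zpool status` layout: a header row starting with
    NAME and ending with CKSUM, one row per vdev, then a blank line.
    Counters are kept as strings because zpool may abbreviate large
    values (for example "1.2K").
    """
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not in_config:
            # The device table starts right after the column header.
            in_config = fields[:1] == ["NAME"] and fields[-1:] == ["CKSUM"]
            continue
        if not fields:
            break  # a blank line ends the device table
        if len(fields) < 5:
            continue  # e.g. a row that carries no error counters
        name, _state, read, write, cksum = fields[:5]
        if (read, write, cksum) != ("0", "0", "0"):
            flagged[name] = (read, write, cksum)
    return flagged


# Hypothetical sample, not real output from the server discussed here.
SAMPLE = """\
  pool: zroot
 state: ONLINE
config:

\tNAME        STATE     READ WRITE CKSUM
\tzroot       ONLINE       0     0     0
\t  mirror-0  ONLINE       0     0     0
\t    ada0p3  ONLINE       0     0     2
\t    ada1p3  ONLINE       0     0     0

errors: No known data errors
"""

print(devices_with_errors(SAMPLE))
```

In practice the text would come from running `zpool status zroot`, and the result could be forwarded to a remote log host so that it survives a suspended pool. That wiring is left out here because it depends on the local setup.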
