Date: Tue, 17 Sep 2024 12:16:20 +0100
From: Frank Leonhardt
To: questions
Subject: Re: Zpool status -- why does a suboptimal pool show as "ONLINE"?
List-Archive: https://lists.freebsd.org/archives/freebsd-questions
In-Reply-To: <312af967-e5bf-4e83-b48b-7c2841719373@app.fastmail.com>
Message-ID: <0290d22f5be2eb0b324254b663076924@fjl.co.uk>

On 2024-09-12 14:29, Dave Cottlehuber wrote:
> On Thu, 12 Sep 2024, at 13:05, Dan Mahoney (Ports) wrote:
>> Hey there all,
>>
>> I have a nagios check that assumes that if I have a suboptimal zfs
>> zpool, that the word “DEGRADED” will appear in the output. One disk of
>> a two-disk mirror seems to have faulted, but the pool still shows as
>> “ONLINE”. I know I’ve seen the word “DEGRADED” in the past. What’s
>> different?
>>
>>   pool: zroot
>>  state: ONLINE
>> status: One or more devices are faulted in response to persistent errors.
>>         Sufficient replicas exist for the pool to continue functioning in a
>>         degraded state.
>> action: Replace the faulted device, or use 'zpool clear' to mark the device
>>         repaired.
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         zroot       ONLINE       0     0     0
>>           mirror-0  ONLINE       0     0     0
>>             ada0p3  FAULTED      4   372     0  too many errors
>>             ada1p3  ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>> 14.1, if it matters, the disks are two innolite SATADOM’s.
>
> Hi Dan
>
> I agree that I would expect the mirror-0 at least to report DEGRADED
> or similar. Hopefully one of the zfs people clarifies the logic here.
>
> Practically, what I do is run:
>
> zpool status | grep -v 'with 0 errors' | sha256
>
> and check that this hash remains the same over time. It's obviously
> different for each pool. Could that help for nagios?

I agree. A faulted drive always used to appear as "FAULTED", and the
vdev and pool should both have been tagged "DEGRADED" (cascading
upwards). A faulted drive isn't necessarily taken offline, although "too
many errors" suggests it should be.

If this isn't a bug I'd like to know the reason why.

Regards, Frank.
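
On the monitoring side: since the pool-level state evidently cannot be
relied on in this case, one possible stopgap is to look at the STATE
column of every entry in the config section instead of grepping for
"DEGRADED". What follows is only a minimal sketch, not a tested Nagios
plugin. The function names zpool_bad_states and check_zpool are made up
for illustration. It assumes the `zpool status` layout quoted above (a
"config:" section that ends at "errors:") and assumes that ONLINE, plus
AVAIL for an idle hot spare, are the only healthy states.

```shell
# Hypothetical helper: read `zpool status` text on stdin and print
# "pool device state" for each config entry that is not healthy.
zpool_bad_states() {
    awk '
        $1 == "pool:"   { pool = $2; in_config = 0; next }
        $1 == "config:" { in_config = 1; next }
        $1 == "errors:" { in_config = 0; next }
        # Inside the config section: skip blank lines, the NAME header and
        # one-word group headings (NF < 2); report anything whose STATE
        # column is neither ONLINE nor AVAIL (assumed healthy states).
        in_config && NF >= 2 && $1 != "NAME" && $2 != "ONLINE" && $2 != "AVAIL" {
            print pool, $1, $2
        }
    '
}

# Hypothetical wrapper using the usual Nagios exit codes:
# 0 = OK, 2 = CRITICAL. Also reads `zpool status` text on stdin.
check_zpool() {
    bad=$(zpool_bad_states)
    if [ -n "$bad" ]; then
        # Join multiple findings onto one line for the Nagios summary.
        printf 'CRITICAL: %s\n' "$(printf '%s' "$bad" | tr '\n' ',')"
        return 2
    fi
    printf 'OK: all pool devices ONLINE\n'
    return 0
}
```

It would be used as `zpool status | check_zpool`, with the return code
passed back to Nagios. Reading from stdin rather than calling zpool
directly keeps the parsing testable against saved output such as the
status text quoted above.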