Re: ZFS: Rescue FAULTED Pool

From: A FreeBSD User <freebsd_at_walstatt-de.de>
Date: Sat, 01 Feb 2025 08:57:15 UTC
On Thu, 30 Jan 2025 16:13:56 -0500,
Allan Jude <allanjude@freebsd.org> wrote:

> On 1/30/2025 6:35 AM, A FreeBSD User wrote:
> > On Wed, 29 Jan 2025 03:45:25 -0800,
> > David Wolfskill <david@catwhisker.org> wrote:
> > 
> > Hello, thanks for responding.
> >   
> >> On Wed, Jan 29, 2025 at 11:27:01AM +0100, FreeBSD User wrote:  
> >>> Hello,
> >>>
> >>> a ZFS pool (RAIDZ1) has faulted. The pool is not importable
> >>> anymore, neither with import -F nor -f.
> >>> Although this pool is on an experimental system (no backup available),
> >>> it contains some data that would take a while to reconstruct, so I'd
> >>> like to ask whether there is a way to try to "de-fault" such a pool.  
> >>
> >> Well, 'zpool clear ...' "Clears device errors in a pool." (from "man
> >> zpool").
> >>
> >> It is, however, not magic -- it doesn't actually fix anything.  
> > 
> > For the record: I tried EVERY method available via network searches that is useful to
> > common "administrators", but hoped people are able to manipulate deeper structures via zdb ...
> >   
> >>
> >> (I had an issue with a zpool which had a single SSD device as a ZIL; the
> >> ZIL device failed after it had accepted some data to be written to the
> >> pool, but before the data could be read and transferred to the spinning
> >> disks.  ZFS was quite unhappy about that.  I was eventually able to copy
> >> the data elsewhere, destroy the old zpool, recreate it *without* that
> >> single point of failure, then copy the data back.  And I learned to
> >> never create a zpool with a *single* device as a separate ZIL.)  
> > 
> > Well, in this case I do not use dedicated ZIL drives. I also made several experiments with
> > "single" ZIL drive setups, but a dedicated ZIL is mostly useful in cases where you have a
> > graveyard full of inertia-suffering, mass-spinning HDDs - if I'm right, an SSD-based ZIL
> > would be of no use/effect on an all-SSD pool like this one. So I omitted those.
> >   
> >>  
> >>> The pool comprises 7 drives in a RAIDZ1; one of the SSDs
> >>> faulted, but I pulled the wrong one, so the pool ran into a suspended
> >>> state.  
> >>
> >> Can you put the drive you pulled back in?  
> > 
> > Every single SSD originally plugged in is now back in place, even the faulted one (which
> > doesn't report any faults at the moment).
> > 
> > Although the pool isn't "importable", zdb reports its existence, alongside zroot (which
> > resides on a dedicated drive).
> >   
> >>  
> >>> The host is running the latest XigmaNAS BETA, which is effectively
> >>> FreeBSD 14.1-p2, just for the record.
> >>>
> >>> I do not want to give up, since I hoped there might be a rude but
> >>> effective way to restore the pool, even at the cost of some data loss ...
> >>>
> >>> Thanks in advance,
> >>>
> >>> Oliver
> >>> ....  
> >>
> >> Good luck!
> >>
> >> Peace,
> >> david  
> > 
> > 
> > Well, this is a hard and painful lesson to learn, if there is no chance to get the
> > pool back.
> > 
> > A warning (though this is probably superfluous in the realm of professionals): I used a
> > bunch of cheap spot-market SATA SSDs of a brand called "Intenso", also common here in good
> > old Germany. Some of those SSDs do have a working activity LED when used with a Fujitsu SAS
> > HBA controller - but those died very quickly, suffering from bus errors. Another batch of
> > those SSDs does not have a working LED (no blinking on access), but lasted a bit longer. The
> > problem with those SSDs is that I cannot easily locate the failing device by hammering the
> > suspected drive with massive writes via dd and watching for activity.
> > I also ordered alternative SSDs from a more expensive brand - but bad karma ...
> > 
> > Oliver
> > 
> >   
> 
> The most useful thing to share right now would be the output of `zpool 
> import` (with no pool name) on the rebooted system.
> 
> That will show where the issues are, and suggest how they might be solved.
> 

Hello, this is exactly what happens when trying to import the pool. Prior to the loss, device
da1p1 had been faulted, with non-zero numbers in the "corrupted data" column; those are not
shown anymore.


 ~# zpool import
   pool: BUNKER00
     id: XXXXXXXXXXXXXXXXXXXX
  state: FAULTED
status: The pool metadata is corrupted.
 action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-72
 config:

        BUNKER00    FAULTED  corrupted data
          raidz1-0  ONLINE
            da2p1   ONLINE
            da3p1   ONLINE
            da4p1   ONLINE
            da7p1   ONLINE
            da6p1   ONLINE
            da1p1   ONLINE
            da5p1   ONLINE


 ~# zpool import -f BUNKER00
cannot import 'BUNKER00': I/O error
        Destroy and re-create the pool from
        a backup source.


~# zpool import -F BUNKER00
cannot import 'BUNKER00': one or more devices is currently unavailable
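
For completeness, the directions I still see open (a sketch, not a recipe - the rewind
flags are documented in zpool-import(8) and zdb(8), and whether they can help against
corrupted pool metadata is exactly my question): a read-only import combined with the
extreme-rewind recovery mode, and inspecting the labels/uberblocks of the member devices
with zdb, the "deeper stuff" I mentioned above:

 ~# zpool import -o readonly=on -fFX BUNKER00    (rewind to an older txg, read-only)
 ~# zdb -lu /dev/da1p1                           (dump a member's ZFS labels and uberblocks)
 ~# zdb -e BUNKER00                              (examine the pool without importing it)

If even the -FX rewind fails, my understanding is that only manual uberblock/txg-level
surgery remains - and that is the part where I hoped somebody here knows more.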

-- 

A FreeBSD user