Re: Unable to replace drive in raidz1

From: Alan Somers <asomers_at_freebsd.org>
Date: Fri, 06 Sep 2024 17:02:14 UTC
On Fri, Sep 6, 2024 at 10:51 AM Chris Ross <cross+freebsd@distal.com> wrote:
>
>
>
> > On Sep 6, 2024, at 11:32, Alan Somers <asomers@freebsd.org> wrote:
> >
> > "zpool replace" is indeed the correct command.  There's no need to run
> > "zpool offline" first, and "zpool remove" is wrong.  Since "zpool
> > replace" is still failing, are you sure that da10 is still the correct
> > device name after all disks got renumbered?  If you're sure, then you
> > might run "zdb -l /dev/da10" to see what ZFS thinks is on that disk.
> >
>
> I can confirm that da10 is still the new disk I put in place of the prior da3.
>
>
> > On Sep 6, 2024, at 11:43, mike tancsa <mike@sentex.net> wrote:
> > I would triple check which devices are actually part of the pool.  I wish there was a way to tell zfs to display only one naming scheme or the other.  So list out what diskid/DISK-K1GMBN9D, diskid/DISK-K1GMEDMD... to diskid/DISK-3WJ7ZMMJ actually are in terms of /dev/da*.  I have some controllers that will re-order the disks on every reboot.  glabel status and camcontrol devlist should help verify.
>
>
> camcontrol devlist lets me know that the three HGST drives making up
> raidz1-1 are da3, da4, and da5, and the three WD drives making up
> raidz1-2 are da6, da7, and da8.  So, like before, they just moved down a
> number because the prior da3 went away and the new disk in that
> physical slot became da10.  (da9 is a loose JBOD single with ufs
> on it, previously da10, in slot 12 of 12.)
>
> da10 is in fact still the disk in slot 3 of the chassis; zdb -l shows
> the output below.  I did add and remove it as a spare while trying things,
> which may be why it shows up this way.
>
>              - Chris
>
> % sudo zdb -l /dev/da10
> ------------------------------------
> LABEL 0
> ------------------------------------
>     version: 5000
>     name: 'tank'
>     state: 0
>     txg: 0
>     pool_guid: 3456317866677065800
>     errata: 0
>     hostid: 2747523522
>     hostname: 'frizzen02.devit.ciscolabs.com'
>     top_guid: 2495145666029787532
>     guid: 2495145666029787532
>     vdev_children: 3
>     vdev_tree:
>         type: 'disk'
>         id: 0
>         guid: 2495145666029787532
>         path: '/dev/da10'
>         phys_path: 'id1,enc@n584b2612f2c321bd/type@0/slot@3/elmdesc@ArrayDevice03'
>         whole_disk: 1
>         metaslab_array: 0
>         metaslab_shift: 0
>         ashift: 12
>         asize: 22000965255168
>         is_log: 0
>         create_txg: 18008413
>     features_for_read:
>         com.delphix:hole_birth
>         com.delphix:embedded_data
>     create_txg: 18008413
>     labels = 0 1 2 3

This looks like you got into a split-brain situation where the disks
have inconsistent labels.  Most disks think that da10 is not a member
of the pool, but da10 thinks that it is.  Perhaps you added it as a
spare, then physically removed it, and then did a "zpool remove" to
remove the spare from the configuration?  If you're very very very
sure that there is no data on da10 that you care about, you can do
"zpool labelclear -f /dev/da10"