Re: Unable to replace drive in raidz1

From: Wes Morgan <morganw_at_gmail.com>
Date: Fri, 06 Sep 2024 18:39:24 UTC

On September 6, 2024 1:21:06 PM CDT, Chris Ross <cross+freebsd@distal.com> wrote:
>
>> On Sep 6, 2024, at 14:08, mike tancsa <mike@sentex.net> wrote:
>> 
>> On 9/6/2024 2:06 PM, Chris Ross wrote:
>>> How can I map the diskids listed to the underlying device? Either by serial number or da#…
>>> 
>> What does 
>> glabel status
>
>That shows labels for many drives, though not da1 and da2, the
>remaining members of raidz1-0.  Interesting.  But, I hope the below
>is the fix...
>
>> On Sep 6, 2024, at 14:10, Alan Somers <asomers@freebsd.org> wrote:
>> 
>> Ahh, this means that there are two different vdevs that can be
>> described by "da3".  You can still refer to them unambiguously by guid
>> though.  Do "zpool status -g" to find the guid of the disk that you
>> want to replace, and then do "zpool replace <GUID> /dev/da10"
>
>
>Ahh, okay.  That makes sense.  I have only ever known how to replace things
>using the key that "zpool status" shows.  Thanks for that!
>
>Oh.  Trying, that doesn’t work either.  :-/
>
>NAME                      STATE     READ WRITE CKSUM
>tank                      DEGRADED     0     0     0
> 16506780107187041124    DEGRADED     0     0     0
>   9127016430593660128   FAULTED      0     0     0  external device fault
>   4094297345166589692   ONLINE       0     0     0
>   17850258180603290288  ONLINE       0     0     0
>[…]
>% sudo zpool replace tank 9127016430593660128 /dev/da10
>cannot replace 9127016430593660128 with /dev/da10: already in replacing/spare config; wait for completion or use 'zpool detach'
>% sudo zpool replace tank 9127016430593660128 diskid/DISK-ZGG0A2PA
>cannot replace 9127016430593660128 with diskid/DISK-ZGG0A2PA: already in replacing/spare config; wait for completion or use 'zpool detach'
>
>Tried with /dev/da10, and the diskid for da10 reported by glabel status.
>
>         - Chris
>

You should make the changes to your /boot/loader.conf that Freddie Cash suggested earlier, then reboot. That will eliminate all the confusion with diskid. Then run "zpool clear"; if da3 is still online and not completely dead, the pool should come out of the faulted state. Check "zpool status" for this alleged replacement in progress. If it is truly trying to replace a device, the replacement should show up in zpool status with the actual device name, or with the guid if it can't find the device.
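
For reference, a rough sketch of what that could look like. The earlier mail is not quoted here, so the two tunables below are my assumption about what was suggested (they are the usual GEOM label knobs for this); check the earlier message before copying them.

```
# /boot/loader.conf
# Assumed settings, not quoted from the earlier mail.
# Stop GEOM from creating /dev/diskid/* labels:
kern.geom.label.disk_ident.enable="0"
# Stop GEOM from creating /dev/gptid/* labels:
kern.geom.label.gptid.enable="0"
```

After the reboot, "zpool clear tank" followed by "zpool status tank" would be the next two commands, using the pool name from your output.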

If you have initiated a replace and the replacing disk has since been "lost" or unlabeled, you are in a bind. I ran into this problem many years ago, and I thought it had been fixed; the bug was called something like "can't replace a replacing vdev". I ultimately solved my problem by manually editing a fake vdev to have the same guid as the missing device, restarting the replace, and then canceling it before zfs realized it was fake. But I am almost certain that "zpool detach" can cancel the replace now, given the guid.

If da10 has a label that says it is in the pool, it is probably the "replacing" vdev and should be picked up...
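
If it helps, here is a small sketch for pulling the guid of a faulted vdev out of "zpool status -g" output, so it can be handed to zpool detach (the command your error message points at). The function name faulted_guids is made up for illustration, and the sample text is just the status output quoted above. The zpool commands themselves are left as comments because which guid to detach depends on what the status output shows after the reboot.

```shell
#!/bin/sh
# Print the guid of every FAULTED vdev found in `zpool status -g`
# output read from stdin. faulted_guids is a hypothetical helper name.
faulted_guids() {
    # Column 1 is the guid (all digits), column 2 is the state.
    awk '$2 == "FAULTED" && $1 ~ /^[0-9]+$/ { print $1 }'
}

# Sample input: the status output quoted earlier in this thread.
sample_status='NAME                      STATE     READ WRITE CKSUM
tank                      DEGRADED     0     0     0
 16506780107187041124    DEGRADED     0     0     0
   9127016430593660128   FAULTED      0     0     0  external device fault
   4094297345166589692   ONLINE       0     0     0
   17850258180603290288  ONLINE       0     0     0'

printf '%s\n' "$sample_status" | faulted_guids

# On the real system (not run here):
#   zpool status -g tank | faulted_guids
#   zpool detach tank <guid-of-the-stale-replacing-member>
```

Run against the sample, this prints only the guid of the faulted member. If a replace really is in progress, zpool status should show a replacing vdev with two children, and the one to detach is whichever child you want to drop.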