Re: Unable to replace drive in raidz1

From: Wes Morgan <morganw_at_gmail.com>
Date: Fri, 06 Sep 2024 19:16:39 UTC

On September 6, 2024 1:54:32 PM CDT, Chris Ross <cross+freebsd@distal.com> wrote:
>
>
>> On Sep 6, 2024, at 14:39, Wes Morgan <morganw@gmail.com> wrote:
>> 
>> 
>> You should make the changes to your /boot/loader.conf as suggested earlier by Freddie Cash and reboot. This will eliminate all the confusion with diskid. Then run "zpool clear"; if da3 is still online and not completely dead, the pool should come out of the faulted state. Check zpool status to look for this alleged replacement in progress. If it is truly trying to replace a device, it should show up in zpool status with the actual device, or the guid if it can't find the device.
>
>I saw and appreciated that response, but didn’t respond on that thread because I don’t _want_ to turn all of those things off.  At least, I don’t want to refer to everything by the auto-numbered da# that I think that will cause.  And, Freddie, your comment about GPT partition labels I think doesn’t apply because I don’t have GPT on my disks.  Just all one big ZFS device.  This is why I’m looking at glabel’s generic labeling now.

You probably don't want that. You will have to use the glabel dev, which will not be the same size as your other devices (glabel keeps its metadata in the provider's last sector, so the labeled device is one sector smaller). IIRC you have no control over what device node the system finds first for the pool. Even if you use GPT labels, the daXpY device will still exist.

>The former da3 is off-line, out of the chassis.  I replaced a disk in a full chassis, having them both online at the same time is not possible.  That drive in ZFS’s mind is only faulted because I tried “zpool offline -f” on it to see if that helped.

It sounds like you have replaced the wrong device. Check the "zpool history" to see what you did. 

In your earlier message, three devices were shown in each raidz, when what you should be seeing is that one raidz has an offline device identified by guid and maybe "was /dev/da3" that is being replaced, along with the replacement device. I don't see any of that. 

>> If you have initiated a replace, and the replacing disk has now been "lost" or unlabeled, you are in a bind. I ran into this problem many years ago, and I thought it was fixed, but the bug was called something like "can't replace a replacing vdev". I ultimately solved my problem by manually editing a fake vdev to have the same guid as the missing device, restarting the replace and then canceling it before zfs realized it was fake. But, I am almost certain that zpool detach can do this now, with the guid.
>
>I didn’t initiate a replace until after the disks were physically changed.  Although in this conversation I realize that things likely got confused by the replacement in the kernel’s mind of da3 with what used to be da4.  :-/

This is why your zpool history will be helpful. What did you actually try to replace, and what did you mean to replace?
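
Something along these lines would narrow the history down to the commands that matter here. This is only a sketch: "tank" is a placeholder pool name, and filter_replace_ops is a made-up helper name, not part of ZFS.

```shell
# Keep only the zpool history entries that change which devices are in the pool.
# Reads "zpool history" output on stdin and prints the matching lines.
filter_replace_ops() {
    grep -E 'zpool (replace|offline|online|detach|attach|clear)'
}

# Usage (substitute your own pool name for the placeholder "tank"):
#   zpool history tank | filter_replace_ops
```

Comparing the device names in those entries against what "zpool status" reports now should show whether the replace was issued against the disk you intended.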


>> If da10 has a label that says it is in the pool, it is probably the "replacing" vdev and should be picked up…
>
>Da10, now also /dev/label/drive03, seems to think it’s in the pool somewhere, according to zdb -l.
>But I’m not sure if this helps.  And, following your other message saying I shouldn’t put labels
>on disks that are to be used in their entirety as ZFS devices, I’ve deleted that label and
>"zpool labelclear"ed this device now (since the ZFS label still had the /dev/label/ path in it).