Re: Unable to replace drive in raidz1

From: Chris Ross <cross+freebsd_at_distal.com>
Date: Fri, 06 Sep 2024 19:34:36 UTC

> On Sep 6, 2024, at 15:16, Wes Morgan <morganw@gmail.com> wrote:
> 
> You probably don't want that. You will have to use the glabel dev, which will not be the same size as your other devices. IIRC you have no control over what device node the system finds first for the pool. Even if you use GPT labels, the daXpY device will still exist. 

Right.  But if I don’t _use_ those device names, it won’t matter.  If I use /dev/label/foo or /dev/gpt/foo, I’ll just always use those.  I just did that with the UFS disk I have, since its device name moved; it’s now “/dev/ufs/drive12” in /etc/fstab et al.

I want to have some sort of label.  I’d rather not add a partitioning scheme just to get a label when I know I’m going to use the whole disk, but I suppose I can if I have to.  Though I’d have to do it one disk at a time.  :-)
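
If I do end up going the GPT route, I think the per-disk steps would look roughly like the sketch below.  The disk name (da10) and the label (tank-d3) are placeholders I made up for illustration, and the run wrapper only prints the commands unless RUN=1 is set, so nothing touches a disk by accident.

```shell
#!/bin/sh
# Dry-run helper: prints each command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Put a GPT scheme on a whole disk and add one labeled freebsd-zfs partition.
# Both arguments are hypothetical examples, not names from my actual pool.
label_disk() {
    disk=$1
    label=$2
    run gpart create -s gpt "$disk"
    run gpart add -t freebsd-zfs -a 1m -l "$label" "$disk"
}

label_disk da10 tank-d3
```

The labeled partition should then show up as /dev/gpt/tank-d3, and that is the name I would hand to zpool instead of daXpY.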

> 
>> The former da3 is off-line, out of the chassis.  I replaced a disk in a full chassis, having them both online at the same time is not possible.  That drive in ZFS’s mind is only faulted because I tried “zpool offline -f” on it to see if that helped.
> 
> It sounds like you have replaced the wrong device. Check the "zpool history" to see what you did. 
> 
> In your earlier message, three devices were shown in each raidz, when what you should be seeing is that one raidz has an offline device identified by guid and maybe "was /dev/da3" that is being replaced, along with the replacement device. I don't see any of that. 

History attached.  There is no replacement device (sub-vdev) until after the “zpool replace” starts, which it won’t.

>> I didn’t initiate a replace until after the disks were physically changed.  Although in this conversation realize that things likely got confused by the replacement in the kernel’s mind of da3 with what used to be da4.  :-/
> 
> This is why your zpool history will be helpful. What did you actually try to replace, and what did you mean to replace. 

Here is all of my history since the previous boot in May.

2024-09-05.09:40:14 zpool offline tank da3
2024-09-05.14:26:44 zpool import -c /etc/zfs/zpool.cache -a -N
2024-09-05.14:32:45 zpool import -c /etc/zfs/zpool.cache -a -N
2024-09-05.14:52:18 zpool offline tank da3
2024-09-05.14:53:51 zpool offline tank da3
2024-09-05.14:59:43 zpool offline -f tank da3
2024-09-05.15:02:53 zpool clear tank
2024-09-05.15:07:41 zpool online tank da3
2024-09-05.15:10:00 zpool add tank spare da10
2024-09-05.15:10:20 zpool offline -f tank da3
2024-09-05.15:35:23 zpool remove tank da10
2024-09-05.15:54:35 zpool scrub tank
2024-09-05.16:01:12 zpool set autoreplace=on tank
2024-09-05.16:01:24 zpool set autoexpand=on tank
2024-09-05.16:02:16 zpool add -o ashift=9 tank spare da10
2024-09-06.10:10:20 zpool remove tank da10

So, I offline’d the disk-to-be-replaced at 09:40 yesterday, then I shut the system down, removed that physical device, replaced it with a larger disk, and rebooted.  I suspect the “offline”s after that are me experimenting when it was telling me it couldn’t start the replace action I was asking for.

I started the scrub yesterday only because the replace said something about an operation in progress.  It completed with no issues, but nothing changed w.r.t. my current problem.

I’m pretty sure the problem here is that the old da3 went away, and a new da3 came online as a member of raidz1-1.  The new disk I added came online as da10, for some reason.  I had to resolve the issue of the UFS disk which used to be da10 now being da9, but that was easy enough.  Just unexpected.
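
Since the daN names can’t be trusted after the shuffle, one thing that might sidestep the confusion is naming the missing vdev by GUID instead of by da3.  A rough sketch, assuming the pool is tank, the new disk is still da10, and OLD_GUID is filled in by hand from the status output (the default below is a placeholder, not a real GUID).  I can’t say whether this gets past the error I’m seeing.

```shell
#!/bin/sh
# Dry-run helper: prints each command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Replace a vdev identified by GUID rather than by its (possibly stale) daN name.
replace_by_guid() {
    pool=$1
    guid=$2
    newdev=$3
    run zpool replace "$pool" "$guid" "$newdev"
}

# Show vdev GUIDs in place of device names, to find the one for the old disk.
run zpool status -g tank

# OLD_GUID must be copied from the output above; the default is a placeholder.
replace_by_guid tank "${OLD_GUID:-GUID_FROM_STATUS}" da10
```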

      - Chris