Re: Unable to replace drive in raidz1
- Reply: Chris Ross : "Re: Unable to replace drive in raidz1"
- In reply to: Chris Ross : "Re: Unable to replace drive in raidz1"
Date: Fri, 06 Sep 2024 21:22:44 UTC
On September 6, 2024 2:34:36 PM CDT, Chris Ross <cross+freebsd@distal.com> wrote:
>
>
>> On Sep 6, 2024, at 15:16, Wes Morgan <morganw@gmail.com> wrote:
>>
>> You probably don't want that. You will have to use the glabel dev, which will not be the same size as your other devices. IIRC you have no control over what device node the system finds first for the pool. Even if you use GPT labels, the daXpY device will still exist.
>
>Right. But if I don’t _use_ those device names, it won’t matter. If I use /dev/label/foo, or /dev/gpt/foo, I’ll just always use those. I just did that with the ufs disk I have since it moved names, now it’s “/dev/ufs/drive12” in /etc/fstab et al.

The labels are helpful for fstab, but zfs doesn't need fstab. In the early days of zfs on freebsd the unpartitioned device was recommended; maybe that's not accurate any longer, but I still follow it for a pool that contains vdevs with multiple devices (raidz). If you use, e.g., da0 in a pool, you cannot later replace it with a labeled device of the same size; it won't have enough sectors.

>I want to have some sort of label. I’d rather not have to add a partitioning scheme to the disk if I know I’m just going to use the whole disk just to get a label, but I suppose if I have to I can. Though I’d have to do it one disk at a time. :-)

ZFS will absolutely find the device if it is readable. The label on every device contains enough metadata to describe the entire vdev (and the pool I believe), including the missing devices. It's very good at finding them. The labelclear command was added because it was a pain to get zfs to give up on a disk that has been repurposed. You really don't need the labels, but if you have trouble figuring out which disk is which, that may be the only way for you to be sure.

>>
>>> The former da3 is off-line, out of the chassis. I replaced a disk in a full chassis, having them both online at the same time is not possible. That drive in ZFS’s mind is only faulted because I tried “zpool offline -f” on it to see if that helped.
>>
>> It sounds like you have replaced the wrong device. Check the "zpool history" to see what you did.
>>
>> In your earlier message, three devices were shown in each raidz, when what you should be seeing is that one raidz has an offline device identified by guid and maybe "was /dev/da3" that is being replaced, along with the replacement device. I don't see any of that.
>
>History attached. There is no replacement device (sub-vdev) until after the “zpool replace” starts, which it won’t.
>
>>> I didn’t initiate a replace until after the disks were physically changed. Although in this conversation realize that things likely got confused by the replacement in the kernel’s mind of da3 with what used to be da4. :-/
>>
>> This is why your zpool history will be helpful. What did you actually try to replace, and what did you mean to replace?
>
>All of my history since the last previous boot in May.
>
>2024-09-05.09:40:14 zpool offline tank da3
>2024-09-05.14:26:44 zpool import -c /etc/zfs/zpool.cache -a -N
>2024-09-05.14:32:45 zpool import -c /etc/zfs/zpool.cache -a -N
>2024-09-05.14:52:18 zpool offline tank da3
>2024-09-05.14:53:51 zpool offline tank da3
>2024-09-05.14:59:43 zpool offline -f tank da3
>2024-09-05.15:02:53 zpool clear tank
>2024-09-05.15:07:41 zpool online tank da3
>2024-09-05.15:10:00 zpool add tank spare da10
>2024-09-05.15:10:20 zpool offline -f tank da3
>2024-09-05.15:35:23 zpool remove tank da10
>2024-09-05.15:54:35 zpool scrub tank
>2024-09-05.16:01:12 zpool set autoreplace=on tank
>2024-09-05.16:01:24 zpool set autoexpand=on tank
>2024-09-05.16:02:16 zpool add -o ashift=9 tank spare da10
>2024-09-06.10:10:20 zpool remove tank da10
>
>So, I offline’d the disk-to-be-replaced at 09:40 yesterday, then I shut the system down, removed that physical device replacing it with a larger disk, and rebooted. I suspect the “offline”s after that are me experimenting when it was telling me it couldn’t start the replace action I was asking for.

This is probably where you made your mistake. Rebooting shifted another device into da3. When you tried to offline it, you were probably either targeting a device in a different raidz or one that wasn't in the pool. The output of those original offline commands would have been informative. You could also check dmesg and map the serial numbers to device assignments to figure out what device moved to da3.

>The scrub I started yesterday just because the replace says something about an operation in progress, so I did that. It completed with no issues, but nothing changed w.r.t. my current problem.
>
>I’m pretty sure the problem here is that the old da3 went away, and a new da3 came online as a member of raidz1-1. The new disk I added came online as da10, for some reason. I had to resolve the issue of the UFS disk which used to be da10 now being da9, but that was easy enough. Just unexpected.

Sounds about right. In another message it seemed like the pool had started an autoreplace. So I assume you have zfsd enabled? That is what issues the replace command. Strange that it is not anywhere in the pool history. There should be syslog entries for any actions it took.

In your case, it appears that you had two missing devices: the original "da3" that was physically removed, and the new da3 that you forced offline. You added da10 as a spare, when what you needed to do was a replace. Spare devices do not auto-replace without zfsd running and autoreplace set to on. This should all be reported in zpool status. In your original message, there is no sign of a replacement in progress or a spare device, assuming you didn't omit something.

If the pool is only showing that a single device is missing, and that device is to be replaced by da10, zero out the first and last sectors (I think a zfs label is 128k?) to wipe out any labels and use the replace command, not spare, e.g. "zpool replace tank da3 da10", or use the missing guid as suggested elsewhere. This should work based on the information provided.
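
To make the "zero the labels, then replace" step concrete, here is a rough dry-run sketch. It only prints the commands it would run and touches no disk. The helper names `wipe_plan` and `replace_plan` are made up for illustration, and the device size has to be supplied by hand (for example from `diskinfo`). On the label size: as far as I know the on-disk format keeps four labels of 256 KiB each, two at the front of the device and two at the end, so the sketch assumes that layout rather than 128k; treat it as an assumption and check it against your system. If `zpool labelclear` is available, running it on the new disk is likely the simpler route.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands, executes nothing.
# Assumption (not from the thread): each ZFS label is 256 KiB, with two
# copies at the start of the device and two at the end.

LABEL_BYTES=262144   # 256 KiB

# wipe_plan DEVICE SIZE_BYTES  (hypothetical helper)
# Prints dd commands that would zero the two front and two back labels.
wipe_plan() {
    dev=$1
    # Number of whole 256 KiB blocks; the back labels occupy the last two.
    blocks=$(( $2 / $LABEL_BYTES ))
    echo "dd if=/dev/zero of=/dev/$dev bs=256k count=2"
    echo "dd if=/dev/zero of=/dev/$dev bs=256k seek=$(( $blocks - 2 )) count=2"
}

# replace_plan POOL OLD NEW  (hypothetical helper)
# OLD may be the old device name or the missing vdev's guid.
replace_plan() {
    echo "zpool replace $1 $2 $3"
}

# Example with a 1 GiB device size, purely for illustration.
wipe_plan da10 1073741824
replace_plan tank da3 da10
```

Printing the plan first gives you a chance to confirm that da10 really is the new disk (by serial number) before anything destructive is run.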