zfs mirror pool online but drives have read errors
Date: Sat, 26 Mar 2022 16:45:57 UTC
Hi all, English is not my native language, sorry about any errors.

I'm experiencing something I don't fully understand, maybe someone here can offer some insight. I have a ZFS mirror of two Samsung 980 Pro 2TB NVMe drives. According to ZFS the pool is ONLINE, but it repaired 54M on the last scrub, and another scrub today needed repairs again (only 128K this time).

  pool: zextra
 state: ONLINE
  scan: scrub repaired 54M in 0 days 00:41:42 with 0 errors on Thu Mar 24 09:44:02 2022
config:

	NAME        STATE     READ WRITE CKSUM
	zextra      ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    nvd2    ONLINE       0     0     0
	    nvd3    ONLINE       0     0     0

errors: No known data errors

In dmesg I have messages like this:

nvme2: UNRECOVERED READ ERROR (02/81) sqid:3 cid:80 cdw0:0
nvme2: READ sqid:8 cid:119 nsid:1 lba:3831589512 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:8 cid:119 cdw0:0
nvme2: READ sqid:2 cid:123 nsid:1 lba:186822304 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:2 cid:123 cdw0:0
nvme2: READ sqid:5 cid:97 nsid:1 lba:186822560 len:256

and also for the other drive:

nvme3: READ sqid:7 cid:84 nsid:1 lba:1543829024 len:256
nvme3: UNRECOVERED READ ERROR (02/81) sqid:7 cid:84 cdw0:0

smartctl does see the errors, but still says "SMART overall-health self-assessment test result: PASSED":

Media and Data Integrity Errors:    190
Error Information Log Entries:      190

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc         LBA  NSID  VS
  0        190     1  0x006e  0xc502  0x000  3649951416     1   -
  1        189     6  0x0067  0xc502  0x000  2909882960     1   -

and for the other drive:

Media and Data Integrity Errors:    284
Error Information Log Entries:      284

Is the following thinking somewhat correct?

- ZFS doesn't remove the drives because there are no write errors, and I've been lucky so far in that the read errors were repairable.
- Both drives are unreliable; if it were a hardware problem elsewhere (both sit on a PCIe card, not on the motherboard) or a software problem, smartctl would not find these errors in the drives' own logs.

I'll replace one drive and see if the errors go away for that drive; if that works I'll replace the other one as well (rough command sketch at the end of this post). I have this same setup on another machine, and that one is error free.

Could more expensive SSDs have made a difference here? According to smartctl I've now written 50TB, and these drives should be good for 1200 TBW.

I back up the drives by making a snapshot and then using "zfs send > imgfile" to a hard drive. What would have happened here if more and more read errors had occurred? I may change this to a separate imgfile for even and odd days, or even one for every day of the week if I have enough room for that (see the sketch at the end of this post).

Thanks for any input,
Bram
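
P.S. For clarity, this is roughly how I plan to do the first drive swap. nvd4 is just a placeholder for whatever device name the new drive gets:

    zpool offline zextra nvd2
    (shut down, physically swap the drive, boot again)
    zpool replace zextra nvd2 nvd4
    zpool status -v zextra      (to watch the resilver)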
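
P.P.S. And this is a rough, untested sketch of the even/odd-day rotation I mentioned for the backups; the image file paths are just examples from my setup:

    #!/bin/sh
    # snapshot the pool and send it to an image file on the backup disk,
    # alternating between two files by even/odd day of the month
    DAY=$(date +%d); DAY=${DAY#0}    # strip leading zero so sh does not treat it as octal
    SNAP="zextra@backup-$(date +%Y%m%d)"
    if [ $((DAY % 2)) -eq 0 ]; then
        IMG="/backup/zextra-even.img"
    else
        IMG="/backup/zextra-odd.img"
    fi
    zfs snapshot "$SNAP"
    zfs send "$SNAP" > "$IMG"

(Old snapshots would still need to be cleaned up separately.)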