Constant state-changes in a ZFS array
Date: Fri, 10 Sep 2021 15:45:54 UTC
Hey all, I had a server go dark earlier this week, and after several hardware swaps I'm left scratching my head. The server is an HP DL380p Gen8 with a D3600 shelf attached, using two HP EF0600FARNA 600GB disks in a ZFS mirror (da0 and da1) and another 22 8TB Ultrastar disks in a ZFS RAID10 for data (da2 through da23, though da23 has been removed in this situation). They're all attached to an LSI SAS2308 operating in HBA mode.

The large array threw a disk shortly before the outage, which we would normally handle online as we've done dozens of times before. In this case there's a bigger problem I'm struggling with: in addition to the thrown disk, I'm now unable to bring the larger ZFS array online. Commands issued to check array status or bring it online during boot are stalling. The 2-disk zroot mirror is recognized on boot and loads, so I can get into the OS as normal, with the larger tank array failing to come online. Looking at syslog I'm seeing a regular stream of messages from devd regarding media and state-change events from ZFS, GEOM, and DEVFS.
Sample below:

Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2'
Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2'
Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS subsystem=ZFS type=resource.fs.zfs.statechange version=0 class=resource.fs.zfs.statechange pool_guid=9328454021323814501 vdev_guid=8915574321583737794'
Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2p1'
Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2p1'
Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da6'
Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2'
Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2'
Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS subsystem=ZFS type=resource.fs.zfs.statechange version=0 class=resource.fs.zfs.statechange pool_guid=9328454021323814501 vdev_guid=8915574321583737794'
Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2p1'
Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2p1'
Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da6'
Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da6'
Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS subsystem=ZFS type=resource.fs.zfs.statechange version=0 class=resource.fs.zfs.statechange pool_guid=9328454021323814501 vdev_guid=7024987654522270730'
Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da6p1'
Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da6p1'
Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da9'
Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da9'
Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS subsystem=ZFS type=resource.fs.zfs.statechange version=0 class=resource.fs.zfs.statechange pool_guid=9328454021323814501 vdev_guid=4207599288564790488'
Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da9p1'
Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da9p1'

The disk devices appearing in these messages are all disks in the RAID10 array. They appear as a group, every 5 seconds. Furthermore, the state changes seem to be happening evenly across all affected devices, with the exception of da15, which logs exactly half as many events as the others. Here's a count from /var/log/messages piped to sort and uniq (count, then message):

32100 cdev=da10'
32100 cdev=da10p1'
32100 cdev=da11'
32100 cdev=da11p1'
32100 cdev=da12'
32100 cdev=da12p1'
32100 cdev=da13'
32100 cdev=da13p1'
32100 cdev=da14'
32100 cdev=da14p1'
16050 cdev=da15'
16050 cdev=da15p1'
32100 cdev=da16'
32100 cdev=da16p1'
32100 cdev=da17'
32100 cdev=da17p1'
32100 cdev=da18'
32100 cdev=da18p1'
32100 cdev=da19'
32100 cdev=da19p1'
32100 cdev=da2'
32100 cdev=da20'
32100 cdev=da20p1'
32100 cdev=da21'
32100 cdev=da21p1'
32100 cdev=da22'
32100 cdev=da22p1'
32100 cdev=da2p1'
32100 cdev=da3'
32100 cdev=da3p1'
32100 cdev=da4'
32100 cdev=da4p1'
32100 cdev=da5'
32100 cdev=da5p1'
32100 cdev=da6'
32100 cdev=da6p1'
32100 cdev=da7'
32100 cdev=da7p1'
32100 cdev=da8'
32100 cdev=da8p1'
32100 cdev=da9'
32100 cdev=da9p1'

I can run diskinfo against all the listed disks with no problem, and I see them via camcontrol. I can also issue a reset via camcontrol to both the chassis and the D3600 without issues. sesutil map sees the chassis, shelf, and all disk devices.
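In case anyone wants to reproduce the tally or pull out which vdevs are flapping, something along these lines works; this is a sketch, with a small heredoc standing in for the real /var/log/messages, and the awk field split is my assumption about the message format:

```shell
# Stand-in for /var/log/messages: a few devd lines in the format above.
cat <<'EOF' > /tmp/messages.sample
Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2'
Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2'
Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da6'
Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS subsystem=ZFS type=resource.fs.zfs.statechange version=0 class=resource.fs.zfs.statechange pool_guid=9328454021323814501 vdev_guid=8915574321583737794'
EOF

# Per-device event counts: keep everything after 'cdev=', then tally.
grep 'type=MEDIACHANGE' /tmp/messages.sample \
  | awk -F'cdev=' '{print $2}' \
  | sort | uniq -c | sort -rn

# Distinct vdev GUIDs from the ZFS statechange events, for matching
# against 'zdb -l <partition>' label output on each member disk.
grep -o 'vdev_guid=[0-9]*' /tmp/messages.sample | sort -u
```

The second grep is the part I'd look at first: matching those vdev GUIDs against the labels on each member partition should confirm whether the flapping devices all sit behind one expander path.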
So far I've swapped the LSI controller, the D3600 shelf (twice), and the cabling; same behavior. Previously, when a collection of disks went problematic like this, we swapped the D3600 shelf or occasionally just reseated the external cabling and everything came back to normal. Not this time. I'm scheduling a chassis swap for next week, but figured I'd throw this out here to see if anyone has seen this before. Thanks!

Mike Proto