zpool/zfs-commands hang; ZFS filesystem/ZVOLs still running.
Date: Fri, 17 Jun 2022 15:46:07 UTC
Good morning, I have a 13.1 system where twice now (required hard restart last time to resolve) since upgrading from 12.x I have had zfs/zpool-command operations start to hang indefinitely, while the underlying pools are still actively running/processing data via both filesystem and device interfaces. I am running a custom kernel. There are no what I would consider extreme modifications, just a bunch of disabled drivers, and using CPUTYPE?=native. (E5-2609 v2). This hardware and kernel config (on 12.x) had been running without issues for years. Best I can tell from what monitoring data I have, and the timestamps on hung operations, this started with a zfs destroy operation (cron / zfstools scheduled snapshots); PID 12835 below at 6/16 10:00:07. Of tangential interest, the time and lstart fields for 12835 and 14356 are usually empty when I run the ps command below, but happened to show up when I grabbed this output. zabbix_agentd is a non-privileged monitoring daemon; it periodically polls (among other things) filesystem information, and its last message out before wedging was at 10:10. You can see that the zfs send (14356, started at 11:05) accumulated some CPU time before hanging, while it started almost an hour after a zfs snapshot destroy operation. From the destination side of that send (remote host) the last (zfs recv) output indicates that it was receiving an incremental update when 14356 stopped producing new data. I don't know if there is anything else of interest I can provide at this point, as I'm sure the first recommendation will (understandably) be to try vanilla. The system is still up and running, with ZFS access through the filesystem layer (+NFS to remote hosts and bhyve VM) and device layers (zvols via bhyve VM and remote iSCSI) still actively processing, and I'm not planning to take it down to try vanilla until some tasks that are in process via those interfaces are complete. Let me know if I can provide any other useful debugging information off the still-running system. Thanks for any suggestions, and as always thank you for the excellent software! - Eric # ps -Ao state,time,pid,lstart,lockname,mwchan,args | awk '/^D[^L]/' D 7:04.02 1452 Mon Jun 13 16:09:31 2022 - vfs_busy zabbix_agentd: active checks #1 [processing active checks] (zabbix_agentd) D 0:00.01 12835 Thu Jun 16 10:00:07 2022 - zfs tear zfs destroy -d kspace/users/eborisch@zfs-auto-snap_hourly-2022-06-15-07h00 D 0:05.09 14356 Thu Jun 16 11:05:01 2022 - zfs tear /sbin/zfs send -L -RI @MARK-mr-dewey kspace@NEW-mr-dewey D 0:00.00 23970 Fri Jun 17 03:01:00 2022 - zfsdev_s zpool list Ds 0:00.00 24280 Fri Jun 17 03:32:00 2022 - zfsdev_s /sbin/zfs get -H name system@MARK-mr-dewey Ds 0:00.00 24881 Fri Jun 17 04:32:00 2022 - zfsdev_s /sbin/zfs get -H name db@MARK-mr-dewey D 0:00.00 30957 Fri Jun 17 07:09:58 2022 - vfs_busy /bin/df -h D+ 0:00.00 31933 Fri Jun 17 08:45:14 2022 - zfsdev_s zpool status procstat -k output: 1452 101721 zabbix_agentd - mi_switch _sleep vfs_busy kern_getfsstat sys_getfsstat amd64_syscall fast_syscall_common 12835 100336 zfs - mi_switch _sleep rms_wlock zfs_umount dounmount zfsctl_snapshot_unmount zfs_ioc_destroy_snaps zfsdev_ioctl_common zfsdev_ioctl devfs_ioctl vn_ioctl devfs_ioctl_f kern_ioctl sys_ioctl amd64_syscall fast_syscall_common 14356 101770 zfs - mi_switch _sleep rms_wlock zfs_umount dounmount zfsctl_snapshot_unmount dsl_dataset_user_release_impl dsl_dataset_user_release_onexit zfs_onexit_destroy zfsdev_close devfs_fpdrop devfs_close_f _fdrop closef closefp amd64_syscall fast_syscall_common 23970 100339 zpool - mi_switch _sx_xlock_hard zfsdev_open devfs_open VOP_OPEN_APV vn_open_vnode vn_open_cred kern_openat filemon_wrapper_openat amd64_syscall fast_syscall_common 24280 101845 zfs - mi_switch _sx_xlock_hard zfsdev_open devfs_open VOP_OPEN_APV vn_open_vnode vn_open_cred kern_openat filemon_wrapper_openat amd64_syscall fast_syscall_common 24881 101280 zfs - mi_switch _sx_xlock_hard zfsdev_open devfs_open VOP_OPEN_APV vn_open_vnode vn_open_cred kern_openat filemon_wrapper_openat amd64_syscall fast_syscall_common 30957 101229 df - mi_switch _sleep vfs_busy kern_getfsstat sys_getfsstat amd64_syscall fast_syscall_common 31933 104201 zpool - mi_switch _sx_xlock_hard zfsdev_open devfs_open VOP_OPEN_APV vn_open_vnode vn_open_cred kern_openat filemon_wrapper_openat amd64_syscall fast_syscall_common