Re: ZFS deadlock in 14
- Reply: Dag-Erling_Smørgrav : "Re: ZFS deadlock in 14"
- In reply to: Dag-Erling_Smørgrav : "Re: ZFS deadlock in 14"
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Sat, 19 Aug 2023 18:13:36 UTC
On 18.08.2023 18:34, Dag-Erling Smørgrav wrote: > Dag-Erling Smørgrav <des@FreeBSD.org> writes: >> Plot twist: c47116e909 _without_ the patches also appears to be working >> fine. The last kernel I know for sure deadlocks is b36f469a15, so I'm >> going to test cd25b0f740 and 28d2e3b5de. > > c47116e909 with cd25b0f740 and 28d2e3b5de reverted deadlocks, see > attached ddb.txt. I'm going to see if reverting only 28d2e3b5de but not > cd25b0f740 changes anything. Yes, it looks like ZIL-related deadlock: ZFS sync thread in zil_sync() is waiting for allocated ZIL zios to complete: Tracing command zfskern pid 5 tid 101124 td 0xfffffe0408cbb020 sched_switch() at sched_switch+0x5da/frame 0xfffffe04090f7900 mi_switch() at mi_switch+0x173/frame 0xfffffe04090f7920 sleepq_switch() at sleepq_switch+0x104/frame 0xfffffe04090f7960 _cv_wait() at _cv_wait+0x165/frame 0xfffffe04090f79c0 zil_sync() at zil_sync+0x9b/frame 0xfffffe04090f7aa0 dmu_objset_sync() at dmu_objset_sync+0x51b/frame 0xfffffe04090f7b70 dsl_pool_sync() at dsl_pool_sync+0x11d/frame 0xfffffe04090f7bf0 spa_sync() at spa_sync+0xc68/frame 0xfffffe04090f7e20 txg_sync_thread() at txg_sync_thread+0x2eb/frame 0xfffffe04090f7ef0 fork_exit() at fork_exit+0x82/frame 0xfffffe04090f7f30 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe04090f7f30 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- Some thread requested fsync(), allocated zil zio, but stuck waiting for z_teardown_inactive_lock in attempt to get data to be written into zil, so zios were never even issued: Tracing command blacklistd pid 521 tid 101136 td 0xfffffe040d08d000 sched_switch() at sched_switch+0x5da/frame 0xfffffe040c25c710 mi_switch() at mi_switch+0x173/frame 0xfffffe040c25c730 sleepq_switch() at sleepq_switch+0x104/frame 0xfffffe040c25c770 _sleep() at _sleep+0x2d6/frame 0xfffffe040c25c810 rms_rlock_fallback() at rms_rlock_fallback+0xd0/frame 0xfffffe040c25c850 zfs_freebsd_reclaim() at zfs_freebsd_reclaim+0x2b/frame 0xfffffe040c25c880 VOP_RECLAIM_APV() at VOP_RECLAIM_APV+0x35/frame 0xfffffe040c25c8a0 vgonel() at vgonel+0x3a9/frame 0xfffffe040c25c910 vnlru_free_impl() at vnlru_free_impl+0x371/frame 0xfffffe040c25c990 vn_alloc_hard() at vn_alloc_hard+0xd3/frame 0xfffffe040c25c9b0 getnewvnode_reserve() at getnewvnode_reserve+0xa0/frame 0xfffffe040c25c9d0 zfs_zget() at zfs_zget+0x1f/frame 0xfffffe040c25ca80 zfs_get_data() at zfs_get_data+0x62/frame 0xfffffe040c25cb20 zil_lwb_commit() at zil_lwb_commit+0x32f/frame 0xfffffe040c25cb70 zil_lwb_write_issue() at zil_lwb_write_issue+0x4e/frame 0xfffffe040c25cbb0 zil_commit_impl() at zil_commit_impl+0x943/frame 0xfffffe040c25cd40 zfs_fsync() at zfs_fsync+0x8f/frame 0xfffffe040c25cd80 kern_fsync() at kern_fsync+0x18a/frame 0xfffffe040c25ce00 amd64_syscall() at amd64_syscall+0x138/frame 0xfffffe040c25cf30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe040c25cf30 --- syscall (95, FreeBSD ELF64, fsync), rip = 0x3f979eeb074a, rsp = 0x3f979a449c38, rbp = 0x3f979a449c50 --- Third thread doing zfs rollback while holding z_teardown_inactive_lock tries to wait for transaction commit, causing deadlock: Tracing command zfs pid 65636 tid 138109 td 0xfffffe0438d721e0 sched_switch() at sched_switch+0x5da/frame 0xfffffe0439b2b950 mi_switch() at mi_switch+0x173/frame 0xfffffe0439b2b970 sleepq_switch() at sleepq_switch+0x104/frame 0xfffffe0439b2b9b0 _cv_wait() at _cv_wait+0x165/frame 0xfffffe0439b2ba10 txg_wait_synced_impl() at txg_wait_synced_impl+0xeb/frame 0xfffffe0439b2ba50 txg_wait_synced() at txg_wait_synced+0xb/frame 0xfffffe0439b2ba60 zfsvfs_teardown() at zfsvfs_teardown+0x203/frame 0xfffffe0439b2bab0 zfs_ioc_rollback() at zfs_ioc_rollback+0x12f/frame 0xfffffe0439b2bb00 zfsdev_ioctl_common() at zfsdev_ioctl_common+0x612/frame 0xfffffe0439b2bbc0 zfsdev_ioctl() at zfsdev_ioctl+0x12a/frame 0xfffffe0439b2bbf0 devfs_ioctl() at devfs_ioctl+0xd2/frame 0xfffffe0439b2bc40 vn_ioctl() at vn_ioctl+0xc2/frame 0xfffffe0439b2bcb0 devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe0439b2bcd0 kern_ioctl() at kern_ioctl+0x286/frame 0xfffffe0439b2bd30 sys_ioctl() at sys_ioctl+0x152/frame 0xfffffe0439b2be00 amd64_syscall() at amd64_syscall+0x138/frame 0xfffffe0439b2bf30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0439b2bf30 --- syscall (54, FreeBSD ELF64, ioctl), rip = 0x1afaddea3aaa, rsp = 0x1afad4058328, rbp = 0x1afad40583a0 --- Unfortunately I think the current code in main should still suffer from this specific deadlock. cd25b0f740 fixes some deadlocks in this area, may be that is why you are getting issues less often, but I don't believe it fixes this specific one, may be you was just lucky. Only https://github.com/openzfs/zfs/pull/15122 I believe should fix them. -- Alexander Motin