ZFS lock up 9-stable r244911 (Jan)
Andriy Gapon
avg at FreeBSD.org
Sat Jan 19 17:01:07 UTC 2013
on 18/01/2013 17:01 Gavin Atkinson said the following:
>
> Hi all,
>
> I have a machine on which ZFS appears to have locked up, along with any
> processes that attempt to access the ZFS filesystems. This machine is running
> 9-stable amd64 r244911 (though from cvs, not SVN), and therefore I believe
> it has all of avg's ZFS deadlock patches.
>
> This machine has both UFS and ZFS filesystems. All of the "system"
> filesystems are on UFS, and as a result the machine itself is responsive
> and I can investigate state using kgdb against the live kernel. I've
> included all thread backtraces, a couple of other bits relating to held
> locks, and ps/sysctl output at
> http://people.freebsd.org/~gavin/tay-zfs-hang.txt
> http://people.freebsd.org/~gavin/tay-sysctl-a.txt
> http://people.freebsd.org/~gavin/tay-ps-auxwwwH.txt
>
> This machine was in use as a pkgng package builder, using poudriere.
> Poudriere makes heavy use of zfs filesystems within jails, including "zfs
> get", "zfs set", "zfs snapshot", "zfs rollback", "zfs diff" and other
> commands, although there do not appear to be any instances of the zfs
> process running currently. At the time of the hang 16 parallel builds were
> in progress.
>
> The underlying disk subsystem is a single hardware RAID-10 on a twa
> controller, and the zpool is on a single partition of this device. The
> RAID-10 itself is intact, the controller reports no errors. There is no
> L2ARC or separate ZIL. The UFS filesystems (which still seem to be fully
> functional) are on separate partitions on the same underlying device, so I
> do not believe the underlying storage is having issues. I can "dd" from
> the underlying ZFS partition without problem. Nothing has been logged to
> /var/log/messages.
Based on the above information, plus some additional debugging information that
Gavin has kindly provided to me, I've developed the following "theory" to
explain this deadlock.

I believe that there was very high (overwhelmingly high) disk load before the
deadlock occurred. Further, I think that there was a substantial number of
high-priority writes. Under those conditions the number of in-progress/pending
zio-s was constantly at zfs_vdev_max_pending (by default 10), and the number of
queued zio-s was above one hundred:
vdev_queue = {
    vq_deadline_tree = {avl_root = 0xfffffe0338dbb248,
        avl_compar = 0xffffffff816855b0 <vdev_queue_deadline_compare>,
        avl_offset = 584, avl_numnodes = 116, avl_size = 896},
    vq_read_tree = {avl_root = 0xfffffe019d0b65b0,
        avl_compar = 0xffffffff81685600 <vdev_queue_offset_compare>,
        avl_offset = 560, avl_numnodes = 8, avl_size = 896},
    vq_write_tree = {avl_root = 0xfffffe03e3d19230,
        avl_compar = 0xffffffff81685600 <vdev_queue_offset_compare>,
        avl_offset = 560, avl_numnodes = 108, avl_size = 896},
    vq_pending_tree = {avl_root = 0xfffffe025e32c230,
        avl_compar = 0xffffffff81685600 <vdev_queue_offset_compare>,
        avl_offset = 560, avl_numnodes = 10, avl_size = 896},
    vq_lock = {lock_object = {lo_name = 0xffffffff8172afc9 "vq->vq_lock",
        lo_flags = 40960000, lo_data = 0, lo_witness = 0x0}, sx_lock = 1}},
vdev_cache = {
    vc_offset_tree = {avl_root = 0x0,
        avl_compar = 0xffffffff81681740 <vdev_cache_offset_compare>,
        avl_offset = 24, avl_numnodes = 0, avl_size = 88},
    vc_lastused_tree = {avl_root = 0x0,
        avl_compar = 0xffffffff81681760 <vdev_cache_lastused_compare>,
        avl_offset = 48, avl_numnodes = 0, avl_size = 88}
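
To illustrate why the 116 queued zio-s were stuck behind the 10 pending ones,
here is an approximate sketch of the issue gating in vdev_queue.c (paraphrased
from my reading of the sources of that vintage, not verbatim): once
vq_pending_tree is at zfs_vdev_max_pending, nothing more is issued until a
pending zio completes.

static zio_t *
vdev_queue_io_to_issue(vdev_queue_t *vq, uint64_t pending_limit)
{
	ASSERT(MUTEX_HELD(&vq->vq_lock));

	/* Nothing is issued while the vdev is already at its pending limit. */
	if (avl_numnodes(&vq->vq_pending_tree) >= pending_limit ||
	    avl_numnodes(&vq->vq_deadline_tree) == 0)
		return (NULL);

	/*
	 * ... otherwise pick (and possibly aggregate) the next zio, move it
	 * from the deadline/offset trees to vq_pending_tree and return it ...
	 */
}

If I read vdev_queue_io_done() correctly, it re-runs the above with
pending_limit = zfs_vdev_max_pending when a pending zio completes, so further
progress of the queue depends entirely on completions being processed.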
Apparently, processing of zio-s was lagging so far behind that some of them
triggered the "late arrival" condition. My incomplete understanding shows here -
I am not sure what exactly triggers the condition and what is so special about
it (a paraphrased sketch of that path follows after the stack traces below), but
from the following stack traces we can see that all five zio_write_intr_high
taskqueue threads were executing dmu_sync_late_arrival_done():
0 100432 kernel zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1
dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0
zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46
fork_exit+0x11f fork_trampoline+0xe
0 100433 kernel zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1
dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0
zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46
fork_exit+0x11f fork_trampoline+0xe
0 100434 kernel zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1
dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0
zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46
fork_exit+0x11f fork_trampoline+0xe
0 100435 kernel zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1
dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0
zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46
fork_exit+0x11f fork_trampoline+0xe
0 100436 kernel zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1
dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0
zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46
fork_exit+0x11f fork_trampoline+0xe
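
For what it's worth, this is how I read the late-arrival path in the
(illumos-derived) dmu.c sources; paraphrased and abbreviated, so take it with a
grain of salt. dmu_sync() takes this branch when the txg that dirtied the data
is already syncing (or synced), and the resulting write zio gets
dmu_sync_late_arrival_done() as its done callback:

int
dmu_sync(zio_t *pio, uint64_t txg, dmu_sync_cb_t *done, zgd_t *zgd)
{
	...
	if (txg <= spa_last_synced_txg(os->os_spa))
		return (EEXIST);	/* already on disk, nothing to do */

	if (txg <= spa_syncing_txg(os->os_spa)) {
		/*
		 * The txg is already syncing, so we may not touch its
		 * dirty record anymore; write a new block instead.
		 */
		return (dmu_sync_late_arrival(pio, os, done, zgd, &zp, &zb));
	}
	...
}

static int
dmu_sync_late_arrival(zio_t *pio, objset_t *os, dmu_sync_cb_t *done, zgd_t *zgd,
    zio_prop_t *zp, zbookmark_t *zb)
{
	dmu_tx_t *tx = dmu_tx_create(os);

	dmu_tx_hold_space(tx, zgd->zgd_db->db_size);
	if (dmu_tx_assign(tx, TXG_WAIT) != 0) {
		dmu_tx_abort(tx);
		return (EIO);
	}
	...
	zio_nowait(zio_write(pio, os->os_spa, dmu_tx_get_txg(tx), zgd->zgd_bp,
	    zgd->zgd_db->db_data, zgd->zgd_db->db_size, zp,
	    dmu_sync_late_arrival_ready, dmu_sync_late_arrival_done, dsa,
	    ZIO_PRIORITY_SYNC_WRITE, ZIO_FLAG_CANFAIL, zb));

	return (0);
}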
In addition to those stack traces, the taskqueue associated with these threads
has another 9 pending tasks.

As you can see, the "late arrival" code path involves txg_rele_to_sync(), where
an instance of tc_lock is acquired.
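
Roughly (again paraphrased from the sources rather than quoted exactly), the
done callback commits the tx that dmu_sync_late_arrival() assigned, and
dmu_tx_commit() -> txg_rele_to_sync() has to take the per-CPU tc_lock:

static void
dmu_sync_late_arrival_done(zio_t *zio)
{
	dmu_sync_arg_t *dsa = zio->io_private;

	/* ... block pointer / error bookkeeping elided ... */

	dmu_tx_commit(dsa->dsa_tx);		/* -> txg_rele_to_sync() */
	dsa->dsa_done(dsa->dsa_zgd, zio->io_error);
	kmem_free(dsa, sizeof (dmu_sync_arg_t));
}

void
txg_rele_to_sync(txg_handle_t *th)
{
	tx_cpu_t *tc = th->th_cpu;
	int g = th->th_txg & TXG_MASK;

	mutex_enter(&tc->tc_lock);	/* the taskqueue threads block here */
	ASSERT(tc->tc_count[g] != 0);
	if (--tc->tc_count[g] == 0)
		cv_broadcast(&tc->tc_cv[g]);
	mutex_exit(&tc->tc_lock);

	th->th_cpu = NULL;
}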
Further, it looks like tc_lock instances are held by the following threads:
64998 101921 pkg initial thread mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154
dmu_tx_assign+0x370 zfs_freebsd_create+0x310 VOP_CREATE_APV+0x31
vn_open_cred+0x4b7 kern_openat+0x20a amd64_syscall+0x540 Xfast_syscall+0xf7
66152 102491 pkg initial thread mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154
dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 vn_write+0x37e
vn_io_fault+0x90 dofilewrite+0x85 kern_writev+0x6c sys_write+0x64
amd64_syscall+0x540 Xfast_syscall+0xf7
75803 101638 find - mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154
dmu_tx_assign+0x370 zfs_inactive+0x1b7 zfs_freebsd_inactive+0x1a vinactive+0x86
vputx+0x2d8 sys_fchdir+0x356 amd64_syscall+0x540 Xfast_syscall+0xf7
75809 102932 find - mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154
dmu_tx_assign+0x370 zfs_inactive+0x1b7 zfs_freebsd_inactive+0x1a vinactive+0x86
vputx+0x2d8 sys_fchdir+0x356 amd64_syscall+0x540 Xfast_syscall+0xf7
75813 101515 find - mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154
dmu_tx_assign+0x370 zfs_inactive+0x1b7 zfs_freebsd_inactive+0x1a vinactive+0x86
vputx+0x2d8 sys_fchdir+0x356 amd64_syscall+0x540 Xfast_syscall+0xf7
77468 101412 update-mime-databas initial thread mi_switch+0x186
sleepq_wait+0x42 _sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154
dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 vn_write+0x37e
vn_io_fault+0x90 dofilewrite+0x85 kern_writev+0x6c sys_write+0x64
amd64_syscall+0x540 Xfast_syscall+0xf7
These threads call txg_delay(), also because of the high load.

In the code we see that dmu_tx_assign first grabs an instance of tc_lock and
then calls dsl_dir_tempreserve_space. Also, we see that txg_delay tries to
acquire tx_sync_lock, and that's where all these threads are blocked.
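
A condensed sketch of that ordering, as I understand it from the sources (the
exact code may differ a bit): txg_hold_open() takes the per-CPU tc_lock and
keeps it held across dmu_tx_try_assign(), i.e. across
dsl_dir_tempreserve_space() -> dsl_pool_tempreserve_space() -> txg_delay(),
until txg_rele_to_quiesce() is called at the end of a successful
dmu_tx_assign():

uint64_t
txg_hold_open(dsl_pool_t *dp, txg_handle_t *th)
{
	tx_state_t *tx = &dp->dp_tx;
	tx_cpu_t *tc = &tx->tx_cpu[CPU_SEQID];
	uint64_t txg;

	mutex_enter(&tc->tc_lock);	/* held until txg_rele_to_quiesce() */
	txg = tx->tx_open_txg;
	tc->tc_count[txg & TXG_MASK]++;

	th->th_cpu = tc;
	th->th_txg = txg;

	return (txg);
}

void
txg_delay(dsl_pool_t *dp, uint64_t txg, int ticks)
{
	tx_state_t *tx = &dp->dp_tx;
	...
	mutex_enter(&tx->tx_sync_lock);	/* the threads above block here */
	/* ... cv_timedwait() on tx_quiesce_more_cv for up to "ticks" ... */
	mutex_exit(&tx->tx_sync_lock);
}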
Then we see that txg_sync_thread holds tx_sync_lock, but is in turn blocked
waiting for its zio:
1552 100544 zfskern txg_thread_enter mi_switch+0x186 sleepq_wait+0x42
_cv_wait+0x112 zio_wait+0x61 dbuf_read+0x5e5 dmu_buf_hold+0xe0 zap_lockdir+0x58
zap_lookup_norm+0x45 zap_lookup+0x2e feature_get_refcount+0x4b
spa_feature_is_active+0x52 dsl_scan_active+0x63 txg_sync_thread+0x20d
fork_exit+0x11f fork_trampoline+0xe
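
The reason it holds tx_sync_lock at that point: txg_thread_enter() acquires
tx_sync_lock and the sync loop only drops it inside txg_thread_wait() or around
the actual spa_sync() call; the dsl_scan_active() check in the loop condition
runs with the lock held and, as the trace shows, it can end up doing synchronous
I/O. A rough sketch (paraphrased, not the literal code):

static void
txg_sync_thread(void *arg)
{
	dsl_pool_t *dp = arg;
	tx_state_t *tx = &dp->dp_tx;
	callb_cpr_t cpr;

	txg_thread_enter(tx, &cpr);	/* mutex_enter(&tx->tx_sync_lock) */

	for (;;) {
		...
		/*
		 * dsl_scan_active() is evaluated with tx_sync_lock held and
		 * may do a zap_lookup() -> dbuf_read() -> zio_wait().
		 */
		while (!dsl_scan_active(dp->dp_scan) && !tx->tx_exiting &&
		    timer > 0 &&
		    tx->tx_synced_txg >= tx->tx_sync_txg_waiting &&
		    tx->tx_quiesced_txg == 0) {
			txg_thread_wait(tx, &cpr, &tx->tx_sync_more_cv, timer);
			...
		}
		...
	}
}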
So, to summarize:

For some completed zio-s, their zio_done routines are blocked in
dmu_sync_late_arrival_done -> txg_rele_to_sync -> tc_lock.
The tc_locks are held by threads in dmu_tx_assign -> .. -> txg_delay, where
txg_delay is blocked on tx_sync_lock.
tx_sync_lock is held by txg_sync_thread, which waits for its zio to be
processed.
That zio is held on the vdev queue and is not getting processed because the
vdev already has too many pending/in-progress zio-s, whose completion
processing is in turn stuck behind the blocked taskqueue threads above. That
closes the cycle.
This theory looks plausible to me, but I'd like to hear what the experts think.
An even more important question is how this situation can be avoided.
--
Andriy Gapon