ZFS lock up 9-stable r244911 (Jan)
Andriy Gapon
avg at FreeBSD.org
Sat Jan 19 17:01:07 UTC 2013
on 18/01/2013 17:01 Gavin Atkinson said the following:
>
> Hi all,
>
> I have a machine on which ZFS appears to have locked up, along with any
> processes that attempt to access the ZFS filesystems. This machine is running
> 9-stable amd64 r244911 (though from cvs, not SVN), and therefore I believe
> it has all of avg's ZFS deadlock patches.
>
> This machine has both UFS and ZFS filesystems. All of the "system"
> filesystems are on UFS, and as a result the machine itself is responsive
> and I can investigate state using kgdb against the live kernel. I've
> included all thread backtraces, a couple of other bits relating to held
> locks, and ps/sysctl output at
> http://people.freebsd.org/~gavin/tay-zfs-hang.txt
> http://people.freebsd.org/~gavin/tay-sysctl-a.txt
> http://people.freebsd.org/~gavin/tay-ps-auxwwwH.txt
>
> This machine was in use as a pkgng package builder, using poudriere.
> Poudriere makes heavy use of zfs filesystems within jails, including "zfs
> get", "zfs set", "zfs snapshot", "zfs rollback", "zfs diff" and other
> commands, although there do not appear to be any instances of the zfs
> process running currently. At the time of the hang 16 parallel builds were
> in progress.
>
> The underlying disk subsystem is a single hardware RAID-10 on a twa
> controller, and the zpool is on a single partition of this device. The
> RAID-10 itself is intact, the controller reports no errors. There is no
> L2ARC or separate ZIL. The UFS filesystems (which still seem to be fully
> functional) are on separate partitions on the same underlying device, so I
> do not believe the underlying storage is having issues. I can "dd" from
> the underlying ZFS partition without problem. Nothing has been logged to
> /var/log/messages.
Based on the above information, plus some additional debugging information that
Gavin has kindly provided to me, I've developed the following "theory" to
explain this deadlock.

I believe that there was very high (overwhelmingly high) disk load before the
deadlock occurred. Further, I think that there was a substantial number of
high-priority writes. Under those conditions the number of in-progress/pending
zio-s was constantly at zfs_vdev_max_pending (by default 10), and the number of
queued zio-s was above one hundred:
vdev_queue = {
    vq_deadline_tree = {avl_root = 0xfffffe0338dbb248,
        avl_compar = 0xffffffff816855b0 <vdev_queue_deadline_compare>,
        avl_offset = 584, avl_numnodes = 116, avl_size = 896},
    vq_read_tree = {avl_root = 0xfffffe019d0b65b0,
        avl_compar = 0xffffffff81685600 <vdev_queue_offset_compare>,
        avl_offset = 560, avl_numnodes = 8, avl_size = 896},
    vq_write_tree = {avl_root = 0xfffffe03e3d19230,
        avl_compar = 0xffffffff81685600 <vdev_queue_offset_compare>,
        avl_offset = 560, avl_numnodes = 108, avl_size = 896},
    vq_pending_tree = {avl_root = 0xfffffe025e32c230,
        avl_compar = 0xffffffff81685600 <vdev_queue_offset_compare>,
        avl_offset = 560, avl_numnodes = 10, avl_size = 896},
    vq_lock = {lock_object = {lo_name = 0xffffffff8172afc9 "vq->vq_lock",
        lo_flags = 40960000, lo_data = 0, lo_witness = 0x0}, sx_lock = 1}},
vdev_cache = {
    vc_offset_tree = {avl_root = 0x0,
        avl_compar = 0xffffffff81681740 <vdev_cache_offset_compare>,
        avl_offset = 24, avl_numnodes = 0, avl_size = 88},
    vc_lastused_tree = {avl_root = 0x0,
        avl_compar = 0xffffffff81681760 <vdev_cache_lastused_compare>,
        avl_offset = 48, avl_numnodes = 0, avl_size = 88}
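
To illustrate why the 116 queued zio-s were stuck behind the 10 pending ones,
here is an approximate sketch of the issue gating in vdev_queue.c (paraphrased
from my reading of the sources of that vintage, not verbatim): once
vq_pending_tree is at zfs_vdev_max_pending, nothing more is issued until a
pending zio completes.

static zio_t *
vdev_queue_io_to_issue(vdev_queue_t *vq, uint64_t pending_limit)
{
	ASSERT(MUTEX_HELD(&vq->vq_lock));

	/* Nothing is issued while the vdev is already at its pending limit. */
	if (avl_numnodes(&vq->vq_pending_tree) >= pending_limit ||
	    avl_numnodes(&vq->vq_deadline_tree) == 0)
		return (NULL);

	/*
	 * ... otherwise pick (and possibly aggregate) the next zio, move it
	 * from the deadline/offset trees to vq_pending_tree and return it ...
	 */
}

If I read vdev_queue_io_done() correctly, it re-runs the above with
pending_limit = zfs_vdev_max_pending when a pending zio completes, so further
progress of the queue depends entirely on completions being processed.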
Apparently, processing of zio-s was lagging so far behind that some of them
triggered the "late arrival" condition. My incomplete understanding shows here -
I am not sure what exactly triggers the condition and what is so special about
it (a paraphrased sketch of that path follows after the stack traces below), but
from the following stack traces we can see that all five zio_write_intr_high
taskqueue threads were executing dmu_sync_late_arrival_done():
0 100432 kernel zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1
dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0
zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46
fork_exit+0x11f fork_trampoline+0xe
0 100433 kernel zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1
dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0
zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46
fork_exit+0x11f fork_trampoline+0xe
0 100434 kernel zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1
dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0
zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46
fork_exit+0x11f fork_trampoline+0xe
0 100435 kernel zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1
dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0
zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46
fork_exit+0x11f fork_trampoline+0xe
0 100436 kernel zio_write_intr_h mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_rele_to_sync+0x36 dmu_tx_commit+0xf1
dmu_sync_late_arrival_done+0x52 zio_done+0x353 zio_execute+0xc3 zio_done+0x3d0
zio_execute+0xc3 taskqueue_run_locked+0x74 taskqueue_thread_loop+0x46
fork_exit+0x11f fork_trampoline+0xe
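
For what it's worth, this is how I read the late-arrival path in the
(illumos-derived) dmu.c sources; paraphrased and abbreviated, so take it with a
grain of salt. dmu_sync() takes this branch when the txg that dirtied the data
is already syncing (or synced), and the resulting write zio gets
dmu_sync_late_arrival_done() as its done callback:

int
dmu_sync(zio_t *pio, uint64_t txg, dmu_sync_cb_t *done, zgd_t *zgd)
{
	...
	if (txg <= spa_last_synced_txg(os->os_spa))
		return (EEXIST);	/* already on disk, nothing to do */

	if (txg <= spa_syncing_txg(os->os_spa)) {
		/*
		 * The txg is already syncing, so we may not touch its
		 * dirty record anymore; write a new block instead.
		 */
		return (dmu_sync_late_arrival(pio, os, done, zgd, &zp, &zb));
	}
	...
}

static int
dmu_sync_late_arrival(zio_t *pio, objset_t *os, dmu_sync_cb_t *done, zgd_t *zgd,
    zio_prop_t *zp, zbookmark_t *zb)
{
	dmu_tx_t *tx = dmu_tx_create(os);

	dmu_tx_hold_space(tx, zgd->zgd_db->db_size);
	if (dmu_tx_assign(tx, TXG_WAIT) != 0) {
		dmu_tx_abort(tx);
		return (EIO);
	}
	...
	zio_nowait(zio_write(pio, os->os_spa, dmu_tx_get_txg(tx), zgd->zgd_bp,
	    zgd->zgd_db->db_data, zgd->zgd_db->db_size, zp,
	    dmu_sync_late_arrival_ready, dmu_sync_late_arrival_done, dsa,
	    ZIO_PRIORITY_SYNC_WRITE, ZIO_FLAG_CANFAIL, zb));

	return (0);
}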
In addition to those stack traces, the taskqueue associated with these threads
has another 9 pending tasks.

As you can see, the "late arrival" code path involves txg_rele_to_sync(), where
an instance of tc_lock is acquired.
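
Roughly (again paraphrased from the sources rather than quoted exactly), the
done callback commits the tx that dmu_sync_late_arrival() assigned, and
dmu_tx_commit() -> txg_rele_to_sync() has to take the per-CPU tc_lock:

static void
dmu_sync_late_arrival_done(zio_t *zio)
{
	dmu_sync_arg_t *dsa = zio->io_private;

	/* ... block pointer / error bookkeeping elided ... */

	dmu_tx_commit(dsa->dsa_tx);		/* -> txg_rele_to_sync() */
	dsa->dsa_done(dsa->dsa_zgd, zio->io_error);
	kmem_free(dsa, sizeof (dmu_sync_arg_t));
}

void
txg_rele_to_sync(txg_handle_t *th)
{
	tx_cpu_t *tc = th->th_cpu;
	int g = th->th_txg & TXG_MASK;

	mutex_enter(&tc->tc_lock);	/* the taskqueue threads block here */
	ASSERT(tc->tc_count[g] != 0);
	if (--tc->tc_count[g] == 0)
		cv_broadcast(&tc->tc_cv[g]);
	mutex_exit(&tc->tc_lock);

	th->th_cpu = NULL;
}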
Further, it looks like tc_lock instances are held by the following threads:
64998 101921 pkg initial thread mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154
dmu_tx_assign+0x370 zfs_freebsd_create+0x310 VOP_CREATE_APV+0x31
vn_open_cred+0x4b7 kern_openat+0x20a amd64_syscall+0x540 Xfast_syscall+0xf7
66152 102491 pkg initial thread mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154
dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 vn_write+0x37e
vn_io_fault+0x90 dofilewrite+0x85 kern_writev+0x6c sys_write+0x64
amd64_syscall+0x540 Xfast_syscall+0xf7
75803 101638 find - mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154
dmu_tx_assign+0x370 zfs_inactive+0x1b7 zfs_freebsd_inactive+0x1a vinactive+0x86
vputx+0x2d8 sys_fchdir+0x356 amd64_syscall+0x540 Xfast_syscall+0xf7
75809 102932 find - mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154
dmu_tx_assign+0x370 zfs_inactive+0x1b7 zfs_freebsd_inactive+0x1a vinactive+0x86
vputx+0x2d8 sys_fchdir+0x356 amd64_syscall+0x540 Xfast_syscall+0xf7
75813 101515 find - mi_switch+0x186 sleepq_wait+0x42
_sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154
dmu_tx_assign+0x370 zfs_inactive+0x1b7 zfs_freebsd_inactive+0x1a vinactive+0x86
vputx+0x2d8 sys_fchdir+0x356 amd64_syscall+0x540 Xfast_syscall+0xf7
77468 101412 update-mime-databas initial thread mi_switch+0x186
sleepq_wait+0x42 _sx_xlock_hard+0x426 _sx_xlock+0x51 txg_delay+0x9d
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154
dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 vn_write+0x37e
vn_io_fault+0x90 dofilewrite+0x85 kern_writev+0x6c sys_write+0x64
amd64_syscall+0x540 Xfast_syscall+0xf7
These threads call txg_delay(), also because of the high load.

In the code we see that dmu_tx_assign first grabs an instance of tc_lock and
then calls dsl_dir_tempreserve_space. Also, we see that txg_delay tries to
acquire tx_sync_lock, and that's where all these threads are blocked.
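
A condensed sketch of that ordering, as I understand it from the sources (the
exact code may differ a bit): txg_hold_open() takes the per-CPU tc_lock and
keeps it held across dmu_tx_try_assign(), i.e. across
dsl_dir_tempreserve_space() -> dsl_pool_tempreserve_space() -> txg_delay(),
until txg_rele_to_quiesce() is called at the end of a successful
dmu_tx_assign():

uint64_t
txg_hold_open(dsl_pool_t *dp, txg_handle_t *th)
{
	tx_state_t *tx = &dp->dp_tx;
	tx_cpu_t *tc = &tx->tx_cpu[CPU_SEQID];
	uint64_t txg;

	mutex_enter(&tc->tc_lock);	/* held until txg_rele_to_quiesce() */
	txg = tx->tx_open_txg;
	tc->tc_count[txg & TXG_MASK]++;

	th->th_cpu = tc;
	th->th_txg = txg;

	return (txg);
}

void
txg_delay(dsl_pool_t *dp, uint64_t txg, int ticks)
{
	tx_state_t *tx = &dp->dp_tx;
	...
	mutex_enter(&tx->tx_sync_lock);	/* the threads above block here */
	/* ... cv_timedwait() on tx_quiesce_more_cv for up to "ticks" ... */
	mutex_exit(&tx->tx_sync_lock);
}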
Then we see that txg_sync_thread holds tx_sync_lock, but is in turn blocked
waiting for its zio:
1552 100544 zfskern txg_thread_enter mi_switch+0x186 sleepq_wait+0x42
_cv_wait+0x112 zio_wait+0x61 dbuf_read+0x5e5 dmu_buf_hold+0xe0 zap_lockdir+0x58
zap_lookup_norm+0x45 zap_lookup+0x2e feature_get_refcount+0x4b
spa_feature_is_active+0x52 dsl_scan_active+0x63 txg_sync_thread+0x20d
fork_exit+0x11f fork_trampoline+0xe
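
The reason it holds tx_sync_lock at that point: txg_thread_enter() acquires
tx_sync_lock and the sync loop only drops it inside txg_thread_wait() or around
the actual spa_sync() call; the dsl_scan_active() check in the loop condition
runs with the lock held and, as the trace shows, it can end up doing synchronous
I/O. A rough sketch (paraphrased, not the literal code):

static void
txg_sync_thread(void *arg)
{
	dsl_pool_t *dp = arg;
	tx_state_t *tx = &dp->dp_tx;
	callb_cpr_t cpr;

	txg_thread_enter(tx, &cpr);	/* mutex_enter(&tx->tx_sync_lock) */

	for (;;) {
		...
		/*
		 * dsl_scan_active() is evaluated with tx_sync_lock held and
		 * may do a zap_lookup() -> dbuf_read() -> zio_wait().
		 */
		while (!dsl_scan_active(dp->dp_scan) && !tx->tx_exiting &&
		    timer > 0 &&
		    tx->tx_synced_txg >= tx->tx_sync_txg_waiting &&
		    tx->tx_quiesced_txg == 0) {
			txg_thread_wait(tx, &cpr, &tx->tx_sync_more_cv, timer);
			...
		}
		...
	}
}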
So, to summarize:

For some completed zio-s, their zio_done routines are blocked in
dmu_sync_late_arrival_done -> txg_rele_to_sync -> tc_lock.
The tc_locks are held by threads in dmu_tx_assign -> .. -> txg_delay, where
txg_delay is blocked on tx_sync_lock.
tx_sync_lock is held by txg_sync_thread, which waits for its zio to be
processed.
That zio is held on the vdev queue and is not getting processed because the
vdev already has too many pending/in-progress zio-s, whose completion
processing is in turn stuck behind the blocked taskqueue threads above. That
closes the cycle.
This theory looks plausible to me, but I'd like to hear what the experts think.
An even more important question is how this situation can be avoided.
--
Andriy Gapon