[patch] zfs livelock and thread priorities
Ben Kelly
ben at wanderview.com
Wed Mar 18 12:43:42 PDT 2009
Hello all,
Recently my home server has been going into a livelock state during my
nightly backups. The server has two zfs pools that look like this:
  NAME                   STATE   READ WRITE CKSUM
  backup                 ONLINE     0     0     0
    mirror               ONLINE     0     0     0
      label/backup0.eli  ONLINE     0     0     0
      label/backup1.eli  ONLINE     0     0     0
      label/backup2.eli  ONLINE     0     0     0

  NAME                   STATE   READ WRITE CKSUM
  tank                   ONLINE     0     0     0
    ad0s3                ONLINE     0     0     0
I've had the tank pool for quite a while, but just recently added the
backup pool. Previously I had been backing up to a smaller disk using
UFS+geli. The server is running a very recent CURRENT.
With this configuration I could get the server into a bad state within
15 minutes by running my rsync backup script. I call it a livelock,
but I'm not really sure if that's the right term. Existing shell
sessions would work for a while, but eventually get wedged. Sometimes
the serial console would stay responsive for a while, but it also
eventually stopped responding to input. The server would continue to
respond to pings and other network traffic. I was not able to log
into new shells.
To try to figure out what was going on, I left top running while I
triggered the problem. The following processes were using all the CPU:
  PID USERNAME  PRI NICE   SIZE    RES STATE    TIME    CPU COMMAND
  210 root       46    -     0K    20K RUN      0:06 16.70% txg_thread_enter
 1207 root       44    0  3280K  1296K RUN      0:05 13.87% syslogd
  889 root       44    0  5436K  2852K RUN      0:04 13.09% apcupsd
   33 root       44    -     0K     8K RUN      0:05 12.99% syncer
 1696 root       44    0  3280K  1344K RUN      0:01 12.79% syslogd
  727 root       44    0  3280K  1496K RUN      0:01 12.60% syslogd
 2065 root       44    0  3280K  1344K RUN      0:01 12.60% syslogd
  209 root       46    -     0K     8K tx->tx   0:02  4.79% txg_thread_enter
 2485 root       44    0  4636K  2664K RUN      0:00  0.10% top
   11 root      171 ki31    0K     8K RUN      0:15  0.00% idle
The exact processes varied, but txg_thread_enter was always in the
mix; the syncer was not always present. I also noticed, lower in the
list, a spa_zio process in the RUN state:
 2458 root       49    -     0K     8K RUN      0:00  0.00% spa_zio
Using the serial console I broke to the debugger and got some stack
traces. All of the spinning processes were in some form of
txg_wait_open() or txg_thread_wait(). For example:
Tracing command syslogd pid 2065 tid 100174 td 0xa3a568c0
sched_switch(a3a568c0,0,104,f744824c,17e,...) at sched_switch+0x1fc
mi_switch(104,0,a3a568c0,a3a568c0,dfc729a0,...) at mi_switch+0x12a
sleepq_switch(a3a568c0,0,9e88745b,247,0,...) at sleepq_switch+0xcb
sleepq_wait(a35181f0,0,a321b2ab,1,0,...) at sleepq_wait+0x39
_cv_wait(a35181f0,a3518180,a321b186,1d3,a35181f0,...) at _cv_wait+0x188
txg_wait_open(a3518000,2542a87,0,0,cc00,...) at txg_wait_open+0xb5
dmu_tx_wait(ab4d7a00,2,0,cbb5,0,...) at dmu_tx_wait+0x138
zfs_freebsd_write(dfc72be0,0,9e88c07e,0,a62f5df4,...) at zfs_freebsd_write+0x3e0
VOP_WRITE_APV(a3221580,dfc72be0,9e88c07e,25a,dfc72c70,...) at VOP_WRITE_APV+0xa6
vn_write(a5689690,bcf74780,a3ec7700,0,a3a568c0,...) at vn_write+0x1b4
dofilewrite(bcf74780,ffffffff,ffffffff,0,a5689690,...) at dofilewrite+0x97
kern_writev(a3a568c0,e,bcf74780,bcf74780,0,...) at kern_writev+0x58
writev(a3a568c0,dfc72cf8,c,9e5f0500,a3a568c0,...) at writev+0x46
syscall(dfc72d38) at syscall+0x325
Xint0x80_syscall() at Xint0x80_syscall+0x20
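For reference, the wait those threads are stuck in is just a condition
variable loop. Here is a rough paraphrase of txg_wait_open() from the
OpenSolaris txg.c, with asserts and debug output trimmed (the FreeBSD
port may differ in detail):

void
txg_wait_open(dsl_pool_t *dp, uint64_t txg)
{
        tx_state_t *tx = &dp->dp_tx;

        mutex_enter(&tx->tx_sync_lock);
        if (txg == 0)
                txg = tx->tx_open_txg + 1;
        if (tx->tx_quiesce_txg_waiting < txg)
                tx->tx_quiesce_txg_waiting = txg;
        /* Sleep until the quiesce/sync threads open the next txg. */
        while (tx->tx_open_txg < txg) {
                cv_broadcast(&tx->tx_quiesce_more_cv);
                cv_wait(&tx->tx_open_more_cv, &tx->tx_sync_lock);
        }
        mutex_exit(&tx->tx_sync_lock);
}

A writer blocked here only makes progress when the txg threads advance
tx_open_txg, so if those threads are starved, every writer in the
system piles up behind them.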
The one exception was a single txg_thread_enter process that appeared
to be waiting on zio:
Tracing command txg_thread_enter pid 2469 tid 100252 td 0xa48f4690
sched_switch(a48f4690,0,104,9cb79a82,139,...) at sched_switch+0x1fc
mi_switch(104,0,a48f4690,a48f4690,dfd7ba80,...) at mi_switch+0x12a
sleepq_switch(a48f4690,0,9e88745b,247,0,...) at sleepq_switch+0xcb
sleepq_wait(adff56ec,0,a321cffb,1,0,...) at sleepq_wait+0x39
_cv_wait(adff56ec,adff56d8,a321cf9b,3fe,3b9aca00,...) at _cv_wait+0x188
zio_wait(adff54a8,3ec8,0,0,9f055a60,...) at zio_wait+0x62
dsl_pool_sync(a2d0a400,3ec8,0,0,0,...) at dsl_pool_sync+0x112
spa_sync(a2ff5000,3ec8,0,0,3e8,...) at spa_sync+0x3b5
txg_sync_thread(a2d0a400,dfd7bd38,0,0,0,...) at txg_sync_thread+0x3b4
fork_exit(a3182790,a2d0a400,dfd7bd38) at fork_exit+0x90
fork_trampoline() at fork_trampoline+0x8
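The zio_wait() at the top of that trace is also a simple cv_wait()
loop, again paraphrased from the OpenSolaris zio.c with asserts
trimmed:

int
zio_wait(zio_t *zio)
{
        int error;

        zio->io_waiter = curthread;
        zio_execute(zio);

        /* Sleep until the spa_zio taskq threads finish the pipeline. */
        mutex_enter(&zio->io_lock);
        while (zio->io_executor != NULL)
                cv_wait(&zio->io_cv, &zio->io_lock);
        mutex_exit(&zio->io_lock);

        error = zio->io_error;
        zio_destroy(zio);
        return (error);
}

In other words, the sync thread cannot close out the txg until the
spa_zio taskq threads run the I/O pipeline to completion, which is
exactly what was not happening below.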
All of the spa_zio processes were blocked waiting on task queues
except for the one spa_zio process marked RUN. It appeared to have
been interrupted:
Tracing command spa_zio pid 2458 tid 100245 td 0xa48d1d20
sched_switch(a48d1d20,0,602,d1125c99,17e,...) at sched_switch+0x1fc
mi_switch(602,0,9e884bce,bc,0,...) at mi_switch+0x12a
critical_exit(9e8f29a8,a48d1d20,9e8f29a8,a2c33580,e,...) at critical_exit+0x92
intr_event_handle(a2c33580,dfd62b54,dfd62b48,c3c85000,e,...) at intr_event_handle+0xaa
intr_execute_handlers(9e8f29a8,dfd62b54,c3c85000,c381b284,dfd62c10,...) at intr_execute_handlers+0x>
atpic_handle_intr(e,dfd62b54) at atpic_handle_intr+0x67
Xatpic_intr14() at Xatpic_intr14+0x21
--- interrupt, eip = 0x9e821e7e, esp = 0xdfd62b94, ebp = 0xdfd62c10 ---
generic_bcopy(a4c4ab68,c1757254,a321c015,134,a4c4ab2c,...) at generic_bcopy+0x1a
vdev_queue_io_done(c1757254,0,a324aad8,a9d0c4c0,a9d0c4c0,...) at vdev_queue_io_done+0xb2
zio_vdev_io_done(c1757254,a31316e2,a324aad8,0,a3214fd8,...) at zio_vdev_io_done+0x72
zio_execute(c1757254,0,a3214fd8,352,6a7334a1,...) at zio_execute+0x62
taskq_thread(a324aab8,dfd62d38,0,0,0,...) at taskq_thread+0x1a0
fork_exit(a3132030,a324aab8,dfd62d38) at fork_exit+0x90
fork_trampoline() at fork_trampoline+0x8
Across all the livelocks I triggered, the spa_zio process was always
marked RUN and appeared to have been interrupted. Also, if you look at
the top output, it is running at a lower priority than the other
spinning threads (PRI 49 versus 44-46; in FreeBSD a larger PRI value
means a weaker priority).
It's not clear to me why the top threads were spinning in the txg
calls. The zfs code, however, does spawn the txg threads at
minclsyspri and the zio threads at maxclsyspri, so the intention seems
to be that the zio threads should be able to preempt the txg
processing. Currently in FreeBSD, however, these priority values are
ignored. In this case, at least, it looks like the priorities may be
required by the zfs design.
To test this theory I applied the attached patch, which sets the zfs
thread priorities. I chose values close to PRIBIO, but such that
minclsyspri still maps to a lower priority than maxclsyspri. The exact
values are somewhat arbitrary. With these changes I have not been able
to trigger the livelock condition again. (It also occurred to me that
this might make the arc_reclaim_thread more responsive to system
memory pressure.)
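To give the flavor of the change without opening the attachment: the
compat thread creation path could apply the Solaris pri argument with
sched_prio() instead of dropping it. The sketch below is illustrative
only; the helper name and exact numbers are mine, not necessarily the
patch's:

#include <sys/param.h>
#include <sys/proc.h>
#include <sys/kthread.h>
#include <sys/priority.h>
#include <sys/sched.h>

/* Solaris class priority range (higher number = more important). */
#define minclsyspri     60
#define maxclsyspri     99

static struct thread *
zfs_compat_thread_create(void (*func)(void *), void *arg, int solaris_pri)
{
        struct thread *td;

        if (kthread_add(func, arg, NULL, &td, 0, 0, "zfs") != 0)
                return (NULL);

        /*
         * FreeBSD kernel priorities run the other way (smaller number
         * wins), so map maxclsyspri threads to PRIBIO and everything
         * weaker a few steps below it.  These constants are
         * hypothetical, like the rest of this sketch.
         */
        thread_lock(td);
        sched_prio(td, solaris_pri >= maxclsyspri ? PRIBIO : PRIBIO + 4);
        thread_unlock(td);

        return (td);
}

With something like that in place the zio threads land at PRIBIO and
the txg threads just below it, matching the preemption relationship
the zfs code seems to expect.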
I should note, however, that the livelock failure is somewhat
timing-dependent. When I enabled WITNESS at one point, it changed the
timing enough that the livelock would only occur after running the
backup for hours instead of minutes. It's possible the thread priority
change has only shifted the timing and not really fixed the problem.
Also, this patch might affect dtrace, but I have not tested that.
Does my analysis look reasonable? Should the spa_zio thread have been
rescheduled even without the priority change? Is there a better way
to attack this problem? Any input would be greatly appreciated.
Thank you.
- Ben
-------------- next part --------------
A non-text attachment was scrubbed...
Name: zfs_thread_priority.diff
Type: application/octet-stream
Size: 2237 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-current/attachments/20090318/60ee1e6a/zfs_thread_priority.obj