zio->io_cv deadlock
Kevin Day
toasty@dragondata.com
Tue Mar 17 16:04:59 PDT 2009
I've got a test environment where I can make this deadlock happen
within 3-4 hours of use now. This is from -CURRENT as of yesterday.
This server isn't using ZFS on root, so when it hangs I'm not
quite as badly off. Here's the ps output:
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND UID PPID CPU PRI NI MWCHAN SL RE PAGEIN LIM TSIZ
root 477 0.0 0.0 2180 656 ?? Is 9:34AM 0:00.00 /sbin/devd 0 1 0 44 0 select 127 127 0 - 336
root 593 0.0 0.0 5780 1444 ?? Is 9:35AM 0:00.05 /usr/sbin/syslog 0 1 0 44 0 select 21 127 0 - 36
root 811 0.0 0.0 24872 4172 ?? Is 9:35AM 0:00.00 /usr/sbin/sshd 0 1 0 44 0 select 127 127 0 - 220
root 837 0.0 0.0 6836 1532 ?? Is 9:35AM 0:00.07 /usr/sbin/cron - 0 1 0 44 0 nanslp 27 127 0 - 36
root 974 0.0 0.0 8232 2484 ?? Ss 9:38AM 1:19.01 screen 0 1 0 44 0 select 0 127 6 - 292
root 1612 0.0 0.0 38852 7856 ?? Ss 2:49PM 0:00.16 sshd: root@pts/0 0 811 0 44 0 select 0 127 0 - 220
root 1617 0.0 0.0 10188 2792 0 Is 2:49PM 0:00.01 -csh (csh) 0 1612 0 47 0 pause 127 127 0 - 304
root 1622 0.0 0.0 8232 2132 0 S+ 2:49PM 0:00.03 screen -x 0 1617 0 44 0 pause 1 127 0 - 292
root 975 0.0 0.0 10188 2704 1 Is 9:38AM 0:00.01 /bin/csh 0 974 0 49 0 pause 127 127 0 - 304
root 980 0.0 1.1 794196 766248 1 D+ 9:38AM 66:50.20 rsync -ravH root 0 975 0 69 0 zio->i 127 127 1 - 344
root 982 0.0 0.2 181844 142444 1 I+ 9:38AM 69:32.28 rsync -ravH root 0 980 0 44 0 select 57 127 0 - 344
root 983 0.0 0.0 10188 2788 2 Ss 9:38AM 0:00.01 /bin/csh 0 974 0 44 0 pause 0 127 0 - 304
root 1 0.0 0.0 2176 596 ?? ILs 9:34AM 0:00.01 /sbin/init -- 0 0 0 44 0 wait 127 127 8 - 604
root 827 0.0 0.0 10796 3800 ?? Ss 9:35AM 0:00.50 sendmail: accept 0 1 0 44 0 select 3 127 1 - 628
smmsp 831 0.0 0.0 10796 3844 ?? Is 9:35AM 0:00.01 sendmail: Queue 25 1 0 44 0 pause 127 127 0 - 628
root 887 0.0 0.0 5776 1224 v0 Is+ 9:35AM 0:00.01 /usr/libexec/get 0 1 0 76 0 ttyin 127 127 0 - 20
root 888 0.0 0.0 5776 1224 v1 Is+ 9:35AM 0:00.00 /usr/libexec/get 0 1 0 76 0 ttyin 127 127 0 - 20
root 889 0.0 0.0 5776 1224 v2 Is+ 9:35AM 0:00.01 /usr/libexec/get 0 1 0 76 0 ttyin 127 127 0 - 20
root 890 0.0 0.0 5776 1224 v3 Is+ 9:35AM 0:00.00 /usr/libexec/get 0 1 0 76 0 ttyin 127 127 2 - 20
root 891 0.0 0.0 5776 1224 v4 Is+ 9:35AM 0:00.01 /usr/libexec/get 0 1 0 76 0 ttyin 127 127 0 - 20
root 892 0.0 0.0 5776 1224 v5 Is+ 9:35AM 0:00.01 /usr/libexec/get 0 1 0 76 0 ttyin 127 127 0 - 20
root 893 0.0 0.0 5776 1224 v6 Is+ 9:35AM 0:00.01 /usr/libexec/get 0 1 0 76 0 ttyin 127 127 0 - 20
root 894 0.0 0.0 5776 1224 v7 Is+ 9:35AM 0:00.01 /usr/libexec/get 0 1 0 76 0 ttyin 127 127 0 - 20
root 981 0.0 0.0 27668 11220 1 S+ 9:38AM 172:25.63 ssh -l root 216. 0 980 0 44 0 select 1 127 0 - 120
root 2112 0.0 0.0 6892 1416 2 R+ 6:48PM 0:00.00 ps auxwwlv 0 983 0 44 0 - 127 0 0 - 28
root 0 0.0 0.0 0 128 ?? DLs 9:34AM 46:56.19 [kernel] 0 0 0 -68 0 - 127 127 0 - 0
root 2 0.0 0.0 0 16 ?? DL 9:34AM 0:01.23 [g_event] 0 0 0 -8 0 - 0 127 0 - 0
root 3 0.0 0.0 0 16 ?? DL 9:34AM 0:26.74 [g_up] 0 0 0 -8 0 - 0 127 0 - 0
root 4 0.0 0.0 0 16 ?? DL 9:34AM 5:51.49 [g_down] 0 0 0 -8 0 - 0 127 0 - 0
root 5 0.0 0.0 0 16 ?? DL 9:34AM 0:00.00 [xpt_thrd] 0 0 0 -16 0 ccb_sc 127 127 0 - 0
root 6 0.0 0.0 0 16 ?? DL 9:34AM 0:00.14 [fdc0] 0 0 0 -16 0 - 0 127 0 - 0
root 7 0.0 0.0 0 24 ?? DL 9:34AM 0:00.00 [sctp_iterator] 0 0 0 -16 0 waitin 127 127 0 - 0
root 8 0.0 0.0 0 16 ?? DL 9:34AM 0:00.03 [pagedaemon] 0 0 0 -16 0 psleep 4 127 0 - 0
root 9 0.0 0.0 0 16 ?? DL 9:34AM 0:00.00 [vmdaemon] 0 0 0 -16 0 psleep 127 127 0 - 0
root 10 0.0 0.0 0 16 ?? DL 9:34AM 0:00.00 [audit] 0 0 0 -16 0 audit_ 127 127 0 - 0
root 11 800.0 0.0 0 128 ?? RL 9:34AM 3958:43.82 [idle] 0 0 0 171 0 - 127 127 0 - 0
root 12 0.0 0.0 0 400 ?? WL 9:34AM 6:03.01 [intr] 0 0 0 -60 0 - 127 127 0 - 0
root 13 0.0 0.0 0 16 ?? DL 9:34AM 0:44.77 [yarrow] 0 0 0 44 0 - 0 127 0 - 0
root 14 0.0 0.0 0 16 ?? DL 9:34AM 0:00.00 [usbus0] 0 0 0 -64 0 wmsg 127 127 0 - 0
root 15 0.0 0.0 0 16 ?? DL 9:34AM 0:00.92 [usbus0] 0 0 0 -68 0 wmsg 2 127 0 - 0
root 16 0.0 0.0 0 16 ?? DL 9:34AM 0:00.56 [usbus0] 0 0 0 -68 0 wmsg 2 127 0 - 0
root 17 0.0 0.0 0 16 ?? DL 9:34AM 0:00.55 [usbus0] 0 0 0 -64 0 wmsg 2 127 0 - 0
root 18 0.0 0.0 0 16 ?? DL 9:34AM 0:00.00 [usbus1] 0 0 0 -64 0 wmsg 127 127 0 - 0
root 19 0.0 0.0 0 16 ?? DL 9:34AM 0:00.92 [usbus1] 0 0 0 -68 0 wmsg 2 127 0 - 0
root 20 0.0 0.0 0 16 ?? DL 9:34AM 0:00.52 [usbus1] 0 0 0 -68 0 wmsg 2 127 0 - 0
root 21 0.0 0.0 0 16 ?? DL 9:34AM 0:00.60 [usbus1] 0 0 0 -64 0 wmsg 2 127 0 - 0
root 22 0.0 0.0 0 16 ?? DL 9:34AM 0:00.00 [pagezero] 0 0 0 76 0 pgzero 127 127 0 - 0
root 23 0.0 0.0 0 16 ?? DL 9:34AM 0:00.14 [bufdaemon] 0 0 0 -16 0 psleep 0 127 0 - 0
root 24 0.0 0.0 0 16 ?? DL 9:34AM 0:24.78 [syncer] 0 0 0 44 0 zfsvfs 127 127 0 - 0
root 25 0.0 0.0 0 16 ?? DL 9:34AM 0:02.17 [vnlru] 0 0 0 44 0 vlruwt 0 127 0 - 0
root 26 0.0 0.0 0 16 ?? DL 9:34AM 0:00.21 [softdepflush] 0 0 0 -16 0 sdflus 0 127 0 - 0
root 88 0.0 0.0 0 16 ?? DL 9:34AM 0:00.00 [system_taskq] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 89 0.0 0.0 0 16 ?? DL 9:34AM 0:00.00 [system_taskq] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 90 0.0 0.0 0 16 ?? DL 9:34AM 0:00.00 [system_taskq] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 91 0.0 0.0 0 16 ?? DL 9:34AM 0:00.00 [system_taskq] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 92 0.0 0.0 0 16 ?? DL 9:34AM 0:00.00 [system_taskq] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 93 0.0 0.0 0 16 ?? DL 9:34AM 0:00.00 [system_taskq] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 94 0.0 0.0 0 16 ?? DL 9:34AM 0:00.00 [system_taskq] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 95 0.0 0.0 0 16 ?? DL 9:34AM 0:00.00 [system_taskq] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 104 0.0 0.0 0 16 ?? DL 9:34AM 5:11.71 [arc_reclaim_thr 0 0 0 44 0 arc_re 0 127 0 - 0
root 105 0.0 0.0 0 16 ?? DL 9:34AM 0:00.15 [l2arc_feed_thre 0 0 0 -16 0 l2arc_ 0 127 0 - 0
root 936 0.0 0.0 0 16 ?? DL 9:37AM 0:00.55 [spa_zio] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 937 0.0 0.0 0 16 ?? DL 9:37AM 0:00.17 [spa_zio] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 938 0.0 0.0 0 16 ?? DL 9:37AM 0:00.00 [spa_zio] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 939 0.0 0.0 0 16 ?? DL 9:37AM 0:08.89 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 940 0.0 0.0 0 16 ?? DL 9:37AM 0:31.71 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 941 0.0 0.0 0 16 ?? DL 9:37AM 0:08.91 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 942 0.0 0.0 0 16 ?? DL 9:37AM 0:08.76 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 943 0.0 0.0 0 16 ?? DL 9:37AM 0:08.79 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 944 0.0 0.0 0 16 ?? DL 9:37AM 0:09.21 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 945 0.0 0.0 0 16 ?? DL 9:37AM 0:08.83 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 946 0.0 0.0 0 16 ?? DL 9:37AM 0:08.88 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 947 0.0 0.0 0 16 ?? DL 9:37AM 1:34.28 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 948 0.0 0.0 0 16 ?? DL 9:37AM 1:34.28 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 949 0.0 0.0 0 16 ?? DL 9:37AM 1:34.25 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 950 0.0 0.0 0 16 ?? DL 9:37AM 1:34.30 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 951 0.0 0.0 0 16 ?? DL 9:37AM 1:34.74 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 952 0.0 0.0 0 16 ?? DL 9:37AM 1:34.38 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 953 0.0 0.0 0 16 ?? DL 9:37AM 1:34.36 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 954 0.0 0.0 0 16 ?? DL 9:37AM 1:34.14 [spa_zio] 0 0 0 44 0 tq->tq 127 127 0 - 0
root 955 0.0 0.0 0 16 ?? DL 9:37AM 2:46.35 [spa_zio] 0 0 0 44 0 zfsvfs 127 127 0 - 0
root 956 0.0 0.0 0 16 ?? DL 9:37AM 0:00.00 [spa_zio] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 957 0.0 0.0 0 16 ?? DL 9:37AM 0:00.00 [spa_zio] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 958 0.0 0.0 0 16 ?? DL 9:37AM 0:00.00 [spa_zio] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 959 0.0 0.0 0 16 ?? DL 9:37AM 0:00.00 [spa_zio] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 960 0.0 0.0 0 16 ?? DL 9:37AM 0:00.00 [spa_zio] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 961 0.0 0.0 0 16 ?? DL 9:37AM 0:00.11 [spa_zio] 0 0 0 -16 0 tq->tq 127 127 0 - 0
root 962 0.0 0.0 0 16 ?? DL 9:37AM 0:24.24 [vdev:worker da1 0 0 0 44 0 vgeom: 127 127 0 - 0
root 963 0.0 0.0 0 16 ?? DL 9:37AM 0:00.21 [txg_thread_ente 0 0 0 -16 0 tx->tx 127 127 0 - 0
root 964 0.0 0.0 0 28 ?? DL 9:37AM 1:19.07 [txg_thread_ente 0 0 0 44 0 tx->tx 127 127 0 - 0
root 965 0.0 0.0 0 16 ?? DL 9:37AM 0:12.34 [zil_clean] 0 0 0 -16 0 tq->tq 127 127 0 - 0
Note that the syncer and one spa_zio thread are stuck in zfsvfs, and my
rsync process is frozen in zio->io_cv.
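Next time it hangs, something like this should pull out just the
threads sitting on ZFS-related wait channels instead of wading through
the full ps listing (a sketch: mwchan is the ps(1) wait-channel
keyword, and the egrep pattern is just my guess at the interesting
channels):

cs03# ps -ax -o pid,state,mwchan,command | egrep 'zio|zfs|tx->|tq->'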
zpool makes everything look okay:
cs03# zpool iostat -v
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
z            549G  13.0T     14    133   709K  12.9M
  da1        549G  13.0T     14    133   709K  12.9M
----------  -----  -----  -----  -----  -----  -----
cs03# zpool status -v
  pool: z
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        z           ONLINE       0     0     0
          da1       ONLINE       0     0     0

errors: No known data errors
ZFS-related sysctls:
vfs.zfs.arc_meta_limit: 26214400
vfs.zfs.arc_meta_used: 70401408
vfs.zfs.mdcomp_disable: 0
vfs.zfs.arc_min: 140928537
vfs.zfs.arc_max: 104857600
vfs.zfs.zfetch.array_rd_sz: 1048576
vfs.zfs.zfetch.block_cap: 256
vfs.zfs.zfetch.min_sec_reap: 2
vfs.zfs.zfetch.max_streams: 8
vfs.zfs.prefetch_disable: 0
vfs.zfs.recover: 0
vfs.zfs.txg.synctime: 5
vfs.zfs.txg.timeout: 30
vfs.zfs.scrub_limit: 10
vfs.zfs.vdev.cache.bshift: 16
vfs.zfs.vdev.cache.size: 10485760
vfs.zfs.vdev.cache.max: 16384
vfs.zfs.vdev.aggregation_limit: 131072
vfs.zfs.vdev.ramp_rate: 2
vfs.zfs.vdev.time_shift: 6
vfs.zfs.vdev.min_pending: 4
vfs.zfs.vdev.max_pending: 35
vfs.zfs.cache_flush_disable: 0
vfs.zfs.zil_disable: 0
vfs.zfs.version.zpl: 3
vfs.zfs.version.vdev_boot: 1
vfs.zfs.version.spa: 13
vfs.zfs.version.dmu_backup_stream: 1
vfs.zfs.version.dmu_backup_header: 2
vfs.zfs.version.acl: 1
vfs.zfs.debug: 0
vfs.zfs.super_owner: 0
kstat.zfs.misc.arcstats.hits: 57819155
kstat.zfs.misc.arcstats.misses: 5590858
kstat.zfs.misc.arcstats.demand_data_hits: 63981
kstat.zfs.misc.arcstats.demand_data_misses: 635
kstat.zfs.misc.arcstats.demand_metadata_hits: 47525277
kstat.zfs.misc.arcstats.demand_metadata_misses: 4114419
kstat.zfs.misc.arcstats.prefetch_data_hits: 19665
kstat.zfs.misc.arcstats.prefetch_data_misses: 809
kstat.zfs.misc.arcstats.prefetch_metadata_hits: 10210232
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 1474995
kstat.zfs.misc.arcstats.mru_hits: 11018143
kstat.zfs.misc.arcstats.mru_ghost_hits: 937056
kstat.zfs.misc.arcstats.mfu_hits: 37133185
kstat.zfs.misc.arcstats.mfu_ghost_hits: 3428689
kstat.zfs.misc.arcstats.deleted: 7789751
kstat.zfs.misc.arcstats.recycle_miss: 6302888
kstat.zfs.misc.arcstats.mutex_miss: 9406
kstat.zfs.misc.arcstats.evict_skip: 418277598
kstat.zfs.misc.arcstats.hash_elements: 9872
kstat.zfs.misc.arcstats.hash_elements_max: 124517
kstat.zfs.misc.arcstats.hash_collisions: 112246
kstat.zfs.misc.arcstats.hash_chains: 96
kstat.zfs.misc.arcstats.hash_chain_max: 3
kstat.zfs.misc.arcstats.p: 92080153
kstat.zfs.misc.arcstats.c: 140928537
kstat.zfs.misc.arcstats.c_min: 140928537
kstat.zfs.misc.arcstats.c_max: 104857600
kstat.zfs.misc.arcstats.size: 141645696
kstat.zfs.misc.arcstats.hdr_size: 2617264
kstat.zfs.misc.arcstats.l2_hits: 0
kstat.zfs.misc.arcstats.l2_misses: 0
kstat.zfs.misc.arcstats.l2_feeds: 0
kstat.zfs.misc.arcstats.l2_rw_clash: 0
kstat.zfs.misc.arcstats.l2_writes_sent: 0
kstat.zfs.misc.arcstats.l2_writes_done: 0
kstat.zfs.misc.arcstats.l2_writes_error: 0
kstat.zfs.misc.arcstats.l2_writes_hdr_miss: 0
kstat.zfs.misc.arcstats.l2_evict_lock_retry: 0
kstat.zfs.misc.arcstats.l2_evict_reading: 0
kstat.zfs.misc.arcstats.l2_free_on_write: 0
kstat.zfs.misc.arcstats.l2_abort_lowmem: 0
kstat.zfs.misc.arcstats.l2_cksum_bad: 0
kstat.zfs.misc.arcstats.l2_io_error: 0
kstat.zfs.misc.arcstats.l2_size: 0
kstat.zfs.misc.arcstats.l2_hdr_size: 0
kstat.zfs.misc.arcstats.memory_throttle_count: 0
kstat.zfs.misc.vdev_cache_stats.delegations: 671323
kstat.zfs.misc.vdev_cache_stats.hits: 4416731
kstat.zfs.misc.vdev_cache_stats.misses: 349266
There does seem to be something stuck in the syncer:

vfs.worklist_len: 11

(that value never goes down or changes at all), but that doesn't tell
me much.
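A crude poll is enough to confirm it's wedged; something like this
(Bourne syntax, so run it under sh):

cs03# sh -c 'while :; do sysctl vfs.worklist_len; sleep 10; done'

It prints the same vfs.worklist_len: 11 every ten seconds for as long
as the machine stays hung.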
Next time this happens, is there anything else I should look at?
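My tentative plan, assuming ddb doesn't hang on me again like it did
before, is to break into it and run:

db> ps
db> alltrace
db> show lockedvnods

That should give actual kernel stack traces instead of just
wait-channel names, but I'm open to other suggestions.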
-- Kevin
On Feb 8, 2009, at 10:59 PM, Kevin Day wrote:
>
> I'm playing with a -CURRENT install from a couple of weeks ago.
> Everything seems okay for a few days, then eventually every process
> ends up stuck in zio->io_cv. If I go to the console, it's responsive
> until I try logging in, then login is stuck in zio->io_cv as well.
> Ctrl-Alt-Esc drops me into ddb, but then ddb hangs instantly.
>
> Nothing shows up on the console or in syslog before it hangs.
>
> Anyone seen anything similar?
>
> -- Kevin
>
>
> Possibly relevant info:
>
> 8-core Opteron
> 64GB RAM
>
> da1 at twa0 bus 0 target 0 lun 1
> da1: <AMCC 9650SE-12M DISK 3.08> Fixed Direct Access SCSI-5 device
> da1: 100.000MB/s transfers
> da1: 4678158MB (9580867585 512 byte sectors: 255H 63S/T 596381C)
>
> server5# zpool list
> NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
> z     4.44T  1.19T  3.25T    26%  ONLINE  -
>
> server5# zpool status -v
>   pool: z
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         z           ONLINE       0     0     0
>           da1       ONLINE       0     0     0
>
> errors: No known data errors
>
> server5# cat /boot/loader.conf
> vm.kmem_size_max="2048M"
> vm.kmem_size="2048M"
> vfs.zfs.arc_max="100M"
> zfs_load="YES"
> vfs.root.mountfrom="zfs:z"
>
>
> (tried lowering arc_max, didn't seem to help)