I/O to pool appears to be hung, panic!
Ben RUBSON
ben.rubson at gmail.com
Thu Jun 29 11:49:05 UTC 2017
Hello,
One of my servers kernel-panicked last night with the following message:
panic: I/O to pool 'home' appears to be hung on vdev guid 122... at '/dev/label/G23iscsi'.
The pool is laid out like this:
home
  mirror
    label/G13local
    label/G14local
    label/G23iscsi  <-- busy disk
    label/G24iscsi
  mirror
    label/G15local
    label/G16local
    label/G25iscsi
    label/G26iscsi
cache
  label/G10local
  label/G11local
The kernel is complaining about one of the 4 iSCSI disks in the pool.
All 4 of these disks are served by another, identical FreeBSD system (40G "no latency" link).
Here are some numbers regarding this disk, taken from the server hosting the pool
(unfortunately not from the iSCSI target server):
https://s23.postimg.org/zd8jy9xaj/busydisk.png
We can clearly see that the disk suddenly became 100% busy while the CPU stayed almost idle.
There were no error messages at all on either server.
SMART data from the target disk looks healthy:
SMART Health Status: OK
Current Drive Temperature: 32 C
Drive Trip Temperature: 85 C
Manufactured in week 22 of year 2016
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 18
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 2362
Elements in grown defect list: 0
Vendor (Seagate) cache information
Blocks sent to initiator = 5938879802638336
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/     errors  algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       14         0        14     488481      74496.712          0
write:         0        0         0         0     126701      18438.443          0
verify:        0        0         0         0      20107          0.370          0
Non-medium error count: 0
SMART Self-test log
Num  Test             Status     segment  LifeTime  LBA_first_err  [SK ASC ASQ]
     Description                 number   (hours)
# 1  Background long  Completed        -     7943               -  [-   -    -]
# 2  Background long  Completed        -     7607               -  [-   -    -]
# 3  Background long  Completed        -     7271               -  [-   -    -]
The only log I have is the following stack trace, taken from the server console:
panic: I/O to pool 'home' appears to be hung on vdev guid 122... at '/dev/label/G23iscsi'.
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80b240f7 at kdb_backtrace+0x67
#1 0xffffffff80ad9462 at vpanic+0x182
#2 0xffffffff80ad92d3 at panic+0x43
#3 0xffffffff82238fa7 at vdev_deadman+0x127
#4 0xffffffff82238ec0 at vdev_deadman+0x40
#5 0xffffffff82238ec0 at vdev_deadman+0x40
#6 0xffffffff8222d0a6 at spa_deadman+0x86
#7 0xffffffff80af32da at softclock_call_cc+0x18a
#8 0xffffffff80af3854 at softclock+0x94
#9 0xffffffff80a9348f at intr_event_execute_handlers+0x20f
#10 0xffffffff80a936f6 at ithread_loop+0xc6
#11 0xffffffff80a900d5 at fork_exit+0x85
#12 0xffffffff80f846fe at fork_trampoline+0xe
Uptime: 92d2h47m6s
I would have liked to make a crash dump available.
However, despite my (correct?) configuration, the server did not dump.
(A manually triggered panic via "sysctl debug.kdb.panic=1" does produce a dump, though.)
# grep ^dump /boot/loader.conf /etc/rc.conf
/boot/loader.conf:dumpdev="/dev/mirror/swap"
/etc/rc.conf:dumpdev="AUTO"
# gmirror list swap
Components: 2
Balance: prefer
Providers:
1. Name: mirror/swap
Mediasize: 8589934080 (8.0G)
Consumers:
1. Name: label/swap1
State: ACTIVE
Priority: 0
2. Name: label/swap2
State: ACTIVE
Priority: 1
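For what it's worth, the dump device the kernel actually ends up with can be checked at runtime. A minimal sketch of what one might run to verify it (command and sysctl names as I understand them on FreeBSD; worth double-checking against your release's man pages):

```shell
# Show which device the kernel will actually dump to;
# an empty value here would explain the missing crash dump.
sysctl kern.shutdown.dumpdevname

# Re-arm the dump device by hand to confirm the kernel accepts it:
dumpon /dev/mirror/swap

# After a reboot, savecore(8) is what extracts the dump from swap;
# -C only checks whether a dump is present, without clearing it.
savecore -C /dev/mirror/swap
```

If dumpon(8) rejects /dev/mirror/swap, trying one of the underlying label/swapN providers directly could help narrow down whether the gmirror layer is the problem.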
I use the default GENERIC kernel, with a rebuilt zfs module:
# uname -v
FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017 root at amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC
I use the following iSCSI configuration, which disconnects the disks "as soon as" they become unavailable:
kern.iscsi.ping_timeout=5
kern.iscsi.fail_on_disconnection=1
kern.iscsi.iscsid_timeout=5
So I think the disk was at least correctly reachable during those 20 busy minutes.
So, any idea why I could have hit this issue?
I would have expected ZFS to take the busy device offline instead of raising a panic.
Perhaps it is already possible to make ZFS behave like this?
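The panic comes from the ZFS "deadman" timer seen in the backtrace (vdev_deadman/spa_deadman), and FreeBSD exposes it through sysctl tunables. A hedged sketch of the knobs one could inspect and adjust (names as I understand them on FreeBSD 11; the values below are illustrative, not recommendations):

```shell
# Inspect the current deadman settings:
sysctl vfs.zfs.deadman_enabled vfs.zfs.deadman_synctime_ms

# In /boot/loader.conf -- disable the deadman panic entirely
# (note: the underlying I/O hang itself would remain):
vfs.zfs.deadman_enabled="0"

# ...or instead raise the threshold (milliseconds of outstanding I/O
# before a vdev is declared hung); example value only:
vfs.zfs.deadman_synctime_ms="3600000"
```

Disabling the panic does not make ZFS offline the vdev, it only stops the deadman from firing, so the hung I/O would still need to be resolved (e.g. by the iSCSI timeouts).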
Thank you very much for your help & support!
Best regards,
Ben