ZFS/istgt lockup
John
jwd at FreeBSD.org
Fri Oct 5 13:13:56 UTC 2012
Copying this reply to -scsi. Not sure if it's more of a ZFS issue
or an istgt one... more below...
----- John's Original Message -----
> Hi Folks,
>
> I've been chasing a problem that I'm not quite sure originates
> on the BSD side, but the system shouldn't lock up and require a power
> cycle to reboot.
>
> The config: I have a FreeBSD system running 9.1-RC handing out a
> 36TB volume to a Linux RHEL 6.1 system. The RHEL 6.1 system is
> doing heavy I/O & number crunching. Many hours into the job stream,
> the kernel becomes quite unhappy:
>
> kernel: __ratelimit: 27665 callbacks suppressed
> kernel: swapper: page allocation failure. order:1, mode:0x4020
> kernel: Pid: 0, comm: swapper Tainted: G ---------------- T 2.6.32-131.0.15.el6.x86_64 #1
> kernel: Call Trace:
> kernel: <IRQ> [<ffffffff811201e6>] ? __alloc_pages_nodemask+0x716/0x8b0
> kernel: [<ffffffff8115464a>] ? alloc_pages_current+0xaa/0x110
> kernel: [<ffffffffa0115cd5>] ? refill_fl+0x3d5/0x4a0 [cxgb3]
> kernel: [<ffffffff814200bd>] ? napi_frags_finish+0x6d/0xb0
> kernel: [<ffffffffa0116d33>] ? process_responses+0x653/0x1450 [cxgb3]
> kernel: [<ffffffff810e7f62>] ? ring_buffer_lock_reserve+0xa2/0x160
> kernel: [<ffffffffa0117b6c>] ? napi_rx_handler+0x3c/0x90 [cxgb3]
> kernel: [<ffffffff814225a3>] ? net_rx_action+0x103/0x2f0
> kernel: [<ffffffff8106f717>] ? __do_softirq+0xb7/0x1e0
> kernel: [<ffffffff810d69d6>] ? handle_IRQ_event+0xf6/0x170
> kernel: [<ffffffff8100c2cc>] ? call_softirq+0x1c/0x30
> kernel: [<ffffffff8100df05>] ? do_softirq+0x65/0xa0
> kernel: [<ffffffff8106f505>] ? irq_exit+0x85/0x90
> kernel: [<ffffffff814e3505>] ? do_IRQ+0x75/0xf0
> kernel: [<ffffffff8100bad3>] ? ret_from_intr+0x0/0x11
> kernel: <EOI> [<ffffffff810362ab>] ? native_safe_halt+0xb/0x10
> kernel: [<ffffffff81100826>] ? ftrace_raw_event_power_start+0x16/0x20
> kernel: [<ffffffff810142fd>] ? default_idle+0x4d/0xb0
> kernel: [<ffffffff81009e96>] ? cpu_idle+0xb6/0x110
> kernel: [<ffffffff814d493c>] ? start_secondary+0x202/0x245
>
> On the FreeBSD side, the istgt daemon appears to see that one of the
> connection threads is down and attempts to restart it. At this point,
> the istgt process size starts to grow.
>
> USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
> root 1224 0.0 0.4 8041092 405472 v0- DL 4:59PM 15:28.72 /usr/local/bin/istgt
> root 1224 0.0 0.4 8041092 405472 v0- IL 4:59PM 63:18.34 /usr/local/bin/istgt
> root 1224 0.0 0.4 8041092 405472 v0- IL 4:59PM 61:13.80 /usr/local/bin/istgt
> root 1224 0.0 0.4 8041092 405472 v0- IL 4:59PM 0:00.00 /usr/local/bin/istgt
>
> There are more than 1400 threads reported.
>
> Also of interest, netstat shows:
>
> tcp4 0 0 10.59.6.12.5010 10.59.25.113.54076 CLOSE_WAIT
> tcp4 0 0 10.60.6.12.5010 10.60.25.113.33345 CLOSED
> tcp4 0 0 10.59.6.12.5010 10.59.25.113.54074 CLOSE_WAIT
> tcp4 0 0 10.60.6.12.5010 10.60.25.113.33343 CLOSED
> tcp4 0 0 10.59.6.12.5010 10.59.25.113.54072 CLOSE_WAIT
> tcp4 0 0 10.60.6.12.5010 10.60.25.113.33341 CLOSED
> tcp4 0 0 10.60.6.12.5010 10.60.25.113.33339 CLOSED
> tcp4 0 0 10.59.6.12.5010 10.59.25.113.54070 CLOSE_WAIT
> tcp4 0 0 10.60.6.12.5010 10.60.25.113.53806 CLOSE_WAIT
>
> There are more than 1400 sockets in the CLOSE* state. What would
> prevent these sockets from cleaning up in a reasonable timeframe?
> Both sides of the mpio connection appear to be attempting reconnects.
>
> An attempt to gracefully kill istgt fails. A kill -9 does not clean
> things up either.
>
> A procstat -kk 1224 after the kill -9 shows:
>
> PID TID COMM TDNAME KSTACK
> 1224 100959 istgt sigthread mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 zio_wait+0x61 dbuf_read+0x5e5 dmu_buf_hold+0xe0 zap_lockdir+0x58 zap_lookup_norm+0x45 zap_lookup+0x2e zfs_dirent_lock+0x4ff zfs_dirlook+0x69 zfs_lookup+0x26b zfs_freebsd_lookup+0x81 vfs_cache_lookup+0xf8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 vn_open_cred+0x3cb
> 1224 100960 istgt luthread #1 mi_switch+0x186 sleepq_wait+0x42 _sleep+0x376 bwait+0x64 physio+0x246 devfs_write_f+0x8d dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
> 1224 103533 istgt sendthread #1493 mi_switch+0x186 thread_suspend_switch+0xc9 thread_single+0x1b2 exit1+0x72 sigexit+0x7c postsig+0x3a4 ast+0x26c doreti_ast+0x1f
>
>
> An attempt to forcefully export the pool hangs also. A procstat
> shows:
>
> PID TID COMM TDNAME KSTACK
> 4427 100991 zpool - mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 dbuf_read+0x30b dmu_buf_hold+0xe0 zap_lockdir+0x58 zap_lookup_norm+0x45 zap_lookup+0x2e dsl_dir_open_spa+0x121 dsl_dataset_hold+0x3b dmu_objset_hold+0x23 zfs_ioc_objset_stats+0x2b zfsdev_ioctl+0xe6 devfs_ioctl_f+0x7b kern_ioctl+0x115 sys_ioctl+0xfd amd64_syscall+0x546 Xfast_syscall+0xf7
>
>
>
> If anyone has any ideas, please let me know. I know I've left a lot
> of config information out in an attempt to keep the email shorter.
>
> Random comments:
>
> This happens with or without multipathd enabled on the Linux client.
>
> If I catch the istgt daemon while it's creating threads and kill it,
> the system will not lock up.
>
> I see no errors in the istgt log file. One of my next things to try
> is to enable all debugging... The amount of debugging data captured
> is quite large :-(
>
> I am using Chelsio 10G cards on both the client and the server, which
> have been rock solid in all other cases.
>
> Thoughts welcome!
>
> Thanks,
> John
Hi Folks,
I've managed to replicate this problem once. Basically, it appears
the Linux client sends an abort, which is processed here:
istgt_iscsi_op_task:

    switch (function) {
    case ISCSI_TASK_FUNC_ABORT_TASK:
        ISTGT_LOG("ABORT_TASK\n");
        SESS_MTX_LOCK(conn);
        rc = istgt_lu_clear_task_ITLQ(conn, conn->sess->lu, lun,
            ref_CmdSN);
        SESS_MTX_UNLOCK(conn);
        if (rc < 0) {
            ISTGT_ERRLOG("LU reset failed\n");
        }
        istgt_clear_transfer_task(conn, ref_CmdSN);
        break;
At this point, the queue depth is 62, and there appears to be one thread
in the ZFS code performing a read.
No other processing occurs after this point. A "zfs list" hangs, the
pool cannot be exported, and the istgt daemon cannot be fully killed. A
reboot requires a power reset (i.e., reboot hangs after flushing buffers).
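My working theory, and it is only a theory at this point, is that the
abort path has to rendezvous with the LU worker thread so the outstanding
task can be cleared, while that worker is parked in zio_wait() inside ZFS
and never comes back, so nothing behind the abort ever runs. The sketch
below is a plain userland illustration of that pattern, not istgt's
actual code; the thread names and the condition-variable handshake are
made up for the example:

/*
 * Illustration only -- this is NOT istgt code.  It shows the generic
 * pattern I suspect: an "abort" that must rendezvous with a worker
 * thread hangs forever because the worker is itself stuck waiting on
 * an I/O that never completes (zio_wait() in our case, simulated here
 * with a condition variable that is never signalled).
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lu_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t io_done = PTHREAD_COND_INITIALIZER;   /* never signalled */
static pthread_cond_t task_cleared = PTHREAD_COND_INITIALIZER;
static int task_outstanding = 1;

/* LU worker: owns one outstanding task and blocks waiting for "I/O". */
static void *
lu_worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lu_mtx);
    /* Stand-in for zio_wait(): this I/O never completes, so the worker
     * never gets to clear the task and signal the abort path. */
    for (;;)
        pthread_cond_wait(&io_done, &lu_mtx);
    /* not reached: on completion it would set task_outstanding = 0 and
     * signal task_cleared. */
}

int
main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, lu_worker, NULL);
    sleep(1);

    /* "ABORT_TASK": wait for the worker to drop the task.  The worker
     * never wakes up, so this thread wedges here and every request
     * queued behind the abort stalls with it. */
    pthread_mutex_lock(&lu_mtx);
    printf("abort: waiting for the worker to clear the task...\n");
    while (task_outstanding)
        pthread_cond_wait(&task_cleared, &lu_mtx);
    pthread_mutex_unlock(&lu_mtx);

    printf("abort: done (never reached)\n");
    return (0);
}

If that is what is happening, it would also explain why kill -9 doesn't
help: the stuck thread is sleeping inside the kernel, and the process
can't finish exiting until it returns (which matches the exit1 /
thread_single frames in the procstat output above).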
The only thing that does appear to be happening is a growing list
of connections:
tcp4 0 0 10.60.6.12.5010 10.60.25.113.56577 CLOSE_WAIT
tcp4 0 0 10.60.6.12.5010 10.60.25.113.56576 CLOSE_WAIT
tcp4 0 0 10.60.6.12.5010 10.60.25.113.56575 CLOSE_WAIT
tcp4 0 0 10.60.6.12.5010 10.60.25.113.56574 CLOSE_WAIT
tcp4 0 0 10.60.6.12.5010 10.60.25.113.56573 CLOSE_WAIT
tcp4 0 0 10.60.6.12.5010 10.60.25.113.56572 CLOSE_WAIT
tcp4 0 0 10.60.6.12.5010 10.60.25.113.56571 CLOSE_WAIT
tcp4 0 0 10.60.6.12.5010 10.60.25.113.56570 CLOSE_WAIT
tcp4 0 0 10.60.6.12.5010 10.60.25.113.56569 CLOSE_WAIT
Currently there are about 390, and the number is slowly going up. This
implies to me that there is some sort of reconnect occurring that is failing.
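As an aside on the CLOSE_WAIT state itself (this also answers the
question in my original mail): CLOSE_WAIT means the peer has already sent
its FIN and the local kernel is simply waiting for istgt to close() its
half of the connection. Those sockets will sit there until the owning
process closes them or exits, and since istgt's threads are wedged,
nothing ever does. A trivial, istgt-independent demonstration of the
state; port 5010 is just the portal port from the netstat output above:

/*
 * Demonstration only: a listener that accepts connections and then
 * deliberately never close()s them.  Once the peer disconnects, the
 * local socket sits in CLOSE_WAIT until this process closes it or
 * exits -- the same state the wedged istgt leaves behind.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    struct sockaddr_in sin;
    int s, one = 1;

    s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) {
        perror("socket");
        return (1);
    }
    setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(5010);     /* same port as the iSCSI portal above */

    if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
        listen(s, 128) < 0) {
        perror("bind/listen");
        return (1);
    }

    for (;;) {
        int c = accept(s, NULL, NULL);

        if (c >= 0)
            printf("accepted fd %d, never closing it\n", c);
        /* no close(c): after the peer hangs up, netstat shows this
         * socket in CLOSE_WAIT. */
    }
}

Connect and disconnect a few clients against it (e.g. with nc) and
netstat -an will show the accepted sockets parked in CLOSE_WAIT for as
long as the program keeps running.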
On the client side, I think the problem is related to a Chelsio N320
10G NIC which is showing RX overflows. After about 40000 overflows had
accumulated, the ABORT was received on the server side. I've never seen a
Chelsio card have overflow problems before. The server is using the same
model Chelsio card with no issues.
Again, any thoughts/comments are welcome!
Thanks,
John