11.0 stuck on high network load
Slawa Olhovchenkov
slw at zxy.spb.ru
Sun Sep 25 18:58:19 UTC 2016
On Fri, Sep 23, 2016 at 10:16:56PM +0300, Slawa Olhovchenkov wrote:
> On Thu, Sep 22, 2016 at 01:20:45PM +0300, Slawa Olhovchenkov wrote:
>
> > On Thu, Sep 22, 2016 at 12:04:40PM +0200, Julien Charbon wrote:
> >
> > > >> These paths can indeed compete for the same INP lock, as both
> > > >> tcp_tw_2msl_scan() calls always start with the first inp found in the
> > > >> twq_2msl list. But in both cases this first inp should be used quickly
> > > >> and its lock released anyway, so that could only explain your situation
> > > >> if the TCP stack is doing this all the time, for example:
> > > >>
> > > >> - Let's say you are constantly and completely running out of tcptw;
> > > >> then all connections transitioning to the TIME_WAIT state compete
> > > >> with the TIME_WAIT timeout scan that tries to free all the expired
> > > >> tcptw. If the stack is doing that all the time, it can appear to be
> > > >> "live" locked.
> > > >>
> > > >> This is just a hypothesis and, as usual, might be a red herring.
> > > >> Anyway, could you run:
> > > >>
> > > >> $ vmstat -z | head -2; vmstat -z | grep -E 'tcp|sock'
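
As a side note on the mechanism described above: below is a minimal toy model of the scan, with invented names and simplified list handling, not the real sys/netinet/tcp_timewait.c code. The only point it illustrates is that both callers, the periodic timer (reuse=0) and the "out of tcptw" path (reuse=1), start from the head of the same twq_2msl queue, so they always go for the same first entry.

/*
 * Toy model (simplified assumption, not the real FreeBSD code) of the
 * TIME_WAIT 2MSL scan: both call sites walk the queue from its head.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_tw {
        int             tw_id;          /* stand-in for the inpcb/tcptw pair */
        int             tw_expire;      /* expiry "time" in ticks */
        struct toy_tw   *tw_next;       /* singly linked 2MSL queue */
};

static struct toy_tw *twq_2msl_head;    /* oldest entry first */
static int ticks;                       /* fake clock */

/* Drop the head entry; in the kernel this is where the INP lock is taken. */
static void
toy_twclose(struct toy_tw *tw)
{
        printf("closing TIME_WAIT #%d (expire %d, now %d)\n",
            tw->tw_id, tw->tw_expire, ticks);
        twq_2msl_head = tw->tw_next;
        free(tw);
}

/*
 * With reuse == false only expired entries are freed; with reuse == true
 * the oldest entry is reclaimed even if not yet expired, and the scan
 * stops after that single entry.
 */
static void
toy_tw_2msl_scan(bool reuse)
{
        struct toy_tw *tw;

        while ((tw = twq_2msl_head) != NULL) {
                if (!reuse && tw->tw_expire > ticks)
                        break;          /* head not expired yet, done */
                toy_twclose(tw);
                if (reuse)
                        return;         /* one reclaimed entry is enough */
        }
}

int
main(void)
{
        /* Build a small queue: ids 1..5, expiring at ticks 10, 20, ... */
        for (int i = 5; i >= 1; i--) {
                struct toy_tw *tw = malloc(sizeof(*tw));
                tw->tw_id = i;
                tw->tw_expire = i * 10;
                tw->tw_next = twq_2msl_head;
                twq_2msl_head = tw;
        }

        ticks = 15;
        toy_tw_2msl_scan(false);        /* timer path: frees only #1 */
        toy_tw_2msl_scan(true);         /* exhaustion path: reclaims #2 early */
        return (0);
}

Whatever entry happens to sit first is the one both paths try to grab, which is what makes the contention hypothesis plausible if the reuse path fires constantly.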
> > > >
> > > > ITEM SIZE LIMIT USED FREE REQ FAIL SLEEP
> > > >
> > > > socket: 864, 4192664, 18604, 25348,49276158, 0, 0
> > > > tcp_inpcb: 464, 4192664, 34226, 18702,49250593, 0, 0
> > > > tcpcb: 1040, 4192665, 18424, 18953,49250593, 0, 0
> > > > tcptw: 88, 16425, 15802, 623,14526919, 8, 0
> > > > tcpreass: 40, 32800, 15, 2285, 632381, 0, 0
> > > >
> > > > In the normal case the tcptw line is about 16425/600/900 (LIMIT/USED/FREE).
> > > >
> > > > And after `sysctl -a | grep tcp` the system got stuck on the serial console, and I reset it.
> > > >
> > > >> Ideally, once when everything is ok, and once when you have the issue
> > > >> to see the differences (if any).
> > > >>
> > > >> If it appears you are quite low on tcptw, and if you have enough
> > > >> memory, could you try increasing the tcptw limit using sysctl
> > > >
> > > > I think this will not eliminate the hang, it may just make it less frequent.
> > >
> > > You are right, it would just be a big hint that the tcp_tw_2msl_scan()
> > > contention hypothesis is the right one. As I see you have plenty of
> > > memory on your server, could you try with:
> > >
> > > net.inet.tcp.maxtcptw=4192665
> > >
> > > And see what happens, just to validate this hypothesis.
> >
> > This is a bad way to validate it: with maxtcptw=16384 the hang happens at
> > random and can take a month to show up. With maxtcptw=4192665 I don't
> > know how long I would need to wait to verify this hypothesis.
> >
> > More frequently (maybe 3-5 times per day) I see smaller traffic drops
> > (not dropping to zero for minutes). Maybe these are also caused by
> > contention in tcp_tw_2msl_scan, but they resolve quickly (a stochastic
> > process). With its CPU time eaten, nginx can't service connections,
> > clients close their connections, more TIME_WAIT entries are needed, and
> > that can trigger tcp_tw_2msl_scan(reuse=1). After this we can get a
> > live lock.
> >
> > Maybe after I learn to catch and diagnose this, the validation will be
> > more accurate.
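
To make that reuse=1 path concrete, here is a hedged sketch with invented names, not the real tcp_twstart(): when the tcptw zone is at its limit, every connection entering TIME_WAIT first has to evict the oldest entry via the scan before its own allocation can succeed.

/*
 * Hedged sketch (invented names) of the "tcptw zone exhausted" path:
 * when the bounded zone has no free slots, the oldest TIME_WAIT entry is
 * reclaimed via the reuse scan and the allocation is retried, so under
 * constant exhaustion every close pays for that extra scan.
 */
#include <stdbool.h>
#include <stdio.h>

#define TOY_MAXTCPTW    4       /* stand-in for net.inet.tcp.maxtcptw */

static int tw_in_use;           /* entries currently in TIME_WAIT */
static int reuse_scans;         /* how often the eviction path fired */

/* Bounded allocator: fails once the configured limit is reached. */
static bool
toy_zalloc_tcptw(void)
{
        if (tw_in_use >= TOY_MAXTCPTW)
                return (false);
        tw_in_use++;
        return (true);
}

/* reuse == true evicts the oldest entry so one slot becomes free again. */
static void
toy_tw_2msl_scan(bool reuse)
{
        if (reuse && tw_in_use > 0) {
                tw_in_use--;
                reuse_scans++;
        }
}

/* Called when a connection transitions to TIME_WAIT. */
static bool
toy_twstart(void)
{
        if (!toy_zalloc_tcptw()) {
                toy_tw_2msl_scan(true);         /* evict the oldest ... */
                return (toy_zalloc_tcptw());    /* ... and retry once */
        }
        return (true);
}

int
main(void)
{
        /* 10 connections closing while the limit is only 4. */
        for (int i = 0; i < 10; i++)
                toy_twstart();
        printf("reuse scans triggered: %d of 10 closes\n", reuse_scans);
        return (0);
}

The non-zero FAIL count (8) on the tcptw line in the first vmstat output is consistent with this allocation-failure path being hit.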
>
> Some more bits:
>
> socket: 864, 4192664, 30806, 790,28524160, 0, 0
> ipq: 56, 32802, 0, 1278, 1022, 0, 0
> udp_inpcb: 464, 4192664, 44, 364, 14066, 0, 0
> udpcb: 32, 4192750, 44, 3081, 14066, 0, 0
> tcp_inpcb: 464, 4192664, 38558, 378,28476709, 0, 0
> tcpcb: 1040, 4192665, 30690, 738,28476709, 0, 0
> tcptw: 88, 32805, 7868, 772, 8412249, 0, 0
>
> last pid: 49575; load averages: 2.00, 2.05, 3.75 up 1+01:12:08 22:13:42
> 853 processes: 15 running, 769 sleeping, 35 waiting, 34 lock
> CPU 0: 0.0% user, 0.0% nice, 0.0% system, 100% interrupt, 0.0% idle
> CPU 1: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
> CPU 2: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
> CPU 3: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
> CPU 4: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
> CPU 5: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
> CPU 6: 0.0% user, 0.0% nice, 0.4% system, 0.0% interrupt, 99.6% idle
> CPU 7: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
> CPU 8: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
> CPU 9: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
> CPU 10: 0.0% user, 0.0% nice, 0.4% system, 0.0% interrupt, 99.6% idle
> CPU 11: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
> Mem: 8659M Active, 8385M Inact, 107G Wired, 1325M Free
> ARC: 99G Total, 88G MFU, 10G MRU, 32K Anon, 167M Header, 529M Other
> Swap: 32G Total, 32G Free
I can collect some more info.
First:
1046 - - R 125:10.79 nginx: worker process (nginx)
1047 kqread - I 150:11.98 nginx: worker process (nginx)
1049 kqread - I 145:58.35 nginx: worker process (nginx)
1050 kqread - I 136:33.36 nginx: worker process (nginx)
1051 kqread - I 140:59.73 nginx: worker process (nginx)
1052 kqread - I 137:18.12 nginx: worker process (nginx)
pid 1046 is the nginx worker running on CPU 0 (affinity mask set).
# procstat -k -k 1046
PID TID COMM TDNAME KSTACK
1046 100686 nginx - mi_switch+0xd2 critical_exit+0x7e lapic_handle_timer+0xb1 Xtimerint+0x8c __mtx_lock_sleep+0x168 zone_fetch_slab+0x47 zone_import+0x52 zone_alloc_item+0x36 keg_alloc_slab+0x63 keg_fetch_slab+0x16e zone_fetch_slab+0x6e zone_import+0x52 uma_zalloc_arg+0x36e m_getm2+0x14f m_uiotombuf+0x64 sosend_generic+0x356 soo_write+0x42 dofilewrite+0x87
Tracing command nginx pid 1046 tid 100686 td 0xfffff8014485f500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe20216992a0 /usr/src/sys/kern/sched_ule.c:1973
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe20216992d0 /usr/src/sys/kern/kern_synch.c:465
critical_exit() at 0xffffffff804a6bee = critical_exit+0x7e/frame 0xfffffe20216992f0 /usr/src/sys/kern/kern_switch.c:219
lapic_handle_timer() at 0xffffffff80771701 = lapic_handle_timer+0xb1/frame 0xfffffe2021699330 /usr/src/sys/x86/x86/local_apic.c:1185
Xtimerint() at 0xffffffff806cbbcc = Xtimerint+0x8c/frame 0xfffffe2021699330 /usr/src/sys/amd64/amd64/apic_vector.S:135
--- interrupt, rip = 0xffffffff804de424, rsp = 0xfffffe2021699400, rbp = 0xfffffe2021699420 ---
lock_delay() at 0xffffffff804de424 = lock_delay+0x54/frame 0xfffffe2021699420 /usr/src/sys/kern/subr_lock.c:127
__mtx_lock_sleep() at 0xffffffff80484dc8 = __mtx_lock_sleep+0x168/frame 0xfffffe20216994a0 /usr/src/sys/kern/kern_mutex.c:512
zone_fetch_slab() at 0xffffffff806a4257 = zone_fetch_slab+0x47/frame 0xfffffe20216994e0 /usr/src/sys/vm/uma_core.c:2378
zone_import() at 0xffffffff806a4312 = zone_import+0x52/frame 0xfffffe2021699530 /usr/src/sys/vm/uma_core.c:2501
zone_alloc_item() at 0xffffffff806a0986 = zone_alloc_item+0x36/frame 0xfffffe2021699570 /usr/src/sys/vm/uma_core.c:2591
keg_alloc_slab() at 0xffffffff806a2463 = keg_alloc_slab+0x63/frame 0xfffffe20216995d0 /usr/src/sys/vm/uma_core.c:965
keg_fetch_slab() at 0xffffffff806a48ce = keg_fetch_slab+0x16e/frame 0xfffffe2021699620 /usr/src/sys/vm/uma_core.c:2349
zone_fetch_slab() at 0xffffffff806a427e = zone_fetch_slab+0x6e/frame 0xfffffe2021699660 /usr/src/sys/vm/uma_core.c:2375
zone_import() at 0xffffffff806a4312 = zone_import+0x52/frame 0xfffffe20216996b0 /usr/src/sys/vm/uma_core.c:2501
uma_zalloc_arg() at 0xffffffff806a147e = uma_zalloc_arg+0x36e/frame 0xfffffe2021699720 /usr/src/sys/vm/uma_core.c:2531
m_getm2() at 0xffffffff8048231f = m_getm2+0x14f/frame 0xfffffe2021699790 /usr/src/sys/kern/kern_mbuf.c:830
m_uiotombuf() at 0xffffffff80516044 = m_uiotombuf+0x64/frame 0xfffffe20216997e0 /usr/src/sys/kern/uipc_mbuf.c:1535
sosend_generic() at 0xffffffff8051ce56 = sosend_generic+0x356/frame 0xfffffe20216998a0
soo_write() at 0xffffffff804fd872 = soo_write+0x42/frame 0xfffffe20216998d0
dofilewrite() at 0xffffffff804f5c97 = dofilewrite+0x87/frame 0xfffffe2021699920
kern_writev() at 0xffffffff804f5978 = kern_writev+0x68/frame 0xfffffe2021699970
sys_writev() at 0xffffffff804f5be6 = sys_writev+0x36/frame 0xfffffe20216999a0
amd64_syscall() at 0xffffffff806e4051 = amd64_syscall+0x2c1/frame 0xfffffe2021699ab0
Xfast_syscall() at 0xffffffff806cb2bb = Xfast_syscall+0xfb/frame 0xfffffe2021699ab0
--- syscall (121, FreeBSD ELF64, sys_writev), rip = 0x8019cc6ba, rsp = 0x7fffffffd688, rbp = 0x7fffffffd6c0 ---
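
The stack above is waiting on a mutex deep inside the allocator while building mbufs for a socket write (m_uiotombuf -> m_getm2 -> uma_zalloc_arg). As a purely illustrative model, not UMA itself: a two-level allocator where the fast path pops from a per-CPU cache without any shared lock, and only a cache miss falls through to a shared pool guarded by a single mutex, which is where all the workers can meet.

/*
 * Purely illustrative two-level allocator (not UMA): per-CPU caches on
 * the fast path, one shared mutex on the refill path.
 */
#include <pthread.h>
#include <stdlib.h>

#define NCPU            12      /* matches the 12 CPUs in the top output */
#define CACHE_SLOTS     8

struct pcpu_cache {
        void    *items[CACHE_SLOTS];
        int      count;
};

static struct pcpu_cache cache[NCPU];
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Slow path: refill one per-CPU cache from the shared pool. */
static void
pool_refill(struct pcpu_cache *c)
{
        pthread_mutex_lock(&pool_lock); /* every CPU serializes here */
        while (c->count < CACHE_SLOTS)
                c->items[c->count++] = malloc(4096);    /* "jumbo page" */
        pthread_mutex_unlock(&pool_lock);
}

/* Allocation entry point, analogous in spirit to uma_zalloc_arg(). */
static void *
toy_zalloc(int cpu)
{
        struct pcpu_cache *c = &cache[cpu];

        if (c->count == 0)              /* cache miss: take the shared lock */
                pool_refill(c);
        return (c->items[--c->count]);  /* fast path: no lock at all */
}

int
main(void)
{
        /* Single-threaded demo: only the first call takes the lock. */
        for (int i = 0; i < 3; i++)
                (void)toy_zalloc(0);
        return (0);
}

In the real kernel the per-CPU caches are protected by critical sections, which the toy omits; the only point is where the single shared lock sits once the caches run dry.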
And I have now collected and saved a kernel dump.
# vmstat -M /var/crash/vmcore.1 -z| grep -i mbuf
mbuf_packet: 256, 52276395, 26987, 189713,705271653, 0, 0
mbuf: 256, 52276395, 61994, 782439,46250344524, 0, 0
mbuf_cluster: 2048, 1048576, 49207, 34461, 293307, 0, 0
mbuf_jumbo_page: 4096, 4084093, 61839, 613537,8265099417, 0, 0
mbuf_jumbo_9k: 9216, 3630303, 0, 0, 0, 0, 0
mbuf_jumbo_16k: 16384, 2722728, 0, 0, 0, 0, 0