9.2 ixgbe tx queue hang (was: Network loss)
Markus Gebert
markus.gebert at hostpoint.ch
Fri Mar 7 13:55:48 UTC 2014
On 06.03.2014, at 22:38, Jack Vogel <jfvogel at gmail.com> wrote:
> I suppose to be sure where the issue really occurs it would be best if both
> pseudo drivers were out of the picture. Once we see if it still occurs we
> can take the next step.
>
> Regards,
>
> Jack
I removed the lagg and vlan interfaces on four servers shortly after your email, and one of them is already showing the symptoms. I think we caught it earlier this time, and we saw EFBIG instead of ENOBUFS, still coming from only two CPUs (i.e. two tx queues):
# cpuset -l 0 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: icmp_seq=0 ttl=255 time=1.150 ms
[…]
# cpuset -l 1 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: icmp_seq=0 ttl=255 time=0.385 ms
[…]
# cpuset -l 2 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: icmp_seq=0 ttl=255 time=0.385 ms
[…]
# cpuset -l 3 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
ping: sendto: File too large
ping: sendto: File too large
ping: sendto: File too large
[…]
# cpuset -l 4 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
ping: sendto: File too large
ping: sendto: File too large
ping: sendto: File too large
[…]
# cpuset -l 5 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: icmp_seq=0 ttl=255 time=0.316 ms
[… (other cpus work)]
So the fact that only a few CPUs/tx queues are affected remains true even without lagg/vlans. But this time we got a new stack trace:
# dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / { stack(); }'
dtrace: description 'fbt:::return ' matched 25400 probes
[…]
3 28502 _bus_dmamap_load_buffer:return
kernel`_bus_dmamap_load_mbuf_sg+0x5f
kernel`bus_dmamap_load_mbuf_sg+0x38
kernel`ixgbe_xmit+0xcf
kernel`ixgbe_mq_start_locked+0x94
kernel`ixgbe_mq_start+0x12a
kernel`ether_output_frame+0x33
kernel`ether_output+0x4fe
kernel`ip_output+0xd74
kernel`rip_output+0x229
kernel`sosend_generic+0x3f6
kernel`kern_sendit+0x1a3
kernel`sendit+0xdc
kernel`sys_sendto+0x4d
kernel`amd64_syscall+0x5ea
kernel`0xffffffff80d35667
[…]
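If I read this right, the error now originates in the DMA mapping step inside ixgbe_xmit() rather than in the enqueue path. For reference, here is how I understand that path; this is a simplified sketch assuming the driver's own types and headers, not the actual stable/9 code (example_xmit_map is just a stand-in name, details from memory):

/*
 * Simplified sketch of the mapping step in ixgbe_xmit().  An mbuf chain
 * with too many scatter/gather segments makes bus_dmamap_load_mbuf_sg()
 * return EFBIG; the driver tries m_defrag() once and, if mapping still
 * fails, passes the error back up the stack -- which is what ping then
 * reports as "File too large".
 */
static int
example_xmit_map(struct tx_ring *txr, bus_dmamap_t map, struct mbuf **m_headp)
{
        bus_dma_segment_t segs[32];     /* max scatter/gather entries */
        int nsegs, error;

        error = bus_dmamap_load_mbuf_sg(txr->txtag, map, *m_headp,
            segs, &nsegs, BUS_DMA_NOWAIT);
        if (error == EFBIG) {
                /* Too many segments: defragment the chain and retry once. */
                struct mbuf *m = m_defrag(*m_headp, M_NOWAIT);
                if (m == NULL) {
                        m_freem(*m_headp);
                        *m_headp = NULL;
                        return (ENOBUFS);
                }
                *m_headp = m;
                error = bus_dmamap_load_mbuf_sg(txr->txtag, map, *m_headp,
                    segs, &nsegs, BUS_DMA_NOWAIT);
        }
        if (error != 0) {               /* the EFBIG we trace ends up here */
                m_freem(*m_headp);
                *m_headp = NULL;
                return (error);
        }
        /* ...descriptor setup would follow here... */
        return (0);
}

Like Rick wonders further down in this thread, it's not obvious to me how a 56 byte ping packet would ever need that many segments.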
But after a while, everything went back to how it was yesterday (ENOBUFS):
# cpuset -l 3 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
ping: sendto: No buffer space available
[… (cpu 4 the same, other cpus work)]
and the already known stacktrace:
# dtrace -n 'fbt:::return / arg1 == ENOBUFS && execname == "ping" / { stack(); }'
dtrace: description 'fbt:::return ' matched 25400 probes
CPU ID FUNCTION:NAME
3 7730 ixgbe_mq_start:return
kernel`ether_output_frame+0x33
kernel`ether_output+0x4fe
kernel`ip_output+0xd74
kernel`rip_output+0x229
kernel`sosend_generic+0x3f6
kernel`kern_sendit+0x1a3
kernel`sendit+0xdc
kernel`sys_sendto+0x4d
kernel`amd64_syscall+0x5ea
kernel`0xffffffff80d35667
So in my opinion the different behaviour in the first few minutes was not because lagg and vlans were removed; we just caught it early, before the buf_ring had filled up. But at least we now have a stack trace of what happens before that, which will hopefully help with debugging this.
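Just to spell out my reasoning: as far as I can tell, when an mbuf carries no flow id the transmit queue is picked from curcpu, and the ENOBUFS comes straight from drbr_enqueue() once that queue's buf_ring is full. Roughly like this (again a simplified sketch, not the exact stable/9 code; example_mq_start is a stand-in name and the locking/deferral details are from memory):

/*
 * Simplified sketch of ixgbe_mq_start() -- only the logic relevant here.
 */
static int
example_mq_start(struct ifnet *ifp, struct mbuf *m)
{
        struct adapter *adapter = ifp->if_softc;
        struct tx_ring *txr;
        int i, err;

        /* No flowid (e.g. a locally generated ping): pick the queue by CPU. */
        if (m->m_flags & M_FLOWID)
                i = m->m_pkthdr.flowid % adapter->num_queues;
        else
                i = curcpu % adapter->num_queues;
        txr = &adapter->tx_rings[i];

        /* Returns ENOBUFS as soon as this queue's buf_ring is full. */
        err = drbr_enqueue(ifp, txr->br, m);
        if (err)
                return (err);

        /* Try to drain the ring now, or defer to the queue's tx task. */
        if (IXGBE_TX_TRYLOCK(txr)) {
                err = ixgbe_mq_start_locked(ifp, txr);
                IXGBE_TX_UNLOCK(txr);
        } else
                taskqueue_enqueue(adapter->queues[i].tq, &txr->txq_task);

        return (err);
}

Assuming eight queues on these 32-core boxes, CPU 3 maps to queue 3 and CPU 4 to queue 4, which matches both the cpuset results and the stuck queue3/queue4 sysctls below.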
I also noticed that the txd head/tail sysctls barely change, or don't change at all, on the affected queues:
# while true; do sysctl dev.ix.1.queue3 dev.ix.1.queue4 | grep txd_; echo ----; sleep 1; done
dev.ix.1.queue3.txd_head: 717
dev.ix.1.queue3.txd_tail: 717
dev.ix.1.queue4.txd_head: 1517
dev.ix.1.queue4.txd_tail: 1517
----
dev.ix.1.queue3.txd_head: 717
dev.ix.1.queue3.txd_tail: 717
dev.ix.1.queue4.txd_head: 1517
dev.ix.1.queue4.txd_tail: 1517
----
dev.ix.1.queue3.txd_head: 717
dev.ix.1.queue3.txd_tail: 717
dev.ix.1.queue4.txd_head: 1517
dev.ix.1.queue4.txd_tail: 1517
----
dev.ix.1.queue3.txd_head: 717
dev.ix.1.queue3.txd_tail: 717
dev.ix.1.queue4.txd_head: 1517
dev.ix.1.queue4.txd_tail: 1517
I guess that's expected when the buf_ring is full… But maybe it's important to you that I might have seen these values rise just once, after the problem had already started.
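If I understand the drain path correctly, that would fit the EFBIG trace above: ixgbe_mq_start_locked() pulls packets off the buf_ring and hands them to ixgbe_xmit(), and if ixgbe_xmit() keeps failing, nothing ever reaches the descriptor ring, so txd_head/txd_tail stop moving and the buf_ring eventually fills up and starts returning ENOBUFS at enqueue time. Roughly (once more a simplified sketch, not the exact stable/9 code; example_mq_start_locked is a stand-in name):

/*
 * Simplified sketch of the buf_ring drain loop in ixgbe_mq_start_locked().
 */
static int
example_mq_start_locked(struct ifnet *ifp, struct tx_ring *txr)
{
        struct mbuf *next;
        int err = 0;

        while ((next = drbr_dequeue(ifp, txr->br)) != NULL) {
                err = ixgbe_xmit(txr, &next);
                if (err != 0) {
                        /*
                         * Mapping or descriptor failure: put the mbuf back
                         * (unless ixgbe_xmit already freed it) and stop
                         * draining.  Nothing reaches the hardware ring, so
                         * txd_head/txd_tail stay where they are.
                         */
                        if (next != NULL)
                                err = drbr_enqueue(ifp, txr->br, next);
                        break;
                }
                /* Success: the packet is now on the hardware descriptor ring. */
        }
        return (err);
}

If that's right, the full buf_ring would just be a symptom of ixgbe_xmit() failing repeatedly on that queue.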
And netstat -m:
# netstat -m
37872/10263/48135 mbufs in use (current/cache/total)
33142/11680/44822/132096 mbuf clusters in use (current/cache/total/max)
33044/6508 mbuf+clusters out of packet secondary zone in use (current/cache)
94/2995/3089/66048 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/33024 9k jumbo clusters in use (current/cache/total/max)
0/0/0/16512 16k jumbo clusters in use (current/cache/total/max)
76128K/37905K/114033K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines
Markus
> On Thu, Mar 6, 2014 at 12:58 PM, Markus Gebert <markus.gebert at hostpoint.ch> wrote:
>
>>
>> On 06.03.2014, at 19:33, Jack Vogel <jfvogel at gmail.com> wrote:
>>
>>> You did not make it explicit before, but I noticed in your dtrace info
>>> that you are using lagg. It's been the source of lots of problems, so
>>> take it out of the setup and see if this queue problem still happens
>>> please.
>>>
>>> Jack
>>
>> Well, last year when upgrading another batch of servers (same hardware) to
>> 9.2, we tried to find a solution to this network problem, and we eliminated
>> lagg where we had used it before, which did not help at all. That's why I
>> didn't mention it explicitly.
>>
>> My point is, I can confirm that 9.2 has network problems on this same
>> hardware with or without lagg, so it's unlikely that removing it will bring
>> immediate success. OTOH, I didn't have this tx queue theory back then, so I
>> cannot be sure that what we saw then without lagg, and what we see now with
>> lagg, really are the same problem.
>>
>> I guess, for the sake of simplicity, I will remove lagg on these new
>> systems. But before I do that, to save time, I wanted to ask whether I
>> should remove the vlan interfaces too. While that didn't help either last
>> year, my guess is that I should take them out of the picture, unless you
>> say otherwise.
>>
>> Thanks for looking into this.
>>
>>
>> Markus
>>
>>
>>
>>> On Thu, Mar 6, 2014 at 2:24 AM, Markus Gebert <markus.gebert at hostpoint.ch> wrote:
>>>
>>>> (creating a new thread, because I'm no longer sure this is related to
>>>> Johan's thread that I originally used to discuss this)
>>>>
>>>> On 27.02.2014, at 18:02, Jack Vogel <jfvogel at gmail.com> wrote:
>>>>
>>>>> I would make SURE that you have enough mbuf resources of whatever size
>>>>> pool you are using (2, 4, 9K), and I would try the code in HEAD if you
>>>>> had not.
>>>>>
>>>>> Jack
>>>>
>>>> Jack, we've upgraded some other systems on which I get more time to
>>>> debug (no impact for customers). Although those systems use the
>>>> nfsclient too, I no longer think that NFS is the source of the problem
>>>> (hence the new thread). I think it's the ixgbe driver and/or card. When
>>>> our problem occurs, it looks like it's a single tx queue that gets stuck
>>>> somehow (its buf_ring remains full).
>>>>
>>>> I tracked ping using dtrace to determine the source of ENOBUFS it
>>>> returns every few packets when things get weird:
>>>>
>>>> # dtrace -n 'fbt:::return / arg1 == ENOBUFS && execname == "ping" / {
>>>> stack(); }'
>>>> dtrace: description 'fbt:::return ' matched 25476 probes
>>>> CPU ID FUNCTION:NAME
>>>> 26 7730 ixgbe_mq_start:return
>>>> if_lagg.ko`lagg_transmit+0xc4
>>>> kernel`ether_output_frame+0x33
>>>> kernel`ether_output+0x4fe
>>>> kernel`ip_output+0xd74
>>>> kernel`rip_output+0x229
>>>> kernel`sosend_generic+0x3f6
>>>> kernel`kern_sendit+0x1a3
>>>> kernel`sendit+0xdc
>>>> kernel`sys_sendto+0x4d
>>>> kernel`amd64_syscall+0x5ea
>>>> kernel`0xffffffff80d35667
>>>>
>>>>
>>>>
>>>> The only way ixgbe_mq_start could return ENOBUFS would be when
>>>> drbr_enqueue() encounters a full tx buf_ring. Since a new ping packet
>>>> probably has no flow id, it should be assigned to a queue based on
>>>> curcpu, which made me try to pin ping to single cpus to check whether
>>>> it's always the same tx buf_ring that reports being full. This turned
>>>> out to be true:
>>>>
>>>> # cpuset -l 0 ping 10.0.4.5
>>>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>>>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.347 ms
>>>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.135 ms
>>>>
>>>> # cpuset -l 1 ping 10.0.4.5
>>>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>>>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.184 ms
>>>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.232 ms
>>>>
>>>> # cpuset -l 2 ping 10.0.4.5
>>>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>>>> ping: sendto: No buffer space available
>>>> ping: sendto: No buffer space available
>>>> ping: sendto: No buffer space available
>>>> ping: sendto: No buffer space available
>>>> ping: sendto: No buffer space available
>>>>
>>>> # cpuset -l 3 ping 10.0.4.5
>>>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>>>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.130 ms
>>>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.126 ms
>>>> [...snip...]
>>>>
>>>> The system has 32 cores; if ping runs on cpu 2, 10, 18 or 26, which use
>>>> the third tx buf_ring, ping reliably returns ENOBUFS. If ping is run on
>>>> any other cpu, using any other tx queue, it runs without any packet loss.
>>>>
>>>> So, when ENOBUFS is returned, this is not due to an mbuf shortage, it's
>>>> because the buf_ring is full. Not surprisingly, netstat -m looks pretty
>>>> normal:
>>>>
>>>> # netstat -m
>>>> 38622/11823/50445 mbufs in use (current/cache/total)
>>>> 32856/11642/44498/132096 mbuf clusters in use (current/cache/total/max)
>>>> 32824/6344 mbuf+clusters out of packet secondary zone in use
>>>> (current/cache)
>>>> 16/3906/3922/66048 4k (page size) jumbo clusters in use
>>>> (current/cache/total/max)
>>>> 0/0/0/33024 9k jumbo clusters in use (current/cache/total/max)
>>>> 0/0/0/16512 16k jumbo clusters in use (current/cache/total/max)
>>>> 75431K/41863K/117295K bytes allocated to network (current/cache/total)
>>>> 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
>>>> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
>>>> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
>>>> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
>>>> 0/0/0 sfbufs in use (current/peak/max)
>>>> 0 requests for sfbufs denied
>>>> 0 requests for sfbufs delayed
>>>> 0 requests for I/O initiated by sendfile
>>>> 0 calls to protocol drain routines
>>>>
>>>> In the meantime I've checked the commit log of the ixgbe driver in HEAD,
>>>> and besides the fact that there are only small differences between HEAD
>>>> and 9.2, I don't see a commit that fixes anything related to what we're
>>>> seeing...
>>>>
>>>> So, what's the conclusion here? Firmware bug that's only triggered under
>>>> 9.2? Driver bug introduced between 9.1 and 9.2 when new multiqueue stuff
>>>> was added? Jack, how should we proceed?
>>>>
>>>>
>>>> Markus
>>>>
>>>>
>>>>
>>>> On Thu, Feb 27, 2014 at 8:05 AM, Markus Gebert <markus.gebert at hostpoint.ch> wrote:
>>>>
>>>>>
>>>>> On 27.02.2014, at 02:00, Rick Macklem <rmacklem at uoguelph.ca> wrote:
>>>>>
>>>>>> John Baldwin wrote:
>>>>>>> On Tuesday, February 25, 2014 2:19:01 am Johan Kooijman wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I have a weird situation here where I can't get my head around.
>>>>>>>>
>>>>>>>> One FreeBSD 9.2-STABLE ZFS/NFS box, multiple Linux clients. Once in
>>>>>>>> a while the Linux clients lose their NFS connection:
>>>>>>>>
>>>>>>>> Feb 25 06:24:09 hv3 kernel: nfs: server 10.0.24.1 not responding,
>>>>>>>> timed out
>>>>>>>>
>>>>>>>> Not all boxes, just one out of the cluster. The weird part is that
>>>>>>>> when I try to ping a Linux client from the FreeBSD box, I have
>>>>>>>> between 10 and 30% packet loss - all day long, no specific
>>>>>>>> timeframe. If I ping the Linux clients - no loss. If I ping back
>>>>>>>> from the Linux clients to the FBSD box - no loss.
>>>>>>>>
>>>>>>>> The errors I get when pinging a Linux client is this one:
>>>>>>>> ping: sendto: File too large
>>>>>
>>>>> We were facing similar problems when upgrading to 9.2 and have stayed
>>>>> with 9.1 on affected systems for now. We've seen this on HP G8 blades
>>>>> with 82599EB controllers:
>>>>>
>>>>> ix0 at pci0:4:0:0: class=0x020000 card=0x18d0103c chip=0x10f88086 rev=0x01 hdr=0x00
>>>>>     vendor     = 'Intel Corporation'
>>>>>     device     = '82599EB 10 Gigabit Dual Port Backplane Connection'
>>>>>     class      = network
>>>>>     subclass   = ethernet
>>>>>
>>>>> We didn't find a way to trigger the problem reliably. But when it
>>>>> occurs, it usually affects only one interface. Symptoms include:
>>>>>
>>>>> - socket functions return the 'File too large' error mentioned by Johan
>>>>> - socket functions return 'No buffer space available'
>>>>> - heavy to full packet loss on the affected interface
>>>>> - "stuck" TCP connections, i.e. ESTABLISHED TCP connections that should
>>>>>   have timed out stick around forever (the socket on the other side
>>>>>   could have been closed hours ago)
>>>>> - userland programs using the corresponding sockets usually got stuck
>>>>>   too (can't find kernel traces right now, but always in network-related
>>>>>   syscalls)
>>>>>
>>>>> Network is only lightly loaded on the affected systems (usually 5-20
>>>>> mbit, capped at 200 mbit, per server), and netstat never showed any
>>>>> indication of a resource shortage (like mbufs).
>>>>>
>>>>> What made the problem go away temporarily was to ifconfig down/up the
>>>>> affected interface.
>>>>>
>>>>> We tested a 9.2 kernel with the 9.1 ixgbe driver, which was not really
>>>>> stable. Also, we tested a few revisions between 9.1 and 9.2 to find out
>>>>> when the problem started. Unfortunately, the ixgbe driver turned out to
>>>>> be mostly unstable on our systems between these releases, worse than on
>>>>> 9.2. The instability was introduced shortly after 9.1 and fixed only
>>>>> very shortly before the 9.2 release. So no luck there. We ended up using
>>>>> 9.1 with backports of the 9.2 features we really need.
>>>>>
>>>>> What we can't tell is whether it's the 9.2 kernel or the 9.2 ixgbe
>>>>> driver, or a combination of both, that causes these problems.
>>>>> Unfortunately we ran out of time (and ideas).
>>>>>
>>>>>
>>>>>>> EFBIG is sometimes used by drivers when a packet takes too many
>>>>>>> scatter/gather entries. Since you mentioned NFS, one thing you can try
>>>>>>> is to disable TSO on the interface you are using for NFS to see if
>>>>>>> that "fixes" it.
>>>>>>>
>>>>>> And please email if you try it and let us know if it helps.
>>>>>>
>>>>>> I think I've figured out how 64K NFS read replies can do this, but
>>>>>> I'll admit "ping" is a mystery. (Doesn't it just send a single packet
>>>>>> that would be in a single mbuf?)
>>>>>>
>>>>>> I think the EFBIG is returned by bus_dmamap_load_mbuf_sg(), but I don't
>>>>>> know if it can happen for an mbuf chain with < 32 entries?
>>>>>
>>>>> We don't use the nfs server on our systems, but they're
>>>>> (new)nfsclients. So I don't think our problem is nfs related, unless the
>>>>> default rsize/wsize for client mounts is not 8K, which I thought it was.
>>>>> Can you confirm this, Rick?
>>>>>
>>>>> IIRC, disabling TSO did not make any difference in our case.
>>>>>
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>