9.2 ixgbe tx queue hang (was: Network loss)

Markus Gebert markus.gebert at hostpoint.ch
Thu Mar 6 20:58:23 UTC 2014


On 06.03.2014, at 19:33, Jack Vogel <jfvogel at gmail.com> wrote:

> You did not make it explicit before, but I noticed in your dtrace info that
> you are using lagg. It's been the source of lots of problems, so please take
> it out of the setup and see if this queue problem still happens.
> 
> Jack

Well, last year, when upgrading another batch of servers (same hardware) to 9.2, we tried to find a solution to this network problem, and we eliminated lagg where we had used it before, which did not help at all. That’s why I didn’t mention it explicitly.

My point is, I can confirm that 9.2 has network problems on this same hardware with or without lagg, so it’s unlikely that removing it will bring immediate success. OTOH, I didn’t have this tx queue theory back then, so I cannot be sure that what we saw then without lagg, and what we see now with lagg, really are the same problem.

I guess that, for the sake of simplicity, I will remove lagg on these new systems. But before I do that, to save time, I wanted to ask whether I should remove the vlan interfaces too. While that didn’t help either last year, my guess is that I should take them out of the picture, unless you say otherwise.
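
In case it matters, this is roughly the change I have in mind, assuming our usual rc.conf-style setup (interface names, vlan tag and address below are placeholders, not our real values):

# current setup (placeholders)
cloned_interfaces="lagg0"
ifconfig_ix0="up"
ifconfig_ix1="up"
ifconfig_lagg0="laggproto lacp laggport ix0 laggport ix1"
vlans_lagg0="4"
ifconfig_lagg0_4="inet 10.0.4.10/24"

# test setup without lagg (vlan moved onto ix0, unless you want that gone too)
ifconfig_ix0="up"
vlans_ix0="4"
ifconfig_ix0_4="inet 10.0.4.10/24"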

Thanks for looking into this.


Markus



> On Thu, Mar 6, 2014 at 2:24 AM, Markus Gebert
> <markus.gebert at hostpoint.ch> wrote:
> 
>> (creating a new thread, because I'm no longer sure this is related to
>> Johan's thread that I originally used to discuss this)
>> 
>> On 27.02.2014, at 18:02, Jack Vogel <jfvogel at gmail.com> wrote:
>> 
>>> I would make SURE that you have enough mbuf resources of whatever size
>>> pool you are using (2, 4, 9K), and I would try the code in HEAD if you
>>> have not.
>>> 
>>> Jack
>> 
>> Jack, we've upgraded some other systems on which I have more time to debug
>> (no customer impact). Although those systems use the nfs client too, I no
>> longer think that NFS is the source of the problem (hence the new thread).
>> I think it's the ixgbe driver and/or the card. When our problem occurs, it
>> looks like a single tx queue gets stuck somehow (its buf_ring remains full).
>> 
>> I traced ping with dtrace to determine where the ENOBUFS it gets every few
>> packets comes from when things get weird:
>> 
>> # dtrace -n 'fbt:::return / arg1 == ENOBUFS && execname == "ping" / { stack(); }'
>> dtrace: description 'fbt:::return ' matched 25476 probes
>> CPU     ID                    FUNCTION:NAME
>> 26   7730            ixgbe_mq_start:return
>>              if_lagg.ko`lagg_transmit+0xc4
>>              kernel`ether_output_frame+0x33
>>              kernel`ether_output+0x4fe
>>              kernel`ip_output+0xd74
>>              kernel`rip_output+0x229
>>              kernel`sosend_generic+0x3f6
>>              kernel`kern_sendit+0x1a3
>>              kernel`sendit+0xdc
>>              kernel`sys_sendto+0x4d
>>              kernel`amd64_syscall+0x5ea
>>              kernel`0xffffffff80d35667
>> 
>> 
>> 
>> The only way ixgbe_mq_start could return ENOBUFS would be when
>> drbr_enqueue() encounters a full tx buf_ring. Since a new ping packet
>> probably has no flow id, it should be assigned to a queue based on curcpu.
>> That made me pin ping to individual cpus to check whether it's always the
>> same tx buf_ring that reports being full (a simplified sketch of the
>> driver's queue selection follows the ping output below). This turned out
>> to be true:
>> 
>> # cpuset -l 0 ping 10.0.4.5
>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.347 ms
>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.135 ms
>> 
>> # cpuset -l 1 ping 10.0.4.5
>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.184 ms
>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.232 ms
>> 
>> # cpuset -l 2 ping 10.0.4.5
>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>> ping: sendto: No buffer space available
>> ping: sendto: No buffer space available
>> ping: sendto: No buffer space available
>> ping: sendto: No buffer space available
>> ping: sendto: No buffer space available
>> 
>> # cpuset -l 3 ping 10.0.4.5
>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.130 ms
>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.126 ms
>> [...snip...]
>> 
>> The system has 32 cores. If ping runs on cpu 2, 10, 18 or 26, which all use
>> the third tx buf_ring (the failing cpus differ by 8, consistent with eight
>> tx queues and curcpu % 8 == 2), ping reliably gets ENOBUFS. If ping is run
>> on any other cpu, and therefore uses any other tx queue, it runs without any
>> packet loss.
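>>
>> For reference, here is a simplified sketch of how I read the queue selection
>> in ixgbe_mq_start() in the 9.2 driver (written from memory, so field and
>> macro names may not match the real code exactly, and the real function also
>> tries to take the ring lock and drain the ring, which I left out):
>>
>> static int
>> ixgbe_mq_start_sketch(struct ifnet *ifp, struct mbuf *m)
>> {
>> 	struct adapter	*adapter = ifp->if_softc;
>> 	struct tx_ring	*txr;
>> 	int		i;
>>
>> 	/* Pick a tx queue: use the flowid if the stack provided one,
>> 	 * otherwise fall back to the current cpu. */
>> 	if (m->m_flags & M_FLOWID)
>> 		i = m->m_pkthdr.flowid % adapter->num_queues;
>> 	else
>> 		i = curcpu % adapter->num_queues;
>>
>> 	txr = &adapter->tx_rings[i];
>>
>> 	/* The mbuf ends up on this ring's buf_ring; drbr_enqueue()
>> 	 * returns ENOBUFS when that buf_ring is full, which is what
>> 	 * ping keeps running into on queue 2. */
>> 	return (drbr_enqueue(ifp, txr->br, m));
>> }
>>
>> So with no flowid, the tx queue an outgoing packet lands on depends only on
>> the cpu the sending thread happens to run on, which is why pinning ping with
>> cpuset isolates the stuck ring so nicely.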
>> 
>> So, when ENOBUFS is returned, it is not due to an mbuf shortage; it is
>> because that queue's buf_ring is full. Not surprisingly, netstat -m looks
>> pretty normal:
>> 
>> # netstat -m
>> 38622/11823/50445 mbufs in use (current/cache/total)
>> 32856/11642/44498/132096 mbuf clusters in use (current/cache/total/max)
>> 32824/6344 mbuf+clusters out of packet secondary zone in use (current/cache)
>> 16/3906/3922/66048 4k (page size) jumbo clusters in use (current/cache/total/max)
>> 0/0/0/33024 9k jumbo clusters in use (current/cache/total/max)
>> 0/0/0/16512 16k jumbo clusters in use (current/cache/total/max)
>> 75431K/41863K/117295K bytes allocated to network (current/cache/total)
>> 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
>> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
>> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
>> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
>> 0/0/0 sfbufs in use (current/peak/max)
>> 0 requests for sfbufs denied
>> 0 requests for sfbufs delayed
>> 0 requests for I/O initiated by sendfile
>> 0 calls to protocol drain routines
>> 
>> In the meantime I've checked the commit log of the ixgbe driver in HEAD.
>> There are only small differences between HEAD and 9.2, and I don't see a
>> commit that fixes anything related to what we're seeing...
>> 
>> So, what's the conclusion here? Firmware bug that's only triggered under
>> 9.2? Driver bug introduced between 9.1 and 9.2 when new multiqueue stuff
>> was added? Jack, how should we proceed?
>> 
>> 
>> Markus
>> 
>> 
>> 
>> On Thu, Feb 27, 2014 at 8:05 AM, Markus Gebert
>> <markus.gebert at hostpoint.ch> wrote:
>> 
>>> 
>>> On 27.02.2014, at 02:00, Rick Macklem <rmacklem at uoguelph.ca> wrote:
>>> 
>>>> John Baldwin wrote:
>>>>> On Tuesday, February 25, 2014 2:19:01 am Johan Kooijman wrote:
>>>>>> Hi all,
>>>>>> 
>>>>>> I have a weird situation here where I can't get my head around.
>>>>>> 
>>>>>> One FreeBSD 9.2-STABLE ZFS/NFS box, multiple Linux clients. Once in a
>>>>>> while the Linux clients lose their NFS connection:
>>>>>> 
>>>>>> Feb 25 06:24:09 hv3 kernel: nfs: server 10.0.24.1 not responding,
>>>>>> timed out
>>>>>> 
>>>>>> Not all boxes, just one out of the cluster. The weird part is that when
>>>>>> I try to ping a Linux client from the FreeBSD box, I have between 10 and
>>>>>> 30% packet loss - all day long, no specific timeframe. If I ping the
>>>>>> Linux clients - no loss. If I ping back from the Linux clients to the
>>>>>> FreeBSD box - no loss.
>>>>>> 
>>>>>> The error I get when pinging a Linux client is this one:
>>>>>> ping: sendto: File too large
>>> 
>>> We were facing similar problems when upgrading to 9.2 and have stayed with
>>> 9.1 on affected systems for now. We've seen this on HP G8 blades with
>>> 82599EB controllers:
>>> 
>>> ix0 at pci0:4:0:0: class=0x020000 card=0x18d0103c chip=0x10f88086 rev=0x01
>>> hdr=0x00
>>>   vendor     = 'Intel Corporation'
>>>   device     = '82599EB 10 Gigabit Dual Port Backplane Connection'
>>>   class      = network
>>>   subclass   = ethernet
>>> 
>>> We didn't find a way to trigger the problem reliably. But when it occurs,
>>> it usually affects only one interface. Symptoms include:
>>> 
>>> - socket functions return the 'File too large' error mentioned by Johan
>>> - socket functions return 'No buffer space available'
>>> - heavy to complete packet loss on the affected interface
>>> - "stuck" TCP connections, i.e. ESTABLISHED TCP connections that should
>>> have timed out stick around forever (the socket on the other side may have
>>> been closed hours ago)
>>> - userland programs using the corresponding sockets usually got stuck too
>>> (can't find the kernel traces right now, but always in network-related
>>> syscalls)
>>> 
>>> The network is only lightly loaded on the affected systems (usually 5-20
>>> mbit, capped at 200 mbit, per server), and netstat never showed any
>>> indication of a resource shortage (like mbufs).
>>> 
>>> What made the problem go away temporarily was an ifconfig down/up of the
>>> affected interface.
>>> 
>>> We tested a 9.2 kernel with the 9.1 ixgbe driver, which was not really
>>> stable. We also tested a few revisions between 9.1 and 9.2 to find out when
>>> the problem started. Unfortunately, the ixgbe driver turned out to be mostly
>>> unstable on our systems between these releases, worse than on 9.2: the
>>> instability was introduced shortly after 9.1 and fixed only very shortly
>>> before the 9.2 release. So no luck there. We ended up using 9.1 with
>>> backports of the 9.2 features we really need.
>>> 
>>> What we can't tell is whether it's the 9.2 kernel or the 9.2 ixgbe driver
>>> or a combination of both that causes these problems. Unfortunately we ran
>>> out of time (and ideas).
>>> 
>>> 
>>>>> EFBIG is sometimes used by drivers when a packet takes too many
>>>>> scatter/gather entries.  Since you mentioned NFS, one thing you can try
>>>>> is to disable TSO on the interface you are using for NFS to see if that
>>>>> "fixes" it.
>>>>> 
>>>> And please email if you try it and let us know if it helps.
>>>> 
>>>> I think I've figured out how 64K NFS read replies can do this,
>>>> but I'll admit "ping" is a mystery. (Doesn't it just send a single
>>>> packet that would fit in a single mbuf?)
>>>> 
>>>> I think the EFBIG is returned by bus_dmamap_load_mbuf_sg(), but I
>>>> don't know if it can happen for an mbuf chain with < 32 entries?
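>>>>
>>>> For what it's worth, the usual pattern in the ethernet drivers looks
>>>> roughly like this (a generic sketch, not the actual ixgbe code, and the
>>>> variable names are made up):
>>>>
>>>> error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
>>>>     segs, &nsegs, BUS_DMA_NOWAIT);
>>>> if (error == EFBIG) {
>>>> 	/* Too many scatter/gather segments: compact the mbuf
>>>> 	 * chain and try to load it once more. */
>>>> 	struct mbuf *m = m_defrag(m_head, M_DONTWAIT);
>>>> 	if (m == NULL)
>>>> 		return (ENOBUFS);
>>>> 	m_head = m;
>>>> 	error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
>>>> 	    segs, &nsegs, BUS_DMA_NOWAIT);
>>>> }
>>>> if (error != 0)
>>>> 	/* an EFBIG here is what userland eventually sees as the
>>>> 	 * "File too large" errno */
>>>> 	return (error);
>>>>
>>>> So if EFBIG makes it out to userland, the m_defrag() retry presumably
>>>> failed as well, or that path isn't being reached.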
>>> 
>>> We don't use the nfs server on our systems, but they're (new)nfsclients.
>>> So I don't think our problem is nfs related, unless the default rsize/wsize
>>> for client mounts is not 8K, which I thought it was. Can you confirm this,
>>> Rick?
>>> 
>>> IIRC, disabling TSO did not make any difference in our case.
>>> 
>>> 
>>> Markus
>>> 
>>> 
>> 
>> 
>> 
>> 


