Network stack returning EFBIG?

Thu Mar 20 15:22:39 UTC 2014

On 20.03.2014, at 14:51, wollman at bimajority.org wrote:

> In article <21290.60558.750106.630804 at hergotha.csail.mit.edu>, I wrote:
> 
>> Since we put this server into production, random network system calls
>> have started failing with [EFBIG] or maybe sometimes [EIO].  I've
>> observed this with a simple ping, but various daemons also log the
>> errors:
>> Mar 20 09:22:04 nfs-prod-4 sshd[42487]: fatal: Write failed: File too
>> large [preauth]
>> Mar 20 09:23:44 nfs-prod-4 nrpe[42492]: Error: Could not complete SSL
>> handshake. 5
> 
> I found at least one call stack where this happens and it does get
> returned all the way to userspace:
> 
> 17  15547   _bus_dmamap_load_buffer:return 
>              kernel`_bus_dmamap_load_mbuf_sg+0x5f
>              kernel`bus_dmamap_load_mbuf_sg+0x38
>              kernel`ixgbe_xmit+0xcf
>              kernel`ixgbe_mq_start_locked+0x94
>              kernel`ixgbe_mq_start+0x12a
>              if_lagg.ko`lagg_transmit+0xc4
>              kernel`ether_output_frame+0x33
>              kernel`ether_output+0x4fe
>              kernel`ip_output+0xd74
>              kernel`tcp_output+0xfea
>              kernel`tcp_usr_send+0x325
>              kernel`sosend_generic+0x3f6
>              kernel`soo_write+0x5e
>              kernel`dofilewrite+0x85
>              kernel`kern_writev+0x6c
>              kernel`sys_write+0x64
>              kernel`amd64_syscall+0x5ea
>              kernel`0xffffffff808443c7

This looks pretty similar to what we’ve seen when we got EFBIG:

 3  28502   _bus_dmamap_load_buffer:return 
              kernel`_bus_dmamap_load_mbuf_sg+0x5f
              kernel`bus_dmamap_load_mbuf_sg+0x38
              kernel`ixgbe_xmit+0xcf
              kernel`ixgbe_mq_start_locked+0x94
              kernel`ixgbe_mq_start+0x12a
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`rip_output+0x229
              kernel`sosend_generic+0x3f6
              kernel`kern_sendit+0x1a3
              kernel`sendit+0xdc
              kernel`sys_sendto+0x4d
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff80d35667

In our case it looks like some of the ixgbe tx queues get stuck while others don’t. You can test whether your server shows the same symptoms with this command:

# for CPU in {0..7}; do echo "CPU${CPU}"; cpuset -l ${CPU} ping -i 0.5 -c 2 -W 1 10.0.0.1 | grep sendto; done
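If you would rather get a per-CPU verdict than raw ping output, something along these lines might work. It is only a sketch: the function names probe_cpu and check_cpus are made up for illustration, and it assumes ping reports the failure as a line containing "sendto" on stderr (hence the 2>&1). Pinning ping to each CPU in turn presumably sends each attempt out through a different tx queue, which is why some CPUs may fail and others not.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any FreeBSD tool.

# Run ping pinned to one CPU ($1) against a target ($2).
# The ping flags are the same as in the one-liner above.
# stderr is merged into stdout so the sendto error can be matched.
probe_cpu() {
    cpuset -l "$1" ping -i 0.5 -c 2 -W 1 "$2" 2>&1
}

# check_cpus TARGET CPU...
# Prints one line per CPU: "sendto error" if the probe output
# mentions sendto, "ok" otherwise.
check_cpus() {
    target=$1
    shift
    for cpu in "$@"; do
        if probe_cpu "$cpu" "$target" | grep -q 'sendto'; then
            echo "CPU${cpu}: sendto error"
        else
            echo "CPU${cpu}: ok"
        fi
    done
}
```

Run it as root, for example with: check_cpus 10.0.0.1 0 1 2 3 4 5 6 7. CPUs that report "sendto error" would be the ones whose queue is affected, under the assumptions above.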

We also use 82599EB-based ixgbe controllers on the affected systems.

Also see these two threads on freebsd-net:

http://lists.freebsd.org/pipermail/freebsd-net/2014-February/037967.html
http://lists.freebsd.org/pipermail/freebsd-net/2014-March/038061.html

I started the second thread; it has some more details of what we were seeing, in case you’re interested.

There are also these two bug reports:

http://www.freebsd.org/cgi/query-pr.cgi?pr=183390
and:
https://bugs.freenas.org/issues/4560

Markus