9.2 ixgbe tx queue hang

Thu Mar 20 10:41:00 UTC 2014

On 19.03.2014, at 20:17, Christopher Forgeron <csforgeron at gmail.com> wrote:

> Hello,
> 
> 
> 
> I can report this problem as well on 10.0-RELEASE.
> 
> 
> 
> I think it's the same as kern/183390?

Possible. We still see this on NFS clients only, but I’m not convinced that NFS is the only trigger.

> I have two physically identical machines, one running 9.2-STABLE, and one
> on 10.0-RELEASE.
> 
> 
> 
> My 10.0 machine used to be running 9.0-STABLE for over a year without any
> problems.
> 
> 
> 
> I'm not having the problems with 9.2-STABLE as far as I can tell, but it
> does seem to be a load-based issue more than anything. Since my 9.2 system
> is in production, I'm unable to load it to see if the problem exists there.
> I have a ping_logger.py running on it now to see if it's experiencing
> problems briefly or not.

In our case, when it happens, the problem persists for quite some time (minutes or hours) unless we intervene (ifconfig or reboot).

> I am able to reproduce it fairly reliably within 15 min of a reboot by
> loading the server via NFS with iometer and some large NFS file copies at
> the same time. I seem to need to sustain ~2 Gbps for a few minutes.

That’s probably why we can’t reproduce it reliably here. Although our blade servers have 10gig cards, the affected ones are connected to a 1gig switch.

> It will happen with just ix0 (no lagg) or with lagg enabled across ix0 and
> ix1.

Same here.

> I've been load-testing new FreeBSD-10.0-RELEASE SAN's for production use
> here, so I'm quite willing to put time into this to help find out where
> it's coming from.  It took me a day to track down my iometer issues as
> being network related, and another day to isolate and write scripts to
> reproduce.
> 
> 
> 
> The symptom I notice is:
> 
> - A running flood ping (ping -f 172.16.0.31) to the same hardware
>   (running 9.2) will come back with "ping: sendto: File too large" when the
>   problem occurs
> 
> - Network connectivity is very spotty during these incidents
> 
> - It can run with sporadic ping errors, or it can run a straight
>   set of errors for minutes at a time
> 
> - After a long run of ping errors, ESXi will show a disconnect
>   from the hosted NFS stores on this machine.
> 
> - I've yet to see it happen right after boot. Fastest is around 5
>   min, normally it's within 15 min.

Can you try this when the problem occurs?

for CPU in {0..7}; do echo "CPU${CPU}"; cpuset -l ${CPU} ping -i 0.2 -c 2 -W 1 10.0.0.1 | grep sendto; done

This pins ping to one CPU at a time, which exercises the different tx queues of your ix interface. If the pings reliably fail only on some queues, then your problem is more likely to be the same as ours.
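
The one-liner above relies on {0..7} brace expansion, which plain /bin/sh does not provide. A minimal sketch of the same test as a POSIX sh function follows, in case it is useful. It prints a per-CPU count of sendto errors, so failing queues are easier to spot. The helper name probe_queues and its two arguments (target address, number of CPUs) are made up for illustration; the ping and cpuset flags are the ones from the one-liner.

```shell
#!/bin/sh
# Sketch only: count "sendto" errors per CPU while ping is pinned with cpuset(1).
# probe_queues is a hypothetical helper name, not an existing tool.
# $1 = ping target, $2 = number of CPUs to test (CPU ids 0 .. $2-1).
probe_queues() {
    target=$1
    ncpu=$2
    cpu=0
    while [ "$cpu" -lt "$ncpu" ]; do
        # ping prints sendto failures on stderr, so merge stderr into stdout
        # before counting. grep -c exits non-zero on a count of 0, hence "|| true".
        errs=$(cpuset -l "$cpu" ping -i 0.2 -c 2 -W 1 "$target" 2>&1 | grep -c sendto || true)
        echo "CPU${cpu}: ${errs} sendto error(s)"
        cpu=$((cpu + 1))
    done
}
```

For example, `probe_queues 10.0.0.1 8` covers CPUs 0 to 7 as in the one-liner. It has to run as root, because ping only accepts intervals below one second from the superuser.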

Also, if you have dtrace available:

kldload dtraceall
dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / { stack(); }'

Run this while you ping over the affected interface. The stack traces should give you hints about where the EFBIG error comes from.

> […]

Markus