Should syncache.count ever be negative?

Sat Nov 10 00:13:40 PST 2007

On Fri, 9 Nov 2007, Matt Reimer wrote:

> Ok, I've run netperf in both directions. The box I've been targeting
> is 66.230.193.105 aka wordpress1.

Ok, at least that looks good.

> The machine is a Dell 1950 with 8 x 1.6GHz Xeon 5310s, 8G RAM, and this NIC:

Nice.

> I first noticed this problem running ab; then to simplify I used
> netrate/http[d]. What's strange is that it seems fine over the local
> network (~15800 requests/sec), but it slowed down dramatically (~150
> req/sec) when tested from another network 20 ms away. Running systat
> -tcp and nload I saw that there was an almost complete stall with only
> a handful of packets being sent (probably my ssh packets) for a few
> seconds or sometimes even up to 60 seconds or so.

I think most benchmarking tools end up stalling if all of their threads
stall; that may be why the rate falls off once the misbehavior you
describe below begins.

> Nov  9 19:02:34 wordpress1 kernel: TCP: [207.210.67.2]:64851 to
> [66.230.193.105]:80; syncache_socket: Socket create failed due to
> limits or memory shortage
> Nov  9 19:02:34 wordpress1 kernel: TCP: [207.210.67.2]:64851 to
> [66.230.193.105]:80 tcpflags 0x10<ACK>; tcp_input: Listen socket:
> Socket allocation failed due to limits or memory shortage, sending RST

Turns out you'll generally get both of those error messages together, from 
my reading of the code.

Since you eliminated memory shortage in the socket zone, the next thing to
check is the length of the listen queues.  If the listen queue is backing
up because the application isn't accepting fast enough, the errors above
are what you would expect to see.  "netstat -Lan" should show you what's
going on there.  Upping the specified listen queue length in your
webserver _may_ be all that is necessary.  Try fiddling with that and
watching how full the queues get during testing.
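
A rough sketch of how that watching could be scripted, assuming the
usual "netstat -Lan" layout with a qlen/incqlen/maxqlen column.  The
function name, the 0.8 threshold, and the sample text are made up for
illustration; the sample is not output captured from wordpress1.

```python
import re

# Matches lines such as "tcp4  5/0/128  *.80" (qlen/incqlen/maxqlen).
LINE_RE = re.compile(r"^(\S+)\s+(\d+)/(\d+)/(\d+)\s+(\S+)")

def busy_listen_queues(netstat_output, threshold=0.8):
    """Return (local_address, qlen, maxqlen) for queues at or above threshold."""
    busy = []
    for line in netstat_output.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # header or unrelated line
        qlen, maxqlen = int(m.group(2)), int(m.group(4))
        if maxqlen > 0 and qlen >= threshold * maxqlen:
            busy.append((m.group(5), qlen, maxqlen))
    return busy

# Illustrative input only (hypothetical values).
SAMPLE = """\
Current listen queue sizes (qlen/incqlen/maxqlen)
Proto Listen         Local Address
tcp4  128/0/128      *.80
tcp4  0/0/10         *.22
"""

if __name__ == "__main__":
    for addr, qlen, maxqlen in busy_listen_queues(SAMPLE):
        print(f"{addr}: {qlen}/{maxqlen}")
```

Feeding it real "netstat -Lan" output every second or so during a test
run would show whether the port 80 queue is pinned at its maximum.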

The fact that you see the same port repeatedly may indicate that the 
syncache isn't destroying the syncache entries when you get the socket 
creation failure.  Take a look at "netstat -n" and look for SYN_RECEIVED 
entries - if they're sticking around for more than a few seconds, this is 
probably what's happening.  (This entire paragraph is speculation, but 
worth investigating.)
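
One way to check for entries that stick around is to compare two
"netstat -n" snapshots taken a few seconds apart.  The sketch below
assumes the state is the last column and accepts either SYN_RCVD or
SYN_RECEIVED as the label; the helper names and the sample snapshots
(using documentation addresses) are hypothetical.

```python
SYN_STATES = {"SYN_RCVD", "SYN_RECEIVED"}

def half_open(netstat_output):
    """Return the set of (local, foreign) pairs in a SYN-received state."""
    pairs = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        if (len(fields) >= 6 and fields[0].startswith("tcp")
                and fields[-1] in SYN_STATES):
            pairs.add((fields[3], fields[4]))
    return pairs

def lingering(earlier, later):
    """Pairs that are half-open in both snapshots."""
    return half_open(earlier) & half_open(later)

# Illustrative snapshots only, not captured from a real machine.
EARLIER = """\
Active Internet connections
Proto Recv-Q Send-Q Local Address          Foreign Address        (state)
tcp4       0      0 192.0.2.10.80          198.51.100.7.64851     SYN_RCVD
tcp4       0      0 192.0.2.10.80          198.51.100.7.64852     SYN_RCVD
tcp4       0      0 192.0.2.10.22          198.51.100.9.50000     ESTABLISHED
"""
LATER = """\
Active Internet connections
Proto Recv-Q Send-Q Local Address          Foreign Address        (state)
tcp4       0      0 192.0.2.10.80          198.51.100.7.64851     SYN_RCVD
tcp4       0      0 192.0.2.10.22          198.51.100.9.50000     ESTABLISHED
"""

if __name__ == "__main__":
    for local, foreign in sorted(lingering(EARLIER, LATER)):
        print(f"still half-open: {foreign} -> {local}")
```

Any pair that shows up in both snapshots has been sitting half-open for
at least the interval between them.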

> I don't know if it's relevant, but accf_http is loaded on wordpress1.

That may be relevant - accept filtering changes how the listen queues
are used.  Try running without the accept filter for now.

> We have seen similar behavior (TCP slowdowns) on a different machines
> (4 x Xeon 5160) with a different NIC (em0) running RELENG_7, though I
> haven't diagnosed it to this level of detail. All our RELENG_6 and
> RELENG_4 machines seem fine.

em is the driver that I was having issues with when it shared an 
interrupt... :)

FWIW, my crazy theory of the moment is this:  We have some bug that 
happens when the listen queues overflow in 7.0, and your test is strenuous 
enough to hit the listen queue overflow condition, leading to total 
collapse.  I'll have to cobble together a test program to see what happens 
in the listen queue overflow case.
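
Such a test program might look roughly like the sketch below: a
listener with a tiny backlog that never calls accept(), and a client
loop that counts how many connects succeed.  This is only an
illustration of the idea, not the program referred to above; the
function name and defaults are made up, and what the kernel does with
the excess connections (drop, reset, or accept anyway) varies by OS and
version, which is exactly what the probe is meant to reveal.

```python
import socket

def overflow_probe(backlog=1, attempts=8, timeout=0.2):
    """Connect to a listener that never accepts; return (connected, failed)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(backlog)
    addr = listener.getsockname()
    clients = []
    connected = failed = 0
    try:
        for _ in range(attempts):
            c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            c.settimeout(timeout)
            clients.append(c)  # keep sockets open so the queue stays full
            try:
                c.connect(addr)
                connected += 1
            except OSError:  # timeout, reset, or refusal
                failed += 1
    finally:
        for c in clients:
            c.close()
        listener.close()
    return connected, failed

if __name__ == "__main__":
    ok, bad = overflow_probe()
    print(f"connected={ok} failed={bad}")
```

Running it while watching "netstat -Lan" and the kernel log on the
target would show whether the overflow path behaves differently on 7.0
than on 6.x.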

Thanks for the quick feedback,

-Mike