Should syncache.count ever be negative?
Mike Silbersack
silby at silby.com
Sat Nov 10 00:13:40 PST 2007
On Fri, 9 Nov 2007, Matt Reimer wrote:
> Ok, I've run netperf in both directions. The box I've been targeting
> is 66.230.193.105 aka wordpress1.
Ok, at least that looks good.
> The machine is a Dell 1950 with 8 x 1.6GHz Xeon 5310s, 8G RAM, and this NIC:
Nice.
> I first noticed this problem running ab; then to simplify I used
> netrate/http[d]. What's strange is that it seems fine over the local
> network (~15800 requests/sec), but it slowed down dramatically (~150
> req/sec) when tested from another network 20 ms away. Running systat
> -tcp and nload I saw that there was an almost complete stall with only
> a handful of packets being sent (probably my ssh packets) for a few
> seconds or sometimes even up to 60 seconds or so.
I think most benchmarking tools stall if all of their threads stall;
that may be why the rate falls off once the misbehavior you describe
below begins.
> Nov 9 19:02:34 wordpress1 kernel: TCP: [207.210.67.2]:64851 to
> [66.230.193.105]:80; syncache_socket: Socket create failed due to
> limits or memory shortage
> Nov 9 19:02:34 wordpress1 kernel: TCP: [207.210.67.2]:64851 to
> [66.230.193.105]:80 tcpflags 0x10<ACK>; tcp_input: Listen socket:
> Socket allocation failed due to limits or memory shortage, sending RST
Turns out you'll generally get both of those error messages together, from
my reading of the code.
Since you've ruled out a memory shortage in the socket zone, the next
thing to check is the length of the listen queues. If a listen queue is
backing up because the application isn't accepting connections fast
enough, the errors above are what I'd expect to see. "netstat -Lan" should
show you what's going on there.
Upping the specified listen queue length in your webserver _may_ be all
that is necessary. Try fiddling with that and watching how full the
queues get during testing.
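To make the "watch how full the queues get" step concrete, here is a small Python sketch (not from the original thread). It assumes that FreeBSD's "netstat -Lan" reports each listen queue as a qlen/incqlen/maxqlen field; the helper names and the sample value are made up for illustration.

```python
def parse_listen_field(field):
    """Split a 'qlen/incqlen/maxqlen' field into three integers.

    Assumption: this is the layout of the Listen column printed by
    FreeBSD's `netstat -L`; check it against your own output.
    """
    qlen, incqlen, maxqlen = (int(part) for part in field.split("/"))
    return qlen, incqlen, maxqlen


def queue_fill(field):
    """Return the fraction of the listen queue occupied by connections
    that are established but not yet accepted by the application."""
    qlen, _incqlen, maxqlen = parse_listen_field(field)
    return qlen / maxqlen if maxqlen else 0.0


# Illustrative value only, not taken from the machine in this thread.
sample = "96/0/128"
print(parse_listen_field(sample), queue_fill(sample))
```

A fill ratio that stays close to 1.0 during the benchmark would suggest the application is not calling accept() fast enough, or that the configured backlog is too small.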
The fact that you see the same port repeatedly may indicate that the
syncache isn't destroying the syncache entries when you get the socket
creation failure. Take a look at "netstat -n" and look for SYN_RECEIVED
entries - if they're sticking around for more than a few seconds, this is
probably what's happening. (This entire paragraph is speculation, but
worth investigating.)
> I don't know if it's relevant, but accf_http is loaded on wordpress1.
That may be relevant - accept filtering changes how the listen queues
are used. Try going back to non-accept filtering for now.
> We have seen similar behavior (TCP slowdowns) on a different machines
> (4 x Xeon 5160) with a different NIC (em0) running RELENG_7, though I
> haven't diagnosed it to this level of detail. All our RELENG_6 and
> RELENG_4 machines seem fine.
em is the driver that I was having issues with when it shared an
interrupt... :)
FWIW, my crazy theory of the moment is this: We have some bug that
happens when the listen queues overflow in 7.0, and your test is strenuous
enough to hit the listen queue overflow condition, leading to total
collapse. I'll have to cobble together a test program to see what happens
in the listen queue overflow case.
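A minimal sketch of what such a test program might look like, written in Python for brevity; it is not the program from this thread. It opens a listener with a small backlog, never calls accept(), and counts how many client connects complete. The function name, the default values, and the use of the loopback address are all assumptions chosen for illustration.

```python
import socket


def overflow_listen_queue(backlog=1, attempts=20, timeout=0.5):
    """Open a listener that never accepts, then try `attempts` connects.

    Returns (connected, failed). How many connects succeed before the
    queue overflows depends on the operating system.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(backlog)
    addr = listener.getsockname()

    clients, connected, failed = [], 0, 0
    try:
        for _ in range(attempts):
            client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            client.settimeout(timeout)
            clients.append(client)  # keep it open so the queue stays full
            try:
                client.connect(addr)
                connected += 1
            except OSError:  # covers timeouts and connection refusals
                failed += 1
    finally:
        for client in clients:
            client.close()
        listener.close()
    return connected, failed
```

Calling overflow_listen_queue(backlog=1, attempts=20) and comparing the two counts across releases, while watching the kernel log and "netstat -Lan" on the server side, is one way to see how each release behaves once the queue is full. A loopback test will not reproduce the 20 ms network path described above, so point the client half at a remote listener for that.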
Thanks for the quick feedback,
-Mike
More information about the freebsd-net mailing list