Should syncache.count ever be negative?
Matt Reimer
mattjreimer at gmail.com
Sat Nov 10 01:13:20 PST 2007
On Nov 10, 2007 12:13 AM, Mike Silbersack <silby at silby.com> wrote:
>
> On Fri, 9 Nov 2007, Matt Reimer wrote:
>
> > I first noticed this problem running ab; then to simplify I used
> > netrate/http[d]. What's strange is that it seems fine over the local
> > network (~15800 requests/sec), but it slowed down dramatically (~150
> > req/sec) when tested from another network 20 ms away. Running systat
> > -tcp and nload I saw that there was an almost complete stall with only
> > a handful of packets being sent (probably my ssh packets) for a few
> > seconds or sometimes even up to 60 seconds or so.
>
> I think most benchmarking tools end up stalling if all of their threads
> stall; that may be why the rate falls off after the misbehavior you
> describe below begins.
Ok. FWIW, I'm seeing the same behavior with tools/netrate/http as I am with ab.
> > Nov 9 19:02:34 wordpress1 kernel: TCP: [207.210.67.2]:64851 to
> > [66.230.193.105]:80; syncache_socket: Socket create failed due to
> > limits or memory shortage
> > Nov 9 19:02:34 wordpress1 kernel: TCP: [207.210.67.2]:64851 to
> > [66.230.193.105]:80 tcpflags 0x10<ACK>; tcp_input: Listen socket:
> > Socket allocation failed due to limits or memory shortage, sending RST
>
> Turns out you'll generally get both of those error messages together, from
> my reading of the code.
>
> Since you eliminated memory shortage in the socket zone, the next thing to
> check is the length of the listen queues. If the listen queue is backing
> up because the application isn't accepting fast enough, the errors above
> should happen. "netstat -Lan" should show you what's going on there.
> Upping the specified listen queue length in your webserver _may_ be all
> that is necessary. Try fiddling with that and watching how full the
> queues get during testing.
I ran "netstat -Lan" every second while running this test and the
output never changed from the following, whether before or after the
stall:
Current listen queue sizes (qlen/incqlen/maxqlen)
Proto Listen Local Address
tcp4 0/0/128 66.230.193.105.80
tcp4 0/0/10 127.0.0.1.25
tcp4 0/0/128 *.22
tcp4 0/0/128 *.199
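Side note: the 128 in "0/0/128" above is just the backlog the
application passed to listen(2); the kernel clamps anything larger to
the kern.ipc.somaxconn sysctl. A quick sketch for checking the clamp,
untested:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
        int somaxconn;
        size_t len = sizeof(somaxconn);

        if (sysctlbyname("kern.ipc.somaxconn", &somaxconn, &len,
            NULL, 0) == -1) {
                perror("sysctlbyname");
                return (1);
        }
        /* listen(s, 1024) would be silently reduced to this value. */
        printf("effective max backlog: %d\n", somaxconn);
        return (0);
}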
> The fact that you see the same port repeatedly may indicate that the
> syncache isn't destroying the syncache entries when you get the socket
> creation failure. Take a look at "netstat -n" and look for SYN_RECEIVED
> entries - if they're sticking around for more than a few seconds, this is
> probably what's happening. (This entire paragraph is speculation, but
> worth investigating.)
During the stall the sockets are all in TIME_WAIT. More relevant info:
kern.ipc.maxsockets: 12328
kern.ipc.numopensockets: 46
net.inet.ip.portrange.randomtime: 45
net.inet.ip.portrange.randomcps: 10
net.inet.ip.portrange.randomized: 1
net.inet.ip.portrange.reservedlow: 0
net.inet.ip.portrange.reservedhigh: 1023
net.inet.ip.portrange.hilast: 65535
net.inet.ip.portrange.hifirst: 49152
net.inet.ip.portrange.last: 65535
net.inet.ip.portrange.first: 30000
net.inet.ip.portrange.lowlast: 600
net.inet.ip.portrange.lowfirst: 1023
net.inet.tcp.finwait2_timeout: 60000
net.inet.tcp.fast_finwait2_recycle: 0
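Back-of-the-envelope on those numbers: with ~35500 ephemeral ports
(30000-65535) and an assumed 60-second 2*MSL TIME_WAIT hold, a single
client address can only open on the order of 600 fresh connections per
second before it has to start reusing four-tuples that may still be
sitting in TIME_WAIT on the server. Throwaway sketch:

#include <stdio.h>

int
main(void)
{
        int first = 30000, last = 65535; /* net.inet.ip.portrange.{first,last} */
        int hold = 60;                   /* assumed 2*MSL TIME_WAIT hold, seconds */
        int pool = last - first + 1;

        printf("%d ports / %d sec = ~%d conns/sec per client IP\n",
            pool, hold, pool / hold);
        return (0);
}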
[root@wordpress1 /sys/dev]# netstat -m
513/5382/5895 mbufs in use (current/cache/total)
511/3341/3852/25600 mbuf clusters in use (current/cache/total/max)
1/1663 mbuf+clusters out of packet secondary zone in use (current/cache)
0/488/488/0 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/0 9k jumbo clusters in use (current/cache/total/max)
0/0/0/0 16k jumbo clusters in use (current/cache/total/max)
1150K/9979K/11129K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
17 requests for I/O initiated by sendfile
0 calls to protocol drain routines
> > I don't know if it's relevant, but accf_http is loaded on wordpress1.
>
> That may be relevant - accept filtering changes how the listen queues
> are used. Try going back to non-accept filtering for now.
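For anyone following along, the filter can be attached and detached on
a live listening socket with SO_ACCEPTFILTER; roughly (a sketch, not
the exact code our server uses):

#include <sys/types.h>
#include <sys/socket.h>

/* Attach or detach accf_http on an already-listening socket s. */
static int
set_http_filter(int s, int enable)
{
        struct accept_filter_arg afa = { .af_name = "httpready" };

        if (enable)
                return (setsockopt(s, SOL_SOCKET, SO_ACCEPTFILTER,
                    &afa, sizeof(afa)));
        /* A NULL optval removes the filter again. */
        return (setsockopt(s, SOL_SOCKET, SO_ACCEPTFILTER, NULL, 0));
}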
It still stalls. This time I noticed that the tcptw zone shows 0 free
(vmstat -z columns: size, limit, used, free, requests, failures):
socket: 696, 12330, 14, 126, 10749, 0
unpcb: 248, 12330, 5, 70, 75, 0
ipq: 56, 819, 0, 0, 0, 0
udpcb: 280, 12334, 2, 40, 184, 0
inpcb: 280, 12334, 2485, 105, 10489, 0
tcpcb: 688, 12330, 7, 73, 10489, 0
tcptw: 88, 2478, 2478, 0, 2478, 7231
syncache: 112, 15378, 1, 65, 9713, 0
hostcache: 136, 15372, 0, 0, 0, 0
tcpreass: 40, 1680, 0, 0, 0, 0
sackhole: 32, 0, 0, 0, 0, 0
But even while tcptw shows 0 free, I can still blast 15800 req/s from
another RELENG_7 box to this one during the stall. So I don't know if
that means anything.
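If the tcptw zone cap itself turns out to be the problem, it is
tunable: net.inet.tcp.maxtcptw is the limit behind that 2478. A hedged
sketch of raising it from C, equivalent to
"sysctl net.inet.tcp.maxtcptw=16384" (needs root; 16384 is an
arbitrary pick):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
        int oldval, newval = 16384;     /* arbitrary larger cap */
        size_t len = sizeof(oldval);

        if (sysctlbyname("net.inet.tcp.maxtcptw", &oldval, &len,
            &newval, sizeof(newval)) == -1) {
                perror("net.inet.tcp.maxtcptw");
                return (1);
        }
        printf("maxtcptw: %d -> %d\n", oldval, newval);
        return (0);
}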
> FWIW, my crazy theory of the moment is this: We have some bug that
> happens when the listen queues overflow in 7.0, and your test is strenuous
> enough to hit the listen queue overflow condition, leading to total
> collapse. I'll have to cobble together a test program to see what happens
> in the listen queue overflow case.
When I use ab I'm telling it to use a max of 100 simultaneous
connections (ab -c 100 -n 50000 http://66.230.193.105/). Wouldn't that
be well under the limit?
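Re cobbling together a test program: here's roughly what I'd expect it
to look like - a listener with a tiny backlog that never calls
accept(), so a client flood overflows the queue almost immediately
(sketch only, untested):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>
#include <err.h>

int
main(void)
{
        struct sockaddr_in sin;
        int s;

        if ((s = socket(PF_INET, SOCK_STREAM, 0)) == -1)
                err(1, "socket");
        memset(&sin, 0, sizeof(sin));
        sin.sin_len = sizeof(sin);
        sin.sin_family = AF_INET;
        sin.sin_port = htons(8080);
        if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
                err(1, "bind");
        if (listen(s, 5) == -1)         /* tiny backlog, easy to overflow */
                err(1, "listen");
        for (;;)
                sleep(60);              /* never accept(); watch netstat -Lan */
}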
> Thanks for the quick feedback,
Thank *you*.
Matt