TCP can advertise a really huge window
John Baldwin
jhb at freebsd.org
Tue Feb 8 17:01:59 UTC 2011
This is a very bizarre edge case, so bear with me. I was debugging an edge
case at work recently that occurred when the socket buffer was filled up
exactly (i.e. sbspace(&so->so_rcv) == 0). In TCP terms, this would be when
rcv_nxt == rcv_adv. To simulate the real workload I had a very fast writer
blasting over lo0 to a slow reader but had used a small buffer and turned off
window scaling (as I had to fill the entire socket buffer, I was chasing an
off-by-1 bug). However, I ended up with some bizarre behavior.
I think it is less confusing to describe the sequence of events that I now
know happened than how I figured this out, so here goes.
- Assume we have advertised a window size of N which corresponds exactly to
sbspace(&so->so_rcv).
- The remote peer sends a packet of length N filling our window. We respond
with a zero-window ACK. This advances rcv_nxt to == rcv_adv, but it does
not grow rcv_adv because sbspace() is currently 0.
- The userland app very slowly drains data from the socket buffer. However,
the calls to tcp_usr_recvd() do not trigger a window update because in this
case the link is over lo0 which has a relatively large t_maxseg (about 14k)
and this condition in tcp_output() is not met:
if (adv >= (long) (2 * tp->t_maxseg))
goto send;
if (2 * adv >= (long) so->so_rcv.sb_hiwat)
goto send;
- A timer at the remote peer expires and it sends a window probe with one
byte of data. Since userland has read some data (just not 2 * MSS), we
accept this packet. However, receiving this packet moves rcv_nxt += 1,
so rcv_nxt is now > rcv_adv.
- We call tcp_output() to ACK the window probe and as part of this calculate
the receive window to advertise here:
if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
recwin < (long)tp->t_maxseg)
recwin = 0;
if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
recwin = (long)(tp->rcv_adv - tp->rcv_nxt);
if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
recwin = (long)TCP_MAXWIN << tp->rcv_scale;
The "surprise" kicks in on the second conditional. The problem is that
rcv_adv - rcv_nxt is now equal to (uint32_t)-1. On a 32-bit machine the cast
to (long) effectively just makes this value signed and thus -1. On a 64-bit
machine you actually end up with a ginormous value of 2^32 - 1, or a 4GB
window (minus a byte). The third conditional truncates that to the maximum
window we can advertise, but this value may be larger than the actual space in
the socket buffer. The remote peer now has a huge window to throw data into.
At work this proved disastrous. I'm not sure if there are any practical
concerns. This is the patch I'm using as a fix:
Index: tcp_output.c
===================================================================
--- tcp_output.c (revision 215582)
+++ tcp_output.c (working copy)
@@ -928,7 +928,8 @@
if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
recwin < (long)tp->t_maxseg)
recwin = 0;
- if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
+ if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) &&
+ recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
recwin = (long)(tp->rcv_adv - tp->rcv_nxt);
if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
recwin = (long)TCP_MAXWIN << tp->rcv_scale;
--
John Baldwin
More information about the freebsd-net
mailing list