Long-haul problems - connections stuck in slow start
Peter Wemm
peter at wemm.org
Sat Jan 4 18:38:57 UTC 2014
We're seeing some unfortunate misbehavior with TCP over an intercontinental
link.
e.g., fetching a 30GB http file from various package mirrors to a remote host:
us-west(ISC) -> london(BME)
bd93e71c-cae4-44fd-943c-d1a88dbf6c6d.tar 0% of 29 GB 961 kBps 09h03m^C
us-east(NYI) -> london(BME)
bd93e71c-cae4-44fd-943c-d1a88dbf6c6d.tar 0% of 29 GB 1070 kBps 08h08m^C
us-west(YSV) -> london(BME)
bd93e71c-cae4-44fd-943c-d1a88dbf6c6d.tar 0% of 29 GB 14 kBps 590h22m^C
Spot the one we're concerned about...
Ping times for the three (in order):
round-trip min/avg/max/std-dev = 144.330/144.532/144.797/0.157 ms
round-trip min/avg/max/std-dev = 79.650/79.965/80.488/0.287 ms
round-trip min/avg/max/std-dev = 148.588/153.292/155.688/2.903 ms
The problem pair is worth looking at in a little more detail:
16 bytes from ..:206a::1001:10, icmp_seq=4 hlim=55 time=148.588 ms
16 bytes from ..:206a::1001:10, icmp_seq=5 hlim=55 time=155.140 ms
16 bytes from ..:206a::1001:10, icmp_seq=6 hlim=55 time=149.443 ms
16 bytes from ..:206a::1001:10, icmp_seq=7 hlim=55 time=155.688 ms
16 bytes from ..:206a::1001:10, icmp_seq=8 hlim=55 time=148.630 ms
16 bytes from ..:206a::1001:10, icmp_seq=9 hlim=55 time=155.486 ms
It appears that there are two packet paths between the endpoints, one at
~148ms and one at ~155ms. I've taken some longer samples and the two
clusters are fairly consistent.
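Out of curiosity, I fed those two clusters through the RFC 6298 estimator
by hand. This is only a back-of-envelope sketch in floating point (the
kernel keeps srtt/rttvar in scaled fixed-point), but it suggests the
oscillation by itself leaves plenty of RTO headroom:

#include <stdio.h>

/*
 * Back-of-envelope RFC 6298 srtt/rttvar estimator, fed with the two
 * observed RTT clusters.  Floating point for clarity; the kernel uses
 * scaled fixed-point, so treat the output as approximate.
 */
int main(void)
{
    const double sample[2] = { 148.6, 155.4 };  /* ms, from the pings */
    double srtt = 0.0, rttvar = 0.0;
    int i;

    for (i = 0; i < 1000; i++) {
        double r = sample[i & 1];

        if (i == 0) {                           /* first measurement */
            srtt = r;
            rttvar = r / 2.0;
        } else {
            double err = (srtt > r) ? srtt - r : r - srtt;

            rttvar = 0.75 * rttvar + 0.25 * err;
            srtt = 0.875 * srtt + 0.125 * r;
        }
    }
    /*
     * RFC 6298 clamps RTO to >= 1s; FreeBSD's floor is
     * net.inet.tcp.rexmit_min instead (30ms here, per t_rttmin below).
     */
    printf("srtt=%.1fms rttvar=%.1fms rto=%.1fms\n",
        srtt, rttvar, srtt + 4.0 * rttvar);
    return (0);
}

That settles at srtt ~152ms, rttvar ~3.6ms, i.e. an RTO of ~167ms,
comfortably above the slower path, so the bimodal RTT alone shouldn't be
triggering spurious retransmits unless the fixed-point rounding does
something this sketch doesn't capture.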
All four machines talk to each other.
Here's where it gets interesting. On the sender at us-west(YSV), I see this:
net.inet.tcp.hostcache.list:
IP address    SSTRESH    RTT  RTTVAR   CWND      HITS
us-west(ISC)    59521    5ms     1ms  16845  15055031
eu-west(BME)     7343  150ms     2ms  13501   3433775
us-east(NYI)   530489  100ms    37ms  16681  43043786
The ssthresh is very low for the problematic ysv<->bme pair.
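For anyone who hasn't been in this code: when a new connection is set up,
tcp_mss() in tcp_input.c pulls the cached metrics back out of the
hostcache and seeds the new tcpcb from them. From memory it's roughly
this shape (paraphrased; names approximate, error handling elided):

    struct hc_metrics_lite metrics;

    tcp_hc_get(&inc, &metrics);  /* look the peer up in the hostcache */

    if (metrics.rmx_ssthresh != 0) {
        /*
         * A previous connection to this peer hit congestion around
         * this window, so start the new one cautiously, but never
         * below 2*mss.  With the 7343 entry above, every fresh
         * ysv<->bme connection starts life with an ssthresh of
         * barely five segments.
         */
        tp->snd_ssthresh = max(2 * mss, metrics.rmx_ssthresh);
    }
    /* rmx_rtt/rmx_rttvar seed t_srtt/t_rttvar the same way. */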
When I do a tcpdump, I see the sender fire off 7343 bytes of data, then
stop and wait for acks. It's completely ignoring the receiver's
advertised window. It appears to be stuck in slow start.
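To spell out why a ~7KB ssthresh hurts so much: the per-ACK window growth
is roughly the RFC 5681 shape (simplified; the real logic lives in the
congestion control module and honors ABC):

    if (tp->snd_cwnd < tp->snd_ssthresh) {
        /* slow start: ~one MSS per ACKed MSS (exponential) */
        tp->snd_cwnd += mss;
    } else {
        /* congestion avoidance: ~one MSS per RTT (linear) */
        tp->snd_cwnd += mss * mss / tp->snd_cwnd;
    }

So the exponential phase is over after about five segments, the window
then creeps up by ~1 MSS per 150ms round trip, and every loss halves it
again. The instantaneous ceiling is snd_cwnd/RTT, e.g. 7343 bytes /
0.150s ~ 48kB/sec, which is the right order of magnitude for the 14kBps
we're seeing once you factor in the retransmits below.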
Some other data:
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp6 0 1047852 2001:19:2.443 2001:41c8:.24490 ESTABLISHED
(netstat -x; unwrapped and split into chunks here for readability)
Proto  Recv-Q  Send-Q   Local Address         Foreign Address
tcp6   0       1048152  2001:1900:2254:2.443  2001:41c8:112:83.24490

R-MBUF  S-MBUF  R-CLUS  S-CLUS  R-HIWA  S-HIWA   R-LOWA  S-LOWA
0       374     0       373     65688   1049580  1       2048

R-BCNT  S-BCNT   R-BMAX  S-BMAX
0       1420800  525504  8396640

rexmt  persist  keep     2msl  delack  rcvtime
0.43   0.00     7199.93  0.00  0.00    0.06
The "interesting" parts of -x:
rexmt persist keep 2msl delack rcvtime
0.43 0.00 7199.93 0.00 0.00 0.06
And netstat -T:
Proto Rexmit OOORcv 0-win Local Address Foreign Address
tcp6 54161 0 0 2001:192.443 2001:41:83.24490
Note the retransmits(!)
Some tcpcb fields that caught my eye for the connection:
snd_wnd = 1048576,
snd_cwnd = 5712,
t_srtt = 6391,
t_rttvar = 903,
t_rxtshift = 0,
t_rttmin = 30,
t_rttbest = 4903,
t_rttupdated = 220095,
max_sndwnd = 1048576,
snd_cwnd_prev = 4284,
snd_ssthresh_prev = 2856,
snd_recover_prev = 1397053524,
t_sndzerowin = 0,
t_badrxtwin = 584273259,
snd_limited = 0 '\0',
t_rttlow = 150,
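(For anyone decoding those: t_srtt and t_rttvar are fixed-point. Assuming
the stock TCP_RTT_SHIFT=5 / TCP_RTTVAR_SHIFT=4 scaling and 1ms ticks --
which I believe is right, but double-check against your tree -- they come
out as:)

#include <stdio.h>

/*
 * Decode the scaled tcpcb RTT fields.  Assumes stock TCP_RTT_SHIFT=5 /
 * TCP_RTTVAR_SHIFT=4 scaling and hz=1000 (1ms ticks); verify against
 * the kernel config before trusting the numbers.
 */
int main(void)
{
    int t_srtt = 6391, t_rttvar = 903;  /* from the tcpcb above */

    printf("srtt   ~= %d ms\n", t_srtt >> 5);   /* ~199 ms */
    printf("rttvar ~= %d ms\n", t_rttvar >> 4); /* ~56 ms */
    return (0);
}

If that scaling is right, the stack believes srtt is ~200ms with ~56ms of
variance, noticeably worse than what ping sees on the same path (and
t_rttlow = 150 matching the ping floor supports the 1ms-tick assumption).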
I've stored some dumps of the tcpcb at
http://people.freebsd.org/~peter/tcpcb.txt
Note that some of the snapshots in the tcpcb.txt file also have
snd_limited = 2 '\002',
Over the last few days I've tried things like turning off SACK, TSO, and
the various RFC knobs, etc.; none of it helped. I believe they're all
back to their defaults now.
There's a small ~15-second tcpdump sample of the sender side and the
receiver side at: http://people.freebsd.org/~peter/send.cap.gz and
http://people.freebsd.org/~peter/recv.cap.gz
Both ends were ntp synced. The dumps contain no sensitive data.
For amusement, I just tried this, with roughly 1 second in between:
peter at bme:~ % scp pkg-ysv:k.gz /tmp
k.gz 100% 25MB 5.0MB/s 00:05
peter at bme:~ % scp pkg-ysv:k.gz /tmp
k.gz 0% 960KB 20.3KB/s 41:29 ETA^C
There was no pre-existing hostcache state between those two endpoints for
the first run. At the end, this was created in the hostcache:
IP address  SSTRESH    RTT  RTTVAR  BANDWIDTH  CWND
213.138..      5952  165ms    21ms          0  8688
All connections went slow after that. Note that the ssh test was over
IPv4 - the rest above is over IPv6. However, we're seeing the same weird
behavior with http over IPv4 between the same two endpoints as well.
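Which fits a theory: when a connection closes, tcp_close() pushes its
final estimates back into the hostcache via tcp_hc_update(), so a single
loss-battered run poisons the cache for every connection that follows.
Again paraphrasing from memory (field names approximate):

    struct hc_metrics_lite metrics = { 0 };  /* zero = "no update", IIRC */

    metrics.rmx_rtt = tp->t_srtt;
    metrics.rmx_rttvar = tp->t_rttvar;
    metrics.rmx_ssthresh = tp->snd_ssthresh; /* halved on every loss event */
    metrics.rmx_cwnd = tp->snd_cwnd;
    tcp_hc_update(&inc, &metrics);  /* smoothed into the existing entry */

That would explain why run #2 starts out crippled even though run #1
averaged 5.0MB/s: whatever losses the first run took along the way left a
tiny ssthresh behind when it closed.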
It was pointed out to me that this has come up before, e.g. misc/173859.
I know we've seen this at work as well.
A few days earlier we were pushing ~45MB/sec (bytes, not bits) between
these endpoints. Out of the blue it crashed to ~10KB/sec. Why can't it
get out of slow start? Is it even stuck in slow start like I think? Is
the 148-155ms bimodal RTT the problem?
Any insight would be greatly appreciated. (please don't drop me from cc:)
--
Peter Wemm - peter at wemm.org; peter at FreeBSD.org; peter at yahoo-inc.com; KI6FJV
UTF-8: for when a ' just won’t do.