Long-haul problems - connections stuck in slow start

Sun Jan 5 06:17:53 UTC 2014

On 1/4/14, 7:38 PM, Peter Wemm wrote:
> We're seeing some unfortunate misbehavior with tcp over an intercontinental
> link.
>
> eg: fetching a 30GB http file from various package mirrors by a remote:
> us-west(ISC) -> london(BME)
> bd93e71c-cae4-44fd-943c-d1a88dbf6c6d.tar  0% of   29 GB  961 kBps 09h03m^C
> us-east(NYI) -> london(BME)
> bd93e71c-cae4-44fd-943c-d1a88dbf6c6d.tar  0% of   29 GB 1070 kBps 08h08m^C
> us-west(YSV) -> london(BME)
> bd93e71c-cae4-44fd-943c-d1a88dbf6c6d.tar  0% of   29 GB   14 kBps 590h22m^C
>
> Spot the one we're concerned about...
>
> Ping times for the three (in order):
> round-trip min/avg/max/std-dev = 144.330/144.532/144.797/0.157 ms
> round-trip min/avg/max/std-dev = 79.650/79.965/80.488/0.287 ms
> round-trip min/avg/max/std-dev = 148.588/153.292/155.688/2.903 ms
>
> The problem pair is worth showing some detail on:
> 16 bytes from ..:206a::1001:10, icmp_seq=4 hlim=55 time=148.588 ms
> 16 bytes from ..:206a::1001:10, icmp_seq=5 hlim=55 time=155.140 ms
> 16 bytes from ..:206a::1001:10, icmp_seq=6 hlim=55 time=149.443 ms
> 16 bytes from ..:206a::1001:10, icmp_seq=7 hlim=55 time=155.688 ms
> 16 bytes from ..:206a::1001:10, icmp_seq=8 hlim=55 time=148.630 ms
> 16 bytes from ..:206a::1001:10, icmp_seq=9 hlim=55 time=155.486 ms
> It appears that there are two packet paths between the endpoints that have
> either ~148ms or ~155ms.  I've done some longer samples and they're fairly
> consistent clusters.
>
> All four machines talk to each other.
>
> Here's where it gets interesting.  On the sender at us-west(YSV), I see this:
> net.inet.tcp.hostcache.list:
> IP address    SSTRESH    RTT RTTVAR     CWND HITS
> us-west(ISC)    59521    5ms    1ms    16845 15055031
> eu-west(BME)     7343  150ms    2ms    13501 3433775
> us-east(NYI)   530489  100ms   37ms    16681 43043786
>
> The ssthresh is very low for the problematic ysv<->bme pair.
>
> When I do a tcpdump, I see the sender fire off 7343 bytes of data, then stop
> and wait for acks.  It's completely ignoring the receiver's window state.
> It appears stuck in slowstart mode.
>
> Some other data:
> Proto Recv-Q Send-Q Local Address  Foreign Address        (state)
> tcp6       0 1047852 2001:19:2.443   2001:41c8:.24490 ESTABLISHED
>
> (netstat -x, sorry about the wrap)
> Proto Recv-Q Send-Q Local Address          Foreign Address        R-MBUF
> S-MBUF R-CLUS S-CLUS R-HIWA S-HIWA R-LOWA S-LOWA R-BCNT S-BCNT R-BMAX S-BMAX
>    rexmt persist    keep    2msl  delack rcvtime
> tcp6       0 1048152 2001:1900:2254:2.443   2001:41c8:112:83.24490      0
>   374      0    373  65688 1049580      1   2048      0 1420800 525504
> 8396640    0.43    0.00 7199.93    0.00    0.00    0.06
>
> The "interesting" parts of -x:
> rexmt persist    keep    2msl  delack rcvtime
>   0.43    0.00 7199.93    0.00    0.00    0.06
>
> -T
> Proto Rexmit OOORcv 0-win  Local Address   Foreign Address
> tcp6   54161      0      0 2001:192.443   2001:41:83.24490
> note retransmits(!)
>
> Some tcpcb fields that caught my eye for the connection:
>    snd_wnd = 1048576,
>    snd_cwnd = 5712,
>    t_srtt = 6391,
>    t_rttvar = 903,
>    t_rxtshift = 0,
>    t_rttmin = 30,
>    t_rttbest = 4903,
>    t_rttupdated = 220095,
>    max_sndwnd = 1048576,
>    snd_cwnd_prev = 4284,
>    snd_ssthresh_prev = 2856,
>    snd_recover_prev = 1397053524,
>    t_sndzerowin = 0,
>    t_badrxtwin = 584273259,
>    snd_limited = 0 '\0',
>    t_rttlow = 150,
> I've stored some dumps of the tcpcb at
>    http://people.freebsd.org/~peter/tcpcb.txt
> Note that some in the tcpcb.txt file also have
>    snd_limited = 2 '\002',
>
> Over the last few days I've tried things like turning off sack, tso, the
> various rfc knobs etc.  I believe they're all back to normal now.
>
> There's small ~15 second tcpdump sample of the sender side and the receiver
> side at: http://people.freebsd.org/~peter/send.cap.gz and
> http://people.freebsd.org/~peter/recv.cap.gz
> Both ends were ntp synced.  The dumps have no sensitive data.
>
> For amusement, I just tried this, with roughly 1 second in between:
> peter at bme:~ %	scp pkg-ysv:k.gz /tmp
> k.gz              100%   25MB   5.0MB/s   00:05
> peter at bme:~ %	scp pkg-ysv:k.gz /tmp
> k.gz                0%  960KB  20.3KB/s   41:29 ETA^C
>
> There was no pre-existing hostcache state between those two endpoints for
> the first run.  At the end, this was created in the hostcache:
> IP address   SSTRESH   RTT  RTTVAR BANDWIDTH     CWND
> 213.138..       5952 165ms    21ms         0     8688
> All connections went slow after that.  Note that the ssh test was over ipv4
> - the rest above is on ipv6.  However, we're seeing the same weird stuff
> with http over ipv4 as well between the same two endpoints.
>
> It was pointed out to me that this has come up before, eg: misc/173859
> I know we've seen this at work as well.
>
> A few days earlier we were pushing ~45MB/sec (bytes, not bits) between these
> endpoints. Out of the blue it crashed to ~10KB/sec.  Why can't it get out of
> slow-start?  Is it even stuck in slow-start like I think?  Is the 148-155ms
> bimodal rtt the problem?
>
> Any insight would be greatly appreciated.  (please don't drop me from cc:)
turn on siftr to get a "protocol'e-eye-view" of what's going on.