Re: ssh connections break with "Fssh_packet_write_wait" on 13 [SOLVED]
Date: Wed, 09 Jun 2021 08:30:27 UTC
> On 8 Jun, Michael Gmelin wrote:
> >
> >
> > On Thu, 3 Jun 2021 15:09:06 +0200
> > Michael Gmelin <freebsd@grem.de> wrote:
> >
> >> On Tue, 1 Jun 2021 13:47:47 +0200
> >> Michael Gmelin <freebsd@grem.de> wrote:
> >>
> >> > Hi,
> >> >
> >> > Since upgrading servers from 12.2 to 13.0, I get
> >> >
> >> >   Fssh_packet_write_wait: Connection to 1.2.3.4 port 22: Broken pipe
> >> >
> >> > consistently, usually after about 11 idle minutes, that's with and
> >> > without pf enabled. Client (11.4 in a VM) wasn't altered.
> >> >
> >> > Verbose logging (client and server side) doesn't show anything
> >> > special when the connection breaks. In the past, QoS problems
> >> > caused these disconnects, but I didn't see anything apparent
> >> > changing between 12.2 and 13 in this respect.
> >> >
> >> > I did a test on a newly commissioned server to rule out other
> >> > factors (so, same client connections, same routes, same
> >> > everything). On 12.2 before the update: Connection stays open for
> >> > hours. After the update (same server): connection breaks
> >> > consistently after < 15 minutes (this is with unaltered
> >> > configurations, no *AliveInterval configured on either side of the
> >> > connection).
> >>
> >> I did a little bit more testing and realized that the problem goes
> >> away when I disable "Proportional Rate Reduction per RFC 6937" on the
> >> server side:
> >>
> >>   sysctl net.inet.tcp.do_prr=0
> >>
> >> Keeping it on and enabling net.inet.tcp.do_prr_conservative doesn't
> >> fix the problem.
> >>
> >> This seems to be specific to Parallels. After some more digging, I
> >> realized that Parallels Desktop's NAT daemon (prl_naptd) handles
> >> keep-alive between the VM and the external server on its own. There is
> >> no direct communication between the client and the server. This means:
> >>
> >> - The NAT daemon starts sending keep-alive packets right away (not
> >>   after the VM's net.inet.tcp.keepidle), every 75 seconds.
> >> - Keep-alive packets originating in the VM never reach the server.
> >> - Keep-alive originating on the server never reaches the VM.
> >> - Client and server basically do keep-alive with the NAT daemon, not
> >>   with each other.
> >>
> >> It also seems like Parallels is filtering the tos field (so it's
> >> always 0x00), but that's unrelated.
> >>
> >> I configured a bhyve VM running FreeBSD 11.4 on a separate laptop on
> >> the same network for comparison and it has no such issues.
> >>
> >> Looking at tcpdump output on the server, this is what a keep-alive
> >> packet sent by Parallels looks like:
> >>
> >>   10:14:42.449681 IP (tos 0x0, ttl 64, id 15689, offset 0, flags
> >>   [none], proto TCP (6), length 40)
> >>       192.168.1.1.58222 > 192.168.1.2.22: Flags [.], cksum x (correct),
> >>       seq 2534, ack 3851, win 4096, length 0
> >>
> >> While those originating from the bhyve VM (after lowering
> >> net.inet.tcp.keepidle) look like this:
> >>
> >>   12:18:43.105460 IP (tos 0x0, ttl 62, id 0, offset 0, flags [DF],
> >>   proto TCP (6), length 52)
> >>       192.168.1.3.57555 > 192.168.1.2.22: Flags [.], cksum x
> >>       (correct), seq 1780337696, ack 45831723, win 1026, options
> >>       [nop,nop,TS val 3003646737 ecr 3331923346], length 0
> >>
> >> As written above, once net.inet.tcp.do_prr is disabled, keepalive
> >> seems to be working just fine. Otherwise, Parallels' NAT daemon kills
> >> the connection, as its keep-alive requests are not answered (well,
> >> that's what I think is happening):
> >>
> >>   10:19:43.614803 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
> >>   proto TCP (6), length 40)
> >>       192.168.1.1.58222 > 192.168.1.2.22: Flags [R.], cksum x (correct),
> >>       seq 2535, ack 3851, win 4096, length 0
> >>
> >> The easiest way to work around the problem client side is to configure
> >> ServerAliveInterval in ~/.ssh/config in the client VM.
> >>
> >> I'm curious though if this is basically a Parallels problem that has
> >> only been exposed by PRR being more correct (which is what I suspect),
> >> or if this is actually a FreeBSD problem.
> >>
> >
> > So, PRR probably was a red herring and the real reason this is happening
> > is that FreeBSD (since version 13[0]) by default discards packets
> > without timestamps for connections that formally had negotiated to have
> > them. This new behavior seems to be in line with RFC 7323, section
> > 3.2[1]:
> >
> >   "Once TSopt has been successfully negotiated, that is both <SYN> and
> >   <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
> >   segment for the duration of the connection, and SHOULD be sent in an
> >   <RST> segment (see Section 5.2 for details)."
> >
> > As it turns out, macOS does exactly this - send keep-alive packets
> > without a timestamp for connections that were negotiated to have them.
>
> I wonder if I'm running into this with ssh connections to freefall. My
> outgoing IPv6 connections pass through an ipfw firewall that uses
> dynamic rules. When the dynamic rule gets close to expiration, it
> generates keep alive packets that just seem to be ignored by freefall.
> Eventually the dynamic rule expires, then sometime later sshd on
> freefall sends a keepalive which gets dropped at my end.

Very likely:

freefall:rgrimes {101} sysctl net.inet.tcp.tolerate_missing_ts
net.inet.tcp.tolerate_missing_ts: 0

Can someone please flip this on freefall to =1.

-- 
Rod Grimes                                        rgrimes@freebsd.org
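
For anyone wanting to check their own captures: the RFC 7323 rule quoted
above can be applied to tcpdump -v output with a few lines of Python. This
is only a rough sketch of the rule as quoted, not FreeBSD's actual input
path. The function name would_be_dropped and the ts_negotiated parameter
are made up for illustration, and the sample strings are the TCP lines from
the dumps quoted in this thread, joined onto one line.

```python
import re

# TCP flags field as printed by tcpdump, e.g. "Flags [.]" or "Flags [R.]".
# Case-sensitive on purpose so the IP-level "flags [DF]" field is not matched.
FLAGS_RE = re.compile(r"Flags \[([^\]]*)\]")
# Timestamp option as printed by tcpdump, e.g. "TS val 3003646737 ecr 3331923346".
TS_RE = re.compile(r"\bTS val \d+ ecr \d+")


def would_be_dropped(segment: str, ts_negotiated: bool = True) -> bool:
    """Model of the RFC 7323 section 3.2 rule (hypothetical helper).

    Returns True when the connection negotiated timestamps and a non-RST
    segment arrives without the TS option. Assumes a strict receiver,
    i.e. what the thread describes for tolerate_missing_ts=0.
    """
    match = FLAGS_RE.search(segment)
    if match is None:
        raise ValueError("no TCP flags field found in segment text")
    is_rst = "R" in match.group(1)
    has_ts = TS_RE.search(segment) is not None
    return ts_negotiated and not is_rst and not has_ts


# Keep-alive sent by the Parallels NAT daemon (no options at all).
PARALLELS_KEEPALIVE = (
    "192.168.1.1.58222 > 192.168.1.2.22: Flags [.], cksum x (correct), "
    "seq 2534, ack 3851, win 4096, length 0"
)
# Keep-alive sent by the bhyve VM (carries the TS option).
BHYVE_KEEPALIVE = (
    "192.168.1.3.57555 > 192.168.1.2.22: Flags [.], cksum x (correct), "
    "seq 1780337696, ack 45831723, win 1026, options "
    "[nop,nop,TS val 3003646737 ecr 3331923346], length 0"
)
# Reset sent by the Parallels NAT daemon; RST segments are exempt from the MUST.
PARALLELS_RESET = (
    "192.168.1.1.58222 > 192.168.1.2.22: Flags [R.], cksum x (correct), "
    "seq 2535, ack 3851, win 4096, length 0"
)
```

Calling would_be_dropped() on each sample flags only the Parallels
keep-alive, which matches what was observed: the timestamp-less keep-alive
goes unanswered, while the bhyve keep-alive and the final RST are accepted.
Passing ts_negotiated=False models a connection that never negotiated
timestamps, in which case nothing is flagged.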