From nobody Tue Aug 23 13:14:16 2022
Date: Tue, 23 Aug 2022 13:14:16 GMT
Message-Id: <202208231314.27NDEGlM015022@gitrepo.freebsd.org>
To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org,
    dev-commits-src-main@FreeBSD.org
From: Randall Stewart
Subject: git: 4e0ce82b53e2 - main - TCP Lro has a loss of timestamp precision
    and reorders packets.
List-Id: Commit messages for all branches of the src repository
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all
Sender: owner-dev-commits-src-all@freebsd.org
X-BeenThere: dev-commits-src-all@freebsd.org
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Git-Committer: rrs
X-Git-Repository: src
X-Git-Refname: refs/heads/main
X-Git-Reftype: branch
X-Git-Commit: 4e0ce82b53e2a8552fd44c6a4d4de986e11135f1
Auto-Submitted: auto-generated
The branch main has been updated by rrs:

URL: https://cgit.FreeBSD.org/src/commit/?id=4e0ce82b53e2a8552fd44c6a4d4de986e11135f1

commit 4e0ce82b53e2a8552fd44c6a4d4de986e11135f1
Author:     Randall Stewart
AuthorDate: 2022-08-23 13:12:31 +0000
Commit:     Randall Stewart
CommitDate: 2022-08-23 13:12:31 +0000

    TCP Lro has a loss of timestamp precision and reorders packets.

    A while back Hans optimized the LRO code. This is great, but one of
    the optimizations degrades the timestamp precision: if there is no
    hardware timestamp, all flushed LRO entries end up with the same LRO
    timestamp. The intent of the LRO timestamp is to get as close as
    possible to the time the packet arrived. Without LRO queuing this
    works out fine, since a binuptime is taken and then the rx_common
    code is called. But when you go through the queue path, the
    M_TSTMP_LRO fields end up *not* being updated.

    Another issue in the LRO code is that several places cause packet
    reordering. In general TCP can handle reordering, but it can cause
    extra unneeded retransmissions as well as other oddities. Fix all of
    the reordering problems, and restore the precision of the timestamp.
Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36043
---
 sys/netinet/tcp_lro.c | 110 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 89 insertions(+), 21 deletions(-)

diff --git a/sys/netinet/tcp_lro.c b/sys/netinet/tcp_lro.c
index fcde002bac53..a4fc5580dfc7 100644
--- a/sys/netinet/tcp_lro.c
+++ b/sys/netinet/tcp_lro.c
@@ -91,7 +91,7 @@ static int tcp_lro_rx_common(struct lro_ctrl *lc, struct mbuf *m,
 
 #ifdef TCPHPTS
 static bool do_bpf_strip_and_compress(struct inpcb *, struct lro_ctrl *,
     struct lro_entry *, struct mbuf **, struct mbuf **, struct mbuf **,
-    bool *, bool, bool, struct ifnet *);
+    bool *, bool, bool, struct ifnet *, bool);
 #endif
 
@@ -119,6 +119,11 @@ SYSCTL_UINT(_net_inet_tcp_lro, OID_AUTO, lro_cpu_threshold,
     CTLFLAG_RDTUN | CTLFLAG_MPSAFE, &tcp_lro_cpu_set_thresh, 0,
     "Number of interrupts in a row on the same CPU that will make us declare an 'affinity' cpu?");
 
+static uint32_t tcp_less_accurate_lro_ts = 0;
+SYSCTL_UINT(_net_inet_tcp_lro, OID_AUTO, lro_less_accurate,
+    CTLFLAG_MPSAFE, &tcp_less_accurate_lro_ts, 0,
+    "Do we trade off efficency by doing less timestamp operations for time accuracy?");
+
 SYSCTL_COUNTER_U64(_net_inet_tcp_lro, OID_AUTO, fullqueue, CTLFLAG_RD,
     &tcp_inp_lro_direct_queue, "Number of lro's fully queued to transport");
 SYSCTL_COUNTER_U64(_net_inet_tcp_lro, OID_AUTO, wokeup, CTLFLAG_RD,
@@ -598,6 +603,29 @@ tcp_lro_rx_done(struct lro_ctrl *lc)
 	}
 }
 
+static void
+tcp_lro_flush_active(struct lro_ctrl *lc)
+{
+	struct lro_entry *le;
+
+	/*
+	 * Walk through the list of le entries, and
+	 * any one that does have packets flush. This
+	 * is called because we have an inbound packet
+	 * (e.g. SYN) that has to have all others flushed
+	 * in front of it. Note we have to do the remove
+	 * because tcp_lro_flush() assumes that the entry
+	 * is being freed. This is ok it will just get
+	 * reallocated again like it was new.
+	 */
+	LIST_FOREACH(le, &lc->lro_active, next) {
+		if (le->m_head != NULL) {
+			tcp_lro_active_remove(le);
+			tcp_lro_flush(lc, le);
+		}
+	}
+}
+
 void
 tcp_lro_flush_inactive(struct lro_ctrl *lc, const struct timeval *timeout)
 {
@@ -664,9 +692,9 @@ tcp_lro_log(struct tcpcb *tp, const struct lro_ctrl *lc,
 		log.u_bbr.flex2 = m->m_pkthdr.len;
 	else
 		log.u_bbr.flex2 = 0;
-	log.u_bbr.flex3 = le->m_head->m_pkthdr.lro_nsegs;
-	log.u_bbr.flex4 = le->m_head->m_pkthdr.lro_tcp_d_len;
 	if (le->m_head) {
+		log.u_bbr.flex3 = le->m_head->m_pkthdr.lro_nsegs;
+		log.u_bbr.flex4 = le->m_head->m_pkthdr.lro_tcp_d_len;
 		log.u_bbr.flex5 = le->m_head->m_pkthdr.len;
 		log.u_bbr.delRate = le->m_head->m_flags;
 		log.u_bbr.rttProp = le->m_head->m_pkthdr.rcv_tstmp;
@@ -1161,7 +1189,7 @@ tcp_queue_pkts(struct inpcb *inp, struct tcpcb *tp, struct lro_entry *le)
 
 static struct mbuf *
 tcp_lro_get_last_if_ackcmp(struct lro_ctrl *lc, struct lro_entry *le,
-    struct inpcb *inp, int32_t *new_m)
+    struct inpcb *inp, int32_t *new_m, bool can_append_old_cmp)
 {
 	struct tcpcb *tp;
 	struct mbuf *m;
@@ -1171,19 +1199,22 @@ tcp_lro_get_last_if_ackcmp(struct lro_ctrl *lc, struct lro_entry *le,
 		return (NULL);
 
 	/* Look at the last mbuf if any in queue */
-	m = tp->t_tail_pkt;
-	if (m != NULL && (m->m_flags & M_ACKCMP) != 0) {
-		if (M_TRAILINGSPACE(m) >= sizeof(struct tcp_ackent)) {
-			tcp_lro_log(tp, lc, le, NULL, 23, 0, 0, 0, 0);
-			*new_m = 0;
-			counter_u64_add(tcp_extra_mbuf, 1);
-			return (m);
-		} else {
-			/* Mark we ran out of space */
-			inp->inp_flags2 |= INP_MBUF_L_ACKS;
+	if (can_append_old_cmp) {
+		m = tp->t_tail_pkt;
+		if (m != NULL && (m->m_flags & M_ACKCMP) != 0) {
+			if (M_TRAILINGSPACE(m) >= sizeof(struct tcp_ackent)) {
+				tcp_lro_log(tp, lc, le, NULL, 23, 0, 0, 0, 0);
+				*new_m = 0;
+				counter_u64_add(tcp_extra_mbuf, 1);
+				return (m);
+			} else {
+				/* Mark we ran out of space */
+				inp->inp_flags2 |= INP_MBUF_L_ACKS;
+			}
 		}
 	}
 	/* Decide mbuf size. */
+	tcp_lro_log(tp, lc, le, NULL, 21, 0, 0, 0, 0);
 	if (inp->inp_flags2 & INP_MBUF_L_ACKS)
 		m = m_getcl(M_NOWAIT, MT_DATA, M_ACKCMP | M_PKTHDR);
 	else
@@ -1194,6 +1225,7 @@ tcp_lro_get_last_if_ackcmp(struct lro_ctrl *lc, struct lro_entry *le,
 		return (NULL);
 	}
 	counter_u64_add(tcp_comp_total, 1);
+	m->m_pkthdr.rcvif = lc->ifp;
 	m->m_flags |= M_ACKCMP;
 	*new_m = 1;
 	return (m);
@@ -1290,7 +1322,7 @@ tcp_lro_flush_tcphpts(struct lro_ctrl *lc, struct lro_entry *le)
 	struct tcpcb *tp;
 	struct mbuf **pp, *cmp, *mv_to;
 	struct ifnet *lagg_ifp;
-	bool bpf_req, lagg_bpf_req, should_wake;
+	bool bpf_req, lagg_bpf_req, should_wake, can_append_old_cmp;
 
 	/* Check if packet doesn't belongs to our network interface. */
 	if ((tcplro_stacks_wanting_mbufq == 0) ||
@@ -1361,18 +1393,33 @@ tcp_lro_flush_tcphpts(struct lro_ctrl *lc, struct lro_entry *le)
 	}
 
 	/* Strip and compress all the incoming packets. */
+	can_append_old_cmp = true;
 	cmp = NULL;
 	for (pp = &le->m_head; *pp != NULL; ) {
 		mv_to = NULL;
 		if (do_bpf_strip_and_compress(inp, lc, le, pp,
 		    &cmp, &mv_to, &should_wake, bpf_req,
-		    lagg_bpf_req, lagg_ifp) == false) {
+		    lagg_bpf_req, lagg_ifp, can_append_old_cmp) == false) {
 			/* Advance to next mbuf. */
 			pp = &(*pp)->m_nextpkt;
+			/*
+			 * Once we have appended we can't look in the pending
+			 * inbound packets for a compressed ack to append to.
+			 */
+			can_append_old_cmp = false;
+			/*
+			 * Once we append we also need to stop adding to any
+			 * compressed ack we were remembering. A new cmp
+			 * ack will be required.
+			 */
+			cmp = NULL;
+			tcp_lro_log(tp, lc, le, NULL, 25, 0, 0, 0, 0);
 		} else if (mv_to != NULL) {
 			/* We are asked to move pp up */
 			pp = &mv_to->m_nextpkt;
-		}
+			tcp_lro_log(tp, lc, le, NULL, 24, 0, 0, 0, 0);
+		} else
+			tcp_lro_log(tp, lc, le, NULL, 26, 0, 0, 0, 0);
 	}
 
 	/* Update "m_last_mbuf", if any. */
 	if (pp == &le->m_head)
@@ -1562,6 +1609,8 @@ tcp_lro_flush_all(struct lro_ctrl *lc)
 
 		/* add packet to LRO engine */
 		if (tcp_lro_rx_common(lc, mb, 0, false) != 0) {
+			/* Flush anything we have acummulated */
+			tcp_lro_flush_active(lc);
 			/* input packet to network layer */
 			(*lc->ifp->if_input)(lc->ifp, mb);
 			lc->lro_queued++;
@@ -1589,10 +1638,11 @@ build_ack_entry(struct tcp_ackent *ae, struct tcphdr *th, struct mbuf *m,
 	 * entry.
 	 */
 	ae->timestamp = m->m_pkthdr.rcv_tstmp;
+	ae->flags = 0;
 	if (m->m_flags & M_TSTMP_LRO)
-		ae->flags = TSTMP_LRO;
+		ae->flags |= TSTMP_LRO;
 	else if (m->m_flags & M_TSTMP)
-		ae->flags = TSTMP_HDWR;
+		ae->flags |= TSTMP_HDWR;
 	ae->seq = ntohl(th->th_seq);
 	ae->ack = ntohl(th->th_ack);
 	ae->flags |= tcp_get_flags(th);
@@ -1612,7 +1662,7 @@ build_ack_entry(struct tcp_ackent *ae, struct tcphdr *th, struct mbuf *m,
 static bool
 do_bpf_strip_and_compress(struct inpcb *inp, struct lro_ctrl *lc,
     struct lro_entry *le, struct mbuf **pp, struct mbuf **cmp, struct mbuf **mv_to,
-    bool *should_wake, bool bpf_req, bool lagg_bpf_req, struct ifnet *lagg_ifp)
+    bool *should_wake, bool bpf_req, bool lagg_bpf_req, struct ifnet *lagg_ifp, bool can_append_old_cmp)
 {
 	union {
 		void *ptr;
@@ -1718,7 +1768,7 @@ do_bpf_strip_and_compress(struct inpcb *inp, struct lro_ctrl *lc,
 	/* Now lets get space if we don't have some already */
 	if (*cmp == NULL) {
 new_one:
-		nm = tcp_lro_get_last_if_ackcmp(lc, le, inp, &n_mbuf);
+		nm = tcp_lro_get_last_if_ackcmp(lc, le, inp, &n_mbuf, can_append_old_cmp);
 		if (__predict_false(nm == NULL))
 			goto done;
 		*cmp = nm;
@@ -1954,6 +2004,14 @@ tcp_lro_rx(struct lro_ctrl *lc, struct mbuf *m, uint32_t csum)
 	binuptime(&lc->lro_last_queue_time);
 	CURVNET_SET(lc->ifp->if_vnet);
 	error = tcp_lro_rx_common(lc, m, csum, true);
+	if (__predict_false(error != 0)) {
+		/*
+		 * Flush anything we have acummulated
+		 * ahead of this packet that can't
+		 * be LRO'd. This preserves order.
+		 */
+		tcp_lro_flush_active(lc);
+	}
 	CURVNET_RESTORE();
 
 	return (error);
@@ -1978,6 +2036,16 @@ tcp_lro_queue_mbuf(struct lro_ctrl *lc, struct mbuf *mb)
 		return;
 	}
 
+	/* If no hardware or arrival stamp on the packet add timestamp */
+	if ((tcplro_stacks_wanting_mbufq > 0) &&
+	    (tcp_less_accurate_lro_ts == 0) &&
+	    ((mb->m_flags & M_TSTMP) == 0)) {
+		/* Add in an LRO time since no hardware */
+		binuptime(&lc->lro_last_queue_time);
+		mb->m_pkthdr.rcv_tstmp = bintime2ns(&lc->lro_last_queue_time);
+		mb->m_flags |= M_TSTMP_LRO;
+	}
+
 	/* create sequence number */
 	lc->lro_mbuf_data[lc->lro_mbuf_count].seq =
 	    (((uint64_t)M_HASHTYPE_GET(mb)) << 56) |
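[Editor's note] The core idea of the queue-path change in tcp_lro_queue_mbuf() above can be sketched in userspace C. This is an illustration only, not kernel code: `toy_mbuf`, `toy_queue_mbuf()`, `toy_mono_ns()` and the `TOY_M_TSTMP*` flags are hypothetical stand-ins for the mbuf, its flags, and the binuptime()/bintime2ns() pair. The point it demonstrates is that each packet without a hardware timestamp gets its own monotonic software stamp at enqueue time, rather than the whole flushed batch sharing a single flush-time stamp.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Toy stand-ins for the mbuf timestamp flags (values are arbitrary). */
#define TOY_M_TSTMP	0x1	/* hardware arrival timestamp present */
#define TOY_M_TSTMP_LRO	0x2	/* software LRO timestamp present */

struct toy_mbuf {
	uint64_t rcv_tstmp;	/* arrival time, nanoseconds */
	uint32_t flags;
};

/*
 * Monotonic "now" in nanoseconds; the userspace analogue of
 * binuptime() followed by bintime2ns().
 */
static uint64_t
toy_mono_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec);
}

/*
 * Mirror of the fix: if the packet carries no hardware timestamp,
 * stamp it with a per-packet software arrival time at enqueue, so
 * packets queued at different times keep distinct timestamps.
 */
static void
toy_queue_mbuf(struct toy_mbuf *mb)
{
	if ((mb->flags & TOY_M_TSTMP) == 0) {
		mb->rcv_tstmp = toy_mono_ns();
		mb->flags |= TOY_M_TSTMP_LRO;
	}
}
```

Under this sketch, two packets enqueued in order get non-decreasing stamps, and a packet that already has a hardware stamp is left untouched, matching the `(mb->m_flags & M_TSTMP) == 0` guard in the diff.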