From nobody Wed Nov 20 21:41:19 2024 X-Original-To: dev-commits-src-all@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4Xtvv50TKWz5dn7H; Wed, 20 Nov 2024 21:41:21 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org [IPv6:2610:1c1:1:606c::19:3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "mxrelay.nyi.freebsd.org", Issuer "R10" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4Xtvv31cKVz4vpD; Wed, 20 Nov 2024 21:41:19 +0000 (UTC) (envelope-from git@FreeBSD.org) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1732138879; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding; bh=oUHydZZ7kDAzaqe7/Dy4PWUNzldW39Y/aIj+upo7QSg=; b=GS5yV3iozCnSFBgrkAI5Q3dF/sIaejX25K6OMKCqJ39xyEKthAGddRXErg1AqJbWMhbBYQ WyJMSJBzfZ3YUst+Dg6Je+8W5/XjV0pg4zREIxJOHJNEt3MM7kTcN3VW2QReVjJqVu4HWv CNed5pOEKLk/B72As4XwWZzAX9K8vzYYMkOVRP/70EoDC2FX8k+qU4iifPT+4E+gjv1fUH oAE6gkHmvHdKQCNHuZkQowvQcG+q7O2+3Jq6hi/9CPKrNMdsrL4dNusOs5GZZ/gqK43Iz3 ikC3+OFFm1X2P6rNj7XBHb0p75WdFWhZPo/TZCpFUnWxRZMKxJkhBLnTO/OojQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1732138879; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding; bh=oUHydZZ7kDAzaqe7/Dy4PWUNzldW39Y/aIj+upo7QSg=; b=D2mWH+bg6Ve+1gw9mqpvtU3wXSlYivaTfePoGAxxGFk8ESq4QGoCSgQgKbHD2WgCmxpjjp DIKA1+V2srZj+milCblItH2GkC5om97z6RESdjXLFlAZFNnAT4qYNITpwYOytQih3ETBQv xwKeqxcOVDiUXZIAhMSW+aSmCI0SwhpI0KLB0aH/roMvz7FWifPjXGCOZZpqc7Mo+jGWq7 3yVgn464M6UTQw9in2JNMddN9jkAkIeq3h9lg4WgcA8BQtCosAjWlJvFzC19//owTE2lPH T1ZHhON3x/28m47+3RM8x3GLjjfuD7FW8s+6k4/RAe+GRfVaBH3GpgdeWTbayg== ARC-Authentication-Results: i=1; mx1.freebsd.org; none ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1732138879; a=rsa-sha256; cv=none; b=sEd0y/Wl7N57uBg1SYnjll8C7goRVJQt1vviLE8ViwMNfWVIG51Hx2zK3zI6SEiYfztRFH wgft7MW55dGdPsSYwso5HqRJN+bi/Q7Zlb6w285y2GmPK1A+6Jt+pJFlx6RBY2pvyyA2vP IzkX1zYU7/PI/QBrh7X35JSN9iPHOrthf3z2TdhXdKD5bpegh9TkgKUIDRTVU1rR0r0BDZ idSvyfXC/RgQTaC0CaQJuz5m2g1RtIJW1TrH1/qIxUXWyNpw/WJrEKBRLRqGkuhDyOb/2F SKGfXvlaVQ5rmXhtmou3qjjw/R9t8SS4GoXxqmFQGH/MRRSjfTfUIZcoWP6sFA== Received: from gitrepo.freebsd.org (gitrepo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 4Xtvv315Ryz1B99; Wed, 20 Nov 2024 21:41:19 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from gitrepo.freebsd.org ([127.0.1.44]) by gitrepo.freebsd.org (8.18.1/8.18.1) with ESMTP id 4AKLfJRd022073; Wed, 20 Nov 2024 21:41:19 GMT (envelope-from git@gitrepo.freebsd.org) Received: (from git@localhost) by gitrepo.freebsd.org (8.18.1/8.18.1/Submit) id 4AKLfJ6D022070; Wed, 20 Nov 2024 21:41:19 GMT (envelope-from git) Date: Wed, 20 Nov 2024 21:41:19 GMT Message-Id: <202411202141.4AKLfJ6D022070@gitrepo.freebsd.org> To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-branches@FreeBSD.org From: Mark Johnston Subject: git: 3afdae488574 - stable/14 - gve: Fix TX livelock List-Id: Commit messages for all branches of the src repository List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all List-Help: List-Post: List-Subscribe: List-Unsubscribe: X-BeenThere: dev-commits-src-all@freebsd.org Sender: owner-dev-commits-src-all@FreeBSD.org MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Git-Committer: markj X-Git-Repository: src X-Git-Refname: refs/heads/stable/14 X-Git-Reftype: branch X-Git-Commit: 3afdae4885743a2ed5cec3f08d8d8ec9e9376b59 Auto-Submitted: auto-generated The branch stable/14 has been updated by markj: URL: https://cgit.FreeBSD.org/src/commit/?id=3afdae4885743a2ed5cec3f08d8d8ec9e9376b59 commit 3afdae4885743a2ed5cec3f08d8d8ec9e9376b59 Author: Shailend Chand AuthorDate: 2024-11-05 19:38:31 +0000 Commit: Mark Johnston CommitDate: 2024-11-20 21:41:08 +0000 gve: Fix TX livelock Before this change the transmit taskqueue would enqueue itself when it cannot find space on the NIC ring with the hope that eventually space would be made. This results in the following livelock that only occurs after passing ~200Gbps of TCP traffic for many hours: 100% CPU ┌───────────┐wait on ┌──────────┐ ┌───────────┐ │user thread│ cpu │gve xmit │wait on │gve cleanup│ │with mbuf ├────────►│taskqueue ├────────►│taskqueue │ │uma lock │ │ │ NIC ring│ │ └───────────┘ └──────────┘ space └─────┬─────┘ ▲ │ │ wait on mbuf uma lock │ └───────────────────────────────────────────┘ Further details about the livelock are available on https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=281560. After this change, the transmit taskqueue no longer spins till there is room on the NIC ring. It instead stops itself and lets the completion-processing taskqueue wake it up. Since I'm touching the trasnmit taskqueue I've also corrected the name of a counter and also fixed a bug where EINVAL mbufs were not being freed and were instead living forever on the bufring. Signed-off-by: Shailend Chand Reviewed-by: markj MFC-after: 2 weeks Differential Revision: https://reviews.freebsd.org/D47138 (cherry picked from commit 40097cd67c0d52e2b288e8555b12faf02768d89c) --- sys/dev/gve/gve.h | 3 +- sys/dev/gve/gve_main.c | 4 +- sys/dev/gve/gve_sysctl.c | 6 +-- sys/dev/gve/gve_tx.c | 98 ++++++++++++++++++++++++++++++++++++++---------- sys/dev/gve/gve_tx_dqo.c | 10 +++++ 5 files changed, 95 insertions(+), 26 deletions(-) diff --git a/sys/dev/gve/gve.h b/sys/dev/gve/gve.h index 43082d64ba95..92ab6838d5bb 100644 --- a/sys/dev/gve/gve.h +++ b/sys/dev/gve/gve.h @@ -341,7 +341,7 @@ struct gve_txq_stats { counter_u64_t tpackets; counter_u64_t tso_packet_cnt; counter_u64_t tx_dropped_pkt; - counter_u64_t tx_dropped_pkt_nospace_device; + counter_u64_t tx_delayed_pkt_nospace_device; counter_u64_t tx_dropped_pkt_nospace_bufring; counter_u64_t tx_delayed_pkt_nospace_descring; counter_u64_t tx_delayed_pkt_nospace_compring; @@ -383,6 +383,7 @@ struct gve_tx_ring { struct task xmit_task; struct taskqueue *xmit_tq; + bool stopped; /* Accessed when writing descriptors */ struct buf_ring *br; diff --git a/sys/dev/gve/gve_main.c b/sys/dev/gve/gve_main.c index 5575c82f0681..e40f77b08e85 100644 --- a/sys/dev/gve/gve_main.c +++ b/sys/dev/gve/gve_main.c @@ -32,10 +32,10 @@ #include "gve_adminq.h" #include "gve_dqo.h" -#define GVE_DRIVER_VERSION "GVE-FBSD-1.3.0\n" +#define GVE_DRIVER_VERSION "GVE-FBSD-1.3.1\n" #define GVE_VERSION_MAJOR 1 #define GVE_VERSION_MINOR 3 -#define GVE_VERSION_SUB 0 +#define GVE_VERSION_SUB 1 #define GVE_DEFAULT_RX_COPYBREAK 256 diff --git a/sys/dev/gve/gve_sysctl.c b/sys/dev/gve/gve_sysctl.c index b9f2eefa5bb0..7a091d9caa43 100644 --- a/sys/dev/gve/gve_sysctl.c +++ b/sys/dev/gve/gve_sysctl.c @@ -140,9 +140,9 @@ gve_setup_txq_sysctl(struct sysctl_ctx_list *ctx, "tx_bytes", CTLFLAG_RD, &stats->tbytes, "Bytes transmitted"); SYSCTL_ADD_COUNTER_U64(ctx, tx_list, OID_AUTO, - "tx_dropped_pkt_nospace_device", CTLFLAG_RD, - &stats->tx_dropped_pkt_nospace_device, - "Packets dropped due to no space in device"); + "tx_delayed_pkt_nospace_device", CTLFLAG_RD, + &stats->tx_delayed_pkt_nospace_device, + "Packets delayed due to no space in device"); SYSCTL_ADD_COUNTER_U64(ctx, tx_list, OID_AUTO, "tx_dropped_pkt_nospace_bufring", CTLFLAG_RD, &stats->tx_dropped_pkt_nospace_bufring, diff --git a/sys/dev/gve/gve_tx.c b/sys/dev/gve/gve_tx.c index 60a54878b685..e7e10e526cb9 100644 --- a/sys/dev/gve/gve_tx.c +++ b/sys/dev/gve/gve_tx.c @@ -246,6 +246,8 @@ gve_start_tx_ring(struct gve_priv *priv, int i, struct gve_tx_ring *tx = &priv->tx[i]; struct gve_ring_com *com = &tx->com; + atomic_store_bool(&tx->stopped, false); + NET_TASK_INIT(&com->cleanup_task, 0, cleanup, tx); com->cleanup_tq = taskqueue_create_fast("gve tx", M_WAITOK, taskqueue_thread_enqueue, &com->cleanup_tq); @@ -427,6 +429,11 @@ gve_tx_cleanup_tq(void *arg, int pending) gve_db_bar_write_4(priv, tx->com.irq_db_offset, GVE_IRQ_MASK); taskqueue_enqueue(tx->com.cleanup_tq, &tx->com.cleanup_task); } + + if (atomic_load_bool(&tx->stopped) && space_freed) { + atomic_store_bool(&tx->stopped, false); + taskqueue_enqueue(tx->xmit_tq, &tx->xmit_task); + } } static void @@ -671,8 +678,7 @@ gve_xmit(struct gve_tx_ring *tx, struct mbuf *mbuf) bytes_required = gve_fifo_bytes_required(tx, first_seg_len, pkt_len); if (__predict_false(!gve_can_tx(tx, bytes_required))) { counter_enter(); - counter_u64_add_protected(tx->stats.tx_dropped_pkt_nospace_device, 1); - counter_u64_add_protected(tx->stats.tx_dropped_pkt, 1); + counter_u64_add_protected(tx->stats.tx_delayed_pkt_nospace_device, 1); counter_exit(); return (ENOBUFS); } @@ -733,6 +739,59 @@ gve_xmit(struct gve_tx_ring *tx, struct mbuf *mbuf) return (0); } +static int +gve_xmit_mbuf(struct gve_tx_ring *tx, + struct mbuf **mbuf) +{ + if (gve_is_gqi(tx->com.priv)) + return (gve_xmit(tx, *mbuf)); + + if (gve_is_qpl(tx->com.priv)) + return (gve_xmit_dqo_qpl(tx, *mbuf)); + + /* + * gve_xmit_dqo might attempt to defrag the mbuf chain. + * The reference is passed in so that in the case of + * errors, the new mbuf chain is what's put back on the br. + */ + return (gve_xmit_dqo(tx, mbuf)); +} + +/* + * Has the side-effect of stopping the xmit queue by setting tx->stopped + */ +static int +gve_xmit_retry_enobuf_mbuf(struct gve_tx_ring *tx, + struct mbuf **mbuf) +{ + int err; + + atomic_store_bool(&tx->stopped, true); + + /* + * Room made in the queue BEFORE the barrier will be seen by the + * gve_xmit_mbuf retry below. + * + * If room is made in the queue AFTER the barrier, the cleanup tq + * iteration creating the room will either see a tx->stopped value + * of 0 or the 1 we just wrote: + * + * If it sees a 1, then it would enqueue the xmit tq. Enqueue + * implies a retry on the waiting pkt. + * + * If it sees a 0, then that implies a previous iteration overwrote + * our 1, and that iteration would enqueue the xmit tq. Enqueue + * implies a retry on the waiting pkt. + */ + atomic_thread_fence_seq_cst(); + + err = gve_xmit_mbuf(tx, mbuf); + if (err == 0) + atomic_store_bool(&tx->stopped, false); + + return (err); +} + static void gve_xmit_br(struct gve_tx_ring *tx) { @@ -743,24 +802,23 @@ gve_xmit_br(struct gve_tx_ring *tx) while ((if_getdrvflags(ifp) & IFF_DRV_RUNNING) != 0 && (mbuf = drbr_peek(ifp, tx->br)) != NULL) { + err = gve_xmit_mbuf(tx, &mbuf); - if (gve_is_gqi(priv)) - err = gve_xmit(tx, mbuf); - else { - /* - * gve_xmit_dqo might attempt to defrag the mbuf chain. - * The reference is passed in so that in the case of - * errors, the new mbuf chain is what's put back on the br. - */ - if (gve_is_qpl(priv)) - err = gve_xmit_dqo_qpl(tx, mbuf); - else - err = gve_xmit_dqo(tx, &mbuf); - } + /* + * We need to stop this taskqueue when we can't xmit the pkt due + * to lack of space in the NIC ring (ENOBUFS). The retry exists + * to guard against a TOCTTOU bug that could end up freezing the + * queue forever. + */ + if (__predict_false(mbuf != NULL && err == ENOBUFS)) + err = gve_xmit_retry_enobuf_mbuf(tx, &mbuf); if (__predict_false(err != 0 && mbuf != NULL)) { - drbr_putback(ifp, tx->br, mbuf); - taskqueue_enqueue(tx->xmit_tq, &tx->xmit_task); + if (err == EINVAL) { + drbr_advance(ifp, tx->br); + m_freem(mbuf); + } else + drbr_putback(ifp, tx->br, mbuf); break; } @@ -827,7 +885,8 @@ gve_xmit_ifp(if_t ifp, struct mbuf *mbuf) is_br_empty = drbr_empty(ifp, tx->br); err = drbr_enqueue(ifp, tx->br, mbuf); if (__predict_false(err != 0)) { - taskqueue_enqueue(tx->xmit_tq, &tx->xmit_task); + if (!atomic_load_bool(&tx->stopped)) + taskqueue_enqueue(tx->xmit_tq, &tx->xmit_task); counter_enter(); counter_u64_add_protected(tx->stats.tx_dropped_pkt_nospace_bufring, 1); counter_u64_add_protected(tx->stats.tx_dropped_pkt, 1); @@ -842,9 +901,8 @@ gve_xmit_ifp(if_t ifp, struct mbuf *mbuf) if (is_br_empty && (GVE_RING_TRYLOCK(tx) != 0)) { gve_xmit_br(tx); GVE_RING_UNLOCK(tx); - } else { + } else if (!atomic_load_bool(&tx->stopped)) taskqueue_enqueue(tx->xmit_tq, &tx->xmit_task); - } return (0); } diff --git a/sys/dev/gve/gve_tx_dqo.c b/sys/dev/gve/gve_tx_dqo.c index 323c032a3e65..fab2d6d0f613 100644 --- a/sys/dev/gve/gve_tx_dqo.c +++ b/sys/dev/gve/gve_tx_dqo.c @@ -1052,6 +1052,16 @@ gve_tx_cleanup_dqo(struct gve_priv *priv, struct gve_tx_ring *tx, int budget) work_done++; } + /* + * Waking the xmit taskqueue has to occur after room has been made in + * the queue. + */ + atomic_thread_fence_seq_cst(); + if (atomic_load_bool(&tx->stopped) && work_done) { + atomic_store_bool(&tx->stopped, false); + taskqueue_enqueue(tx->xmit_tq, &tx->xmit_task); + } + tx->done += work_done; /* tx->done is just a sysctl counter */ counter_enter(); counter_u64_add_protected(tx->stats.tbytes, bytes_done);