From nobody Tue May 03 17:28:59 2022 X-Original-To: dev-commits-src-main@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id CEC201AC7356; Tue, 3 May 2022 17:28:59 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org [IPv6:2610:1c1:1:606c::19:3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "mxrelay.nyi.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4Kt6Q340vDz3D98; Tue, 3 May 2022 17:28:59 +0000 (UTC) (envelope-from git@FreeBSD.org) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1651598939; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding; bh=zTUhJnMqmdB/6NOeULLOM7k48PVmVl6i+zBpCEM+PQs=; b=gYWCuojZfT4Aesopge/T654E0gXC/bD4n+R96Lw77EzR2UepAW7cnlyd/Y5IQM3EEqJ6VK mzUdTRris6ERU3YjU7vej0NFoA8Nv/pUPCRj2LZkN7/LBYJLo5ngcK+EEj/hKc5mVHUnMY CRhUnVq/INcerSMcppt1Ov6BVysiBVlYhAaRajQxpAZ8WwPOmRsejB4nNEwk2IykoDSZPn AFcpqnIZ1xQwcL7dfMrj2DenzMO6BHy6+RItqZW+9Xtvw44M3V5FPWKKAITXT9Ou6Z/JkY kxgJd3MN7SY6Y6o3ThtRM65icTKe0h4RMo2eORbIycwDC1qEXWYp2f8+1ihnHw== Received: from gitrepo.freebsd.org (gitrepo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 4786222548; Tue, 3 May 2022 17:28:59 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from gitrepo.freebsd.org ([127.0.1.44]) by gitrepo.freebsd.org (8.16.1/8.16.1) with ESMTP id 243HSxse050077; Tue, 3 May 2022 17:28:59 GMT (envelope-from git@gitrepo.freebsd.org) Received: (from git@localhost) by gitrepo.freebsd.org (8.16.1/8.16.1/Submit) id 243HSxlo050076; Tue, 3 May 2022 17:28:59 GMT (envelope-from git) Date: Tue, 3 May 2022 17:28:59 GMT Message-Id: <202205031728.243HSxlo050076@gitrepo.freebsd.org> To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org From: Marko Zec Subject: git: d461deeaa4a4 - main - VNET: Revert "ifnet: make if_index global" List-Id: Commit messages for the main branch of the src repository List-Archive: https://lists.freebsd.org/archives/dev-commits-src-main List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-dev-commits-src-main@freebsd.org X-BeenThere: dev-commits-src-main@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Git-Committer: zec X-Git-Repository: src X-Git-Refname: refs/heads/main X-Git-Reftype: branch X-Git-Commit: d461deeaa4a47ae71e1d8fda8b35c6faa8dabe85 Auto-Submitted: auto-generated ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1651598939; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding; bh=zTUhJnMqmdB/6NOeULLOM7k48PVmVl6i+zBpCEM+PQs=; b=Ow1Ga8NVikk73Lj12qKBlHN/gNV5HBoZ3F+Vh8bJahUMJ7GhA2giw0GLhvnwjNWWqpik7N uocmmtxn3+95QmhDlzTLGcgL2wyfoG8TBgNEGqdOowpfOFpLHTs/OTyDN3sM6AzFI3yVVX gVrzpeO7EtJDrm1twEJcERRSmdh4c98FXrDuicybZRKkJws2yC4J3pvT0//FNpuLBT292r ZdNnu/1LGXw76D+RuuPy2ZrC2wBPsNdZOdy+zFuFHsRfPXBUiHM9mW4fnNx1XSkbzR3200 83Osrngx2nKfN9/Uw5Tqbn00Wi0UxhwSiCLrbY06VQVpchPHf/L055iKmeV0/Q== ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1651598939; a=rsa-sha256; cv=none; b=w6a9tYoAVxEWNJJBvwzwF2ExCM+m79GBe0/6w7URFhWKvsUVUlmgkzVx3pxN1rJQMW7V0U u3ylQW1O/ZmSh6G5dpFjOoOKum8RooYsK6umr+Nj5OLRfU4/HN8b8efDFvMpOPW5MPUI2x 0zAPd6jo+9HdTbJ4dRE+lglVkeJ9TXEAWMwUssFr1bAPEWebLmUnO6cwyil/o0dTCZ+y4y YZUH2T+YLZcud3TvlATjStdebKwQ4K53zqAOLGeSIMw6bxVfIYKMoiK/+iorLQJXCbeS2s xZK0W9fhSayH3JqbX+cn+wW7hK/sef52yKsovOrKXhblI7OrMPxUQF1noalNxw== ARC-Authentication-Results: i=1; mx1.freebsd.org; none X-ThisMailContainsUnwantedMimeParts: N The branch main has been updated by zec: URL: https://cgit.FreeBSD.org/src/commit/?id=d461deeaa4a47ae71e1d8fda8b35c6faa8dabe85 commit d461deeaa4a47ae71e1d8fda8b35c6faa8dabe85 Author: Marko Zec AuthorDate: 2022-05-03 14:57:55 +0000 Commit: Marko Zec CommitDate: 2022-05-03 17:27:57 +0000 VNET: Revert "ifnet: make if_index global" This reverts commit 91f44749c6feb50f39af8805dd803e860f0418f1. Devirtualization of V_if_index and V_ifindex_table was rushed into the tree lacking proper context, discussion, and declaration of intent, so I'm backing it out as harmful to VNET on the following grounds: 1) The change repurposed the decades-old and stable if_index KBI for new, unclear goals which were omitted from the commit note. 2) The change opened up a new resource exhaustion vector where any vnet could starve the system of ifnet indices, including vnet0. 3) To circumvent the newly introduced problem of separating ifnets belonging to different vnets from the globalized ifindex_table, the author introduced sysctl_ifcount() which does a linear traversal over the (potentially huge) global ifnet list just to return a simple upper bound on existing ifnet indices. 4) The change effectively led to nonuniform ifnet index allocation among vnets. 5) The commit note clearly stated that the patch changed the implicit if_index ABI contract where ifnet indices were assumed to be starting from one. The commit note also included a correct observation that holes in interface indices were always allowed, but failed to declare that the userland-observable ifindex tables could now include huge empty spans even under modest operating conditions. 6) The author had an earlier proposal in the works which did not affect per-vnet ifnet lists (D33265) but which he abandoned without providing the rationale behind his decision to do so, at the expense of sacrificing the vnet isolation contract and if_index ABI / KBI. Furthermore, the author agreed to back out his changes himself and to follow up with a proposal for a less intrusive alternative, but later silently declined to act. Therefore, I decided to resolve the status-quo by backing this out myself. This in no way precludes a future proposal aiming to mitigate ifnet-removal related system crashes or panics to be accepted, provided it would not unnecessarily compromise the goal of as strict as possible isolation between vnets. Obtained from: github.com/glebius/FreeBSD/commits/backout-ifindex --- sys/net/if.c | 214 +++++++++++++++++++++++++++++++++++------------------------ 1 file changed, 129 insertions(+), 85 deletions(-) diff --git a/sys/net/if.c b/sys/net/if.c index 3b303fe42e99..de63b9366843 100644 --- a/sys/net/if.c +++ b/sys/net/if.c @@ -311,30 +311,19 @@ VNET_DEFINE(struct ifnethead, ifnet); /* depend on static init XXX */ VNET_DEFINE(struct ifgrouphead, ifg_head); /* Table of ifnet by index. */ -static int if_index; -static int if_indexlim = 8; -static struct ifnet **ifindex_table; +VNET_DEFINE_STATIC(int, if_index); +#define V_if_index VNET(if_index) +VNET_DEFINE_STATIC(int, if_indexlim) = 8; +#define V_if_indexlim VNET(if_indexlim) +VNET_DEFINE_STATIC(struct ifnet **, ifindex_table); +#define V_ifindex_table VNET(ifindex_table) SYSCTL_NODE(_net_link_generic, IFMIB_SYSTEM, system, CTLFLAG_RW | CTLFLAG_MPSAFE, 0, "Variables global to all interfaces"); -static int -sysctl_ifcount(SYSCTL_HANDLER_ARGS) -{ - int rv = 0; - - IFNET_RLOCK(); - for (int i = 1; i <= if_index; i++) - if (ifindex_table[i] != NULL && - ifindex_table[i]->if_vnet == curvnet) - rv = i; - IFNET_RUNLOCK(); - - return (sysctl_handle_int(oidp, &rv, 0, req)); -} -SYSCTL_PROC(_net_link_generic_system, IFMIB_IFCOUNT, ifcount, - CTLTYPE_INT | CTLFLAG_VNET | CTLFLAG_RD, NULL, 0, sysctl_ifcount, "I", - "Maximum known interface index"); +SYSCTL_INT(_net_link_generic_system, IFMIB_IFCOUNT, ifcount, + CTLFLAG_VNET | CTLFLAG_RD, &VNET_NAME(if_index), 0, + "Number of configured interfaces"); /* * The global network interface list (V_ifnet) and related state (such as @@ -363,19 +352,13 @@ MALLOC_DEFINE(M_IFMADDR, "ether_multi", "link-level multicast address"); struct ifnet * ifnet_byindex(u_int idx) { - struct ifnet *ifp; NET_EPOCH_ASSERT(); - if (__predict_false(idx > if_index)) + if (__predict_false(idx > V_if_index)) return (NULL); - ifp = ck_pr_load_ptr(&ifindex_table[idx]); - - if (curvnet != NULL && ifp != NULL && ifp->if_vnet != curvnet) - ifp = NULL; - - return (ifp); + return (ck_pr_load_ptr(&V_ifindex_table[idx])); } struct ifnet * @@ -392,20 +375,63 @@ ifnet_byindex_ref(u_int idx) } /* - * Network interface utility routines. - * - * Routines with ifa_ifwith* names take sockaddr *'s as - * parameters. + * Allocate an ifindex array entry. */ +static void +ifindex_alloc(struct ifnet *ifp) +{ + u_short idx; + + IFNET_WLOCK(); + /* + * Try to find an empty slot below V_if_index. If we fail, take the + * next slot. + */ + for (idx = 1; idx <= V_if_index; idx++) { + if (V_ifindex_table[idx] == NULL) + break; + } + + /* Catch if_index overflow. */ + if (idx >= V_if_indexlim) { + struct ifnet **new, **old; + int newlim; + + newlim = V_if_indexlim * 2; + new = malloc(newlim * sizeof(*new), M_IFNET, M_WAITOK | M_ZERO); + memcpy(new, V_ifindex_table, V_if_indexlim * sizeof(*new)); + old = V_ifindex_table; + ck_pr_store_ptr(&V_ifindex_table, new); + V_if_indexlim = newlim; + epoch_wait_preempt(net_epoch_preempt); + free(old, M_IFNET); + } + if (idx > V_if_index) + V_if_index = idx; + + ifp->if_index = idx; + ck_pr_store_ptr(&V_ifindex_table[idx], ifp); + IFNET_WUNLOCK(); +} static void -if_init(void *arg __unused) +ifindex_free(u_short idx) { - ifindex_table = malloc(if_indexlim * sizeof(*ifindex_table), - M_IFNET, M_WAITOK | M_ZERO); + IFNET_WLOCK_ASSERT(); + + ck_pr_store_ptr(&V_ifindex_table[idx], NULL); + while (V_if_index > 0 && + V_ifindex_table[V_if_index] == NULL) + V_if_index--; } -SYSINIT(if_init, SI_SUB_INIT_IF, SI_ORDER_SECOND, if_init, NULL); + +/* + * Network interface utility routines. + * + * Routines with ifa_ifwith* names take sockaddr *'s as + * parameters. + */ static void vnet_if_init(const void *unused __unused) @@ -413,11 +439,29 @@ vnet_if_init(const void *unused __unused) CK_STAILQ_INIT(&V_ifnet); CK_STAILQ_INIT(&V_ifg_head); + V_ifindex_table = malloc(V_if_indexlim * sizeof(*V_ifindex_table), + M_IFNET, M_WAITOK | M_ZERO); vnet_if_clone_init(); } VNET_SYSINIT(vnet_if_init, SI_SUB_INIT_IF, SI_ORDER_SECOND, vnet_if_init, NULL); +#ifdef VIMAGE +static void +vnet_if_uninit(const void *unused __unused) +{ + + VNET_ASSERT(CK_STAILQ_EMPTY(&V_ifnet), ("%s:%d tailq &V_ifnet=%p " + "not empty", __func__, __LINE__, &V_ifnet)); + VNET_ASSERT(CK_STAILQ_EMPTY(&V_ifg_head), ("%s:%d tailq &V_ifg_head=%p " + "not empty", __func__, __LINE__, &V_ifg_head)); + + free((caddr_t)V_ifindex_table, M_IFNET); +} +VNET_SYSUNINIT(vnet_if_uninit, SI_SUB_INIT_IF, SI_ORDER_FIRST, + vnet_if_uninit, NULL); +#endif + static void if_link_ifnet(struct ifnet *ifp) { @@ -510,7 +554,6 @@ static struct ifnet * if_alloc_domain(u_char type, int numa_domain) { struct ifnet *ifp; - u_short idx; KASSERT(numa_domain <= IF_NODOM, ("numa_domain too large")); if (numa_domain == IF_NODOM) @@ -550,37 +593,7 @@ if_alloc_domain(u_char type, int numa_domain) ifp->if_get_counter = if_get_counter_default; ifp->if_pcp = IFNET_PCP_NONE; - /* Allocate an ifindex array entry. */ - IFNET_WLOCK(); - /* - * Try to find an empty slot below if_index. If we fail, take the - * next slot. - */ - for (idx = 1; idx <= if_index; idx++) { - if (ifindex_table[idx] == NULL) - break; - } - - /* Catch if_index overflow. */ - if (idx >= if_indexlim) { - struct ifnet **new, **old; - int newlim; - - newlim = if_indexlim * 2; - new = malloc(newlim * sizeof(*new), M_IFNET, M_WAITOK | M_ZERO); - memcpy(new, ifindex_table, if_indexlim * sizeof(*new)); - old = ifindex_table; - ck_pr_store_ptr(&ifindex_table, new); - if_indexlim = newlim; - epoch_wait_preempt(net_epoch_preempt); - free(old, M_IFNET); - } - if (idx > if_index) - if_index = idx; - - ifp->if_index = idx; - ck_pr_store_ptr(&ifindex_table[idx], ifp); - IFNET_WUNLOCK(); + ifindex_alloc(ifp); return (ifp); } @@ -650,18 +663,23 @@ if_free(struct ifnet *ifp) * epoch and then dereferencing ifp while we perform if_free(), * and after if_free() finished, too. * - * This early index freeing was important back when ifindex was - * virtualized and interface would outlive the vnet. + * The reason is the VIMAGE. For some reason it was designed + * to require all sockets drained before destroying, but not all + * ifnets. A vnet destruction calls if_vmove() on ifnet, which + * causes ID change. But ID change and a possible misidentification + * of an ifnet later is a lesser problem, as it doesn't crash kernel. + * A worse problem is that removed interface may outlive the vnet it + * belongs too! The if_free_deferred() would see ifp->if_vnet freed. */ + CURVNET_SET_QUIET(ifp->if_vnet); IFNET_WLOCK(); - MPASS(ifindex_table[ifp->if_index] == ifp); - ck_pr_store_ptr(&ifindex_table[ifp->if_index], NULL); - while (if_index > 0 && ifindex_table[if_index] == NULL) - if_index--; + MPASS(V_ifindex_table[ifp->if_index] == ifp); + ifindex_free(ifp->if_index); IFNET_WUNLOCK(); if (refcount_release(&ifp->if_refcount)) NET_EPOCH_CALL(if_free_deferred, &ifp->if_epoch_ctx); + CURVNET_RESTORE(); } /* @@ -805,7 +823,7 @@ if_attach_internal(struct ifnet *ifp, bool vmove) struct sockaddr_dl *sdl; struct ifaddr *ifa; - MPASS(ifindex_table[ifp->if_index] == ifp); + MPASS(V_ifindex_table[ifp->if_index] == ifp); #ifdef VIMAGE ifp->if_vnet = curvnet; @@ -1255,6 +1273,17 @@ if_vmove(struct ifnet *ifp, struct vnet *new_vnet) if (rc != 0) return (rc); + /* + * Unlink the ifnet from ifindex_table[] in current vnet, and shrink + * the if_index for that vnet if possible. + * + * NOTE: IFNET_WLOCK/IFNET_WUNLOCK() are assumed to be unvirtualized, + * or we'd lock on one vnet and unlock on another. + */ + IFNET_WLOCK(); + ifindex_free(ifp->if_index); + IFNET_WUNLOCK(); + /* * Perform interface-specific reassignment tasks, if provided by * the driver. @@ -1266,6 +1295,7 @@ if_vmove(struct ifnet *ifp, struct vnet *new_vnet) * Switch to the context of the target vnet. */ CURVNET_SET_QUIET(new_vnet); + ifindex_alloc(ifp); if_attach_internal(ifp, true); #ifdef DEV_BPF @@ -1901,6 +1931,7 @@ ifa_ifwithnet(const struct sockaddr *addr, int ignore_ptp, int fibnum) struct ifaddr *ifa_maybe = NULL; u_int af = addr->sa_family; const char *addr_data = addr->sa_data, *cplim; + const struct sockaddr_dl *sdl; NET_EPOCH_ASSERT(); /* @@ -1908,9 +1939,14 @@ ifa_ifwithnet(const struct sockaddr *addr, int ignore_ptp, int fibnum) * so do that if we can. */ if (af == AF_LINK) { - ifp = ifnet_byindex( - ((const struct sockaddr_dl *)addr)->sdl_index); - return (ifp ? ifp->if_addr : NULL); + sdl = (const struct sockaddr_dl *)addr; + if (sdl->sdl_index && sdl->sdl_index <= V_if_index) { + ifp = ifnet_byindex(sdl->sdl_index); + if (ifp == NULL) + return (NULL); + + return (ifp->if_addr); + } } /* @@ -4546,16 +4582,24 @@ DB_SHOW_COMMAND(ifnet, db_show_ifnet) DB_SHOW_ALL_COMMAND(ifnets, db_show_all_ifnets) { + VNET_ITERATOR_DECL(vnet_iter); struct ifnet *ifp; u_short idx; - for (idx = 1; idx <= if_index; idx++) { - ifp = ifindex_table[idx]; - if (ifp == NULL) - continue; - db_printf( "%20s ifp=%p\n", ifp->if_xname, ifp); - if (db_pager_quit) - break; + VNET_FOREACH(vnet_iter) { + CURVNET_SET_QUIET(vnet_iter); +#ifdef VIMAGE + db_printf("vnet=%p\n", curvnet); +#endif + for (idx = 1; idx <= V_if_index; idx++) { + ifp = V_ifindex_table[idx]; + if (ifp == NULL) + continue; + db_printf( "%20s ifp=%p\n", ifp->if_xname, ifp); + if (db_pager_quit) + break; + } + CURVNET_RESTORE(); } } #endif /* DDB */