From nobody Thu May 05 15:34:10 2022
List-Id: Commit messages for all branches of the src repository
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all
Sender: owner-dev-commits-src-all@freebsd.org
References: <202205031728.243HSxlo050076@gitrepo.freebsd.org>
In-Reply-To: <202205031728.243HSxlo050076@gitrepo.freebsd.org>
From: Warner Losh
Date: Thu, 5 May 2022 09:34:10 -0600
Subject: Re: git: d461deeaa4a4 - main - VNET: Revert "ifnet: make if_index global"
To: Marko Zec
Cc: src-committers, dev-commits-src-main@freebsd.org

Marko,

There were ongoing discussions as to what to do about the issues around
VNETs when you made this unilateral commit. The discussions stopped
shortly after. The discussions had been productively headed towards
identification of the problems with the tree, what steps needed to be
taken and how best to proceed. The original commits had followed the
project's processes and gave everybody a chance to comment before they
were committed. Only after the fact did you raise an objection, and now
many months have passed, making this backout untimely.

In addition, after this series of commits, the tree was so functionally
broken that all VNET tests had to be disabled. This represents, in our
view, an unacceptably large blast radius to our testing infrastructure.
We believe that the changes needed to make Gleb's work acceptable will
be much smaller than redoing that work as well. We don't think that
Gleb's work is fundamentally flawed.

Given these circumstances, core views this series of commits as
premature and will be replaying Gleb's commits and re-enabling tests. We
expect Marko, Gleb and everybody else to work together towards the goal
of identifying actionable issues that remain in the tree after our
restoration and to work in good faith to resolve them.
In addition, should people believe that Gleb's direction is wrong and
they want to start with the state prior to the commit, they are welcome
to develop their own solution to present to the community with arguments
about why it is better, and then follow the normal collaborative process
to win people over to that method.

Warner and Kyle on behalf of the core team

On Tue, May 3, 2022 at 11:29 AM Marko Zec <zec@freebsd.org> wrote:

> The branch main has been updated by zec:
>
> URL: https://cgit.FreeBSD.org/src/commit/?id=d461deeaa4a47ae71e1d8fda8b35c6faa8dabe85
>
> commit d461deeaa4a47ae71e1d8fda8b35c6faa8dabe85
> Author:     Marko Zec <zec@FreeBSD.org>
> AuthorDate: 2022-05-03 14:57:55 +0000
> Commit:     Marko Zec <zec@FreeBSD.org>
> CommitDate: 2022-05-03 17:27:57 +0000
>
>     VNET: Revert "ifnet: make if_index global"
>
>     This reverts commit 91f44749c6feb50f39af8805dd803e860f0418f1.
>
>     Devirtualization of V_if_index and V_ifindex_table was rushed into
>     the tree lacking proper context, discussion, and declaration of intent,
>     so I'm backing it out as harmful to VNET on the following grounds:
>
>     1) The change repurposed the decades-old and stable if_index KBI for
>     new, unclear goals which were omitted from the commit note.
>
>     2) The change opened up a new resource exhaustion vector where any vnet
>     could starve the system of ifnet indices, including vnet0.
>
>     3) To circumvent the newly introduced problem of separating ifnets
>     belonging to different vnets from the globalized ifindex_table, the
>     author introduced sysctl_ifcount() which does a linear traversal over
>     the (potentially huge) global ifnet list just to return a simple upper
>     bound on existing ifnet indices.
>
>     4) The change effectively led to nonuniform ifnet index allocation
>     among vnets.
>
>     5) The commit note clearly stated that the patch changed the implicit
>     if_index ABI contract where ifnet indices were assumed to be starting
>     from one.  The commit note also included a correct observation that
>     holes in interface indices were always allowed, but failed to declare
>     that the userland-observable ifindex tables could now include huge
>     empty spans even under modest operating conditions.
>
>     6) The author had an earlier proposal in the works which did not
>     affect per-vnet ifnet lists (D33265) but which he abandoned without
>     providing the rationale behind his decision to do so, at the expense
>     of sacrificing the vnet isolation contract and if_index ABI / KBI.
>
>     Furthermore, the author agreed to back out his changes himself and
>     to follow up with a proposal for a less intrusive alternative, but
>     later silently declined to act.  Therefore, I decided to resolve the
>     status-quo by backing this out myself.  This in no way precludes a
>     future proposal aiming to mitigate ifnet-removal related system
>     crashes or panics to be accepted, provided it would not unnecessarily
>     compromise the goal of as strict as possible isolation between vnets.
>
>     Obtained from: github.com/glebius/FreeBSD/commits/backout-ifindex
> ---
>  sys/net/if.c | 214 +++++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 129 insertions(+), 85 deletions(-)
>
> diff --git a/sys/net/if.c b/sys/net/if.c
> index 3b303fe42e99..de63b9366843 100644
> --- a/sys/net/if.c
> +++ b/sys/net/if.c
> @@ -311,30 +311,19 @@ VNET_DEFINE(struct ifnethead, ifnet);     /* depend on static init XXX */
>  VNET_DEFINE(struct ifgrouphead, ifg_head);
>
>  /* Table of ifnet by index. */
> -static int if_index;
> -static int if_indexlim = 8;
> -static struct ifnet **ifindex_table;
> +VNET_DEFINE_STATIC(int, if_index);
> +#define        V_if_index              VNET(if_index)
> +VNET_DEFINE_STATIC(int, if_indexlim) = 8;
> +#define        V_if_indexlim           VNET(if_indexlim)
> +VNET_DEFINE_STATIC(struct ifnet **, ifindex_table);
> +#define        V_ifindex_table         VNET(ifindex_table)
>
>  SYSCTL_NODE(_net_link_generic, IFMIB_SYSTEM, system,
>      CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
>      "Variables global to all interfaces");
> -static int
> -sysctl_ifcount(SYSCTL_HANDLER_ARGS)
> -{
> -       int rv = 0;
> -
> -       IFNET_RLOCK();
> -       for (int i = 1; i <= if_index; i++)
> -               if (ifindex_table[i] != NULL &&
> -                   ifindex_table[i]->if_vnet == curvnet)
> -                       rv = i;
> -       IFNET_RUNLOCK();
> -
> -       return (sysctl_handle_int(oidp, &rv, 0, req));
> -}
> -SYSCTL_PROC(_net_link_generic_system, IFMIB_IFCOUNT, ifcount,
> -    CTLTYPE_INT | CTLFLAG_VNET | CTLFLAG_RD, NULL, 0, sysctl_ifcount, "I",
> -    "Maximum known interface index");
> +SYSCTL_INT(_net_link_generic_system, IFMIB_IFCOUNT, ifcount,
> +    CTLFLAG_VNET | CTLFLAG_RD, &VNET_NAME(if_index), 0,
> +    "Number of configured interfaces");
>
>  /*
>   * The global network interface list (V_ifnet) and related state (such as
> @@ -363,19 +352,13 @@ MALLOC_DEFINE(M_IFMADDR, "ether_multi", "link-level multicast address");
>  struct ifnet *
>  ifnet_byindex(u_int idx)
>  {
> -       struct ifnet *ifp;
>
>         NET_EPOCH_ASSERT();
>
> -       if (__predict_false(idx > if_index))
> +       if (__predict_false(idx > V_if_index))
>                 return (NULL);
>
> -       ifp = ck_pr_load_ptr(&ifindex_table[idx]);
> -
> -       if (curvnet != NULL && ifp != NULL && ifp->if_vnet != curvnet)
> -               ifp = NULL;
> -
> -       return (ifp);
> +       return (ck_pr_load_ptr(&V_ifindex_table[idx]));
>  }
>
>  struct ifnet *
> @@ -392,20 +375,63 @@ ifnet_byindex_ref(u_int idx)
>  }
>
>  /*
> - * Network interface utility routines.
> - *
> - * Routines with ifa_ifwith* names take sockaddr *'s as
> - * parameters.
> + * Allocate an ifindex array entry.
>   */
> +static void
> +ifindex_alloc(struct ifnet *ifp)
> +{
> +       u_short idx;
> +
> +       IFNET_WLOCK();
> +       /*
> +        * Try to find an empty slot below V_if_index.  If we fail, take the
> +        * next slot.
> +        */
> +       for (idx = 1; idx <= V_if_index; idx++) {
> +               if (V_ifindex_table[idx] == NULL)
> +                       break;
> +       }
> +
> +       /* Catch if_index overflow. */
> +       if (idx >= V_if_indexlim) {
> +               struct ifnet **new, **old;
> +               int newlim;
> +
> +               newlim = V_if_indexlim * 2;
> +               new = malloc(newlim * sizeof(*new), M_IFNET, M_WAITOK | M_ZERO);
> +               memcpy(new, V_ifindex_table, V_if_indexlim * sizeof(*new));
> +               old = V_ifindex_table;
> +               ck_pr_store_ptr(&V_ifindex_table, new);
> +               V_if_indexlim = newlim;
> +               epoch_wait_preempt(net_epoch_preempt);
> +               free(old, M_IFNET);
> +       }
> +       if (idx > V_if_index)
> +               V_if_index = idx;
> +
> +       ifp->if_index = idx;
> +       ck_pr_store_ptr(&V_ifindex_table[idx], ifp);
> +       IFNET_WUNLOCK();
> +}
>
>  static void
> -if_init(void *arg __unused)
> +ifindex_free(u_short idx)
>  {
>
> -       ifindex_table = malloc(if_indexlim * sizeof(*ifindex_table),
> -           M_IFNET, M_WAITOK | M_ZERO);
> +       IFNET_WLOCK_ASSERT();
> +
> +       ck_pr_store_ptr(&V_ifindex_table[idx], NULL);
> +       while (V_if_index > 0 &&
> +           V_ifindex_table[V_if_index] == NULL)
> +               V_if_index--;
>  }
> -SYSINIT(if_init, SI_SUB_INIT_IF, SI_ORDER_SECOND, if_init, NULL);
> +
> +/*
> + * Network interface utility routines.
> + *
> + * Routines with ifa_ifwith* names take sockaddr *'s as
> + * parameters.
> + */
>
>  static void
>  vnet_if_init(const void *unused __unused)
> @@ -413,11 +439,29 @@ vnet_if_init(const void *unused __unused)
>
>         CK_STAILQ_INIT(&V_ifnet);
>         CK_STAILQ_INIT(&V_ifg_head);
> +       V_ifindex_table = malloc(V_if_indexlim * sizeof(*V_ifindex_table),
> +           M_IFNET, M_WAITOK | M_ZERO);
>         vnet_if_clone_init();
>  }
>  VNET_SYSINIT(vnet_if_init, SI_SUB_INIT_IF, SI_ORDER_SECOND, vnet_if_init,
>      NULL);
>
> +#ifdef VIMAGE
> +static void
> +vnet_if_uninit(const void *unused __unused)
> +{
> +
> +       VNET_ASSERT(CK_STAILQ_EMPTY(&V_ifnet), ("%s:%d tailq &V_ifnet=%p "
> +           "not empty", __func__, __LINE__, &V_ifnet));
> +       VNET_ASSERT(CK_STAILQ_EMPTY(&V_ifg_head), ("%s:%d tailq &V_ifg_head=%p "
> +           "not empty", __func__, __LINE__, &V_ifg_head));
> +
> +       free((caddr_t)V_ifindex_table, M_IFNET);
> +}
> +VNET_SYSUNINIT(vnet_if_uninit, SI_SUB_INIT_IF, SI_ORDER_FIRST,
> +    vnet_if_uninit, NULL);
> +#endif
> +
>  static void
>  if_link_ifnet(struct ifnet *ifp)
>  {
> @@ -510,7 +554,6 @@ static struct ifnet *
>  if_alloc_domain(u_char type, int numa_domain)
>  {
>         struct ifnet *ifp;
> -       u_short idx;
>
>         KASSERT(numa_domain <= IF_NODOM, ("numa_domain too large"));
>         if (numa_domain == IF_NODOM)
> @@ -550,37 +593,7 @@ if_alloc_domain(u_char type, int numa_domain)
>         ifp->if_get_counter = if_get_counter_default;
>         ifp->if_pcp = IFNET_PCP_NONE;
>
> -       /* Allocate an ifindex array entry. */
> -       IFNET_WLOCK();
> -       /*
> -        * Try to find an empty slot below if_index.  If we fail, take the
> -        * next slot.
> -        */
> -       for (idx = 1; idx <= if_index; idx++) {
> -               if (ifindex_table[idx] == NULL)
> -                       break;
> -       }
> -
> -       /* Catch if_index overflow. */
> -       if (idx >= if_indexlim) {
> -               struct ifnet **new, **old;
> -               int newlim;
> -
> -               newlim = if_indexlim * 2;
> -               new = malloc(newlim * sizeof(*new), M_IFNET, M_WAITOK | M_ZERO);
> -               memcpy(new, ifindex_table, if_indexlim * sizeof(*new));
> -               old = ifindex_table;
> -               ck_pr_store_ptr(&ifindex_table, new);
> -               if_indexlim = newlim;
> -               epoch_wait_preempt(net_epoch_preempt);
> -               free(old, M_IFNET);
> -       }
> -       if (idx > if_index)
> -               if_index = idx;
> -
> -       ifp->if_index = idx;
> -       ck_pr_store_ptr(&ifindex_table[idx], ifp);
> -       IFNET_WUNLOCK();
> +       ifindex_alloc(ifp);
>
>         return (ifp);
>  }
> @@ -650,18 +663,23 @@ if_free(struct ifnet *ifp)
>          * epoch and then dereferencing ifp while we perform if_free(),
>          * and after if_free() finished, too.
>          *
> -        * This early index freeing was important back when ifindex was
> -        * virtualized and interface would outlive the vnet.
> +        * The reason is the VIMAGE.  For some reason it was designed
> +        * to require all sockets drained before destroying, but not all
> +        * ifnets.  A vnet destruction calls if_vmove() on ifnet, which
> +        * causes ID change.  But ID change and a possible misidentification
> +        * of an ifnet later is a lesser problem, as it doesn't crash kernel.
> +        * A worse problem is that removed interface may outlive the vnet it
> +        * belongs too!  The if_free_deferred() would see ifp->if_vnet freed.
>          */
> +       CURVNET_SET_QUIET(ifp->if_vnet);
>         IFNET_WLOCK();
> -       MPASS(ifindex_table[ifp->if_index] == ifp);
> -       ck_pr_store_ptr(&ifindex_table[ifp->if_index], NULL);
> -       while (if_index > 0 && ifindex_table[if_index] == NULL)
> -               if_index--;
> +       MPASS(V_ifindex_table[ifp->if_index] == ifp);
> +       ifindex_free(ifp->if_index);
>         IFNET_WUNLOCK();
>
>         if (refcount_release(&ifp->if_refcount))
>                 NET_EPOCH_CALL(if_free_deferred, &ifp->if_epoch_ctx);
> +       CURVNET_RESTORE();
>  }
>
>  /*
> @@ -805,7 +823,7 @@ if_attach_internal(struct ifnet *ifp, bool vmove)
>         struct sockaddr_dl *sdl;
>         struct ifaddr *ifa;
>
> -       MPASS(ifindex_table[ifp->if_index] == ifp);
> +       MPASS(V_ifindex_table[ifp->if_index] == ifp);
>
>  #ifdef VIMAGE
>         ifp->if_vnet = curvnet;
> @@ -1255,6 +1273,17 @@ if_vmove(struct ifnet *ifp, struct vnet *new_vnet)
>         if (rc != 0)
>                 return (rc);
>
> +       /*
> +        * Unlink the ifnet from ifindex_table[] in current vnet, and shrink
> +        * the if_index for that vnet if possible.
> +        *
> +        * NOTE: IFNET_WLOCK/IFNET_WUNLOCK() are assumed to be unvirtualized,
> +        * or we'd lock on one vnet and unlock on another.
> +        */
> +       IFNET_WLOCK();
> +       ifindex_free(ifp->if_index);
> +       IFNET_WUNLOCK();
> +
>         /*
>          * Perform interface-specific reassignment tasks, if provided by
>          * the driver.
> @@ -1266,6 +1295,7 @@ if_vmove(struct ifnet *ifp, struct vnet *new_vnet)
>          * Switch to the context of the target vnet.
>          */
>         CURVNET_SET_QUIET(new_vnet);
> +       ifindex_alloc(ifp);
>         if_attach_internal(ifp, true);
>
>  #ifdef DEV_BPF
> @@ -1901,6 +1931,7 @@ ifa_ifwithnet(const struct sockaddr *addr, int ignore_ptp, int fibnum)
>         struct ifaddr *ifa_maybe = NULL;
>         u_int af = addr->sa_family;
>         const char *addr_data = addr->sa_data, *cplim;
> +       const struct sockaddr_dl *sdl;
>
>         NET_EPOCH_ASSERT();
>         /*
> @@ -1908,9 +1939,14 @@ ifa_ifwithnet(const struct sockaddr *addr, int ignore_ptp, int fibnum)
>          * so do that if we can.
>          */
>         if (af == AF_LINK) {
> -               ifp = ifnet_byindex(
> -                   ((const struct sockaddr_dl *)addr)->sdl_index);
> -               return (ifp ? ifp->if_addr : NULL);
> +               sdl = (const struct sockaddr_dl *)addr;
> +               if (sdl->sdl_index && sdl->sdl_index <= V_if_index) {
> +                       ifp = ifnet_byindex(sdl->sdl_index);
> +                       if (ifp == NULL)
> +                               return (NULL);
> +
> +                       return (ifp->if_addr);
> +               }
>         }
>
>         /*
> @@ -4546,16 +4582,24 @@ DB_SHOW_COMMAND(ifnet, db_show_ifnet)
>
>  DB_SHOW_ALL_COMMAND(ifnets, db_show_all_ifnets)
>  {
> +       VNET_ITERATOR_DECL(vnet_iter);
>         struct ifnet *ifp;
>         u_short idx;
>
> -       for (idx = 1; idx <= if_index; idx++) {
> -               ifp = ifindex_table[idx];
> -               if (ifp == NULL)
> -                       continue;
> -               db_printf( "%20s ifp=%p\n", ifp->if_xname, ifp);
> -               if (db_pager_quit)
> -                       break;
> +       VNET_FOREACH(vnet_iter) {
> +               CURVNET_SET_QUIET(vnet_iter);
> +#ifdef VIMAGE
> +               db_printf("vnet=%p\n", curvnet);
> +#endif
> +               for (idx = 1; idx <= V_if_index; idx++) {
> +                       ifp = V_ifindex_table[idx];
> +                       if (ifp == NULL)
> +                               continue;
> +                       db_printf( "%20s ifp=%p\n", ifp->if_xname, ifp);
> +                       if (db_pager_quit)
> +                               break;
> +               }
> +               CURVNET_RESTORE();
>         }
>  }
>  #endif /* DDB */
>
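For readers following the quoted diff, here is a minimal userspace sketch of the slot-allocation scheme that ifindex_alloc() and ifindex_free() appear to implement: take the lowest free index starting at 1, double the table when it fills, and shrink the high-water mark on free. The names struct idx_table, idx_table_init(), idx_alloc() and idx_free() are hypothetical and not part of the FreeBSD tree, and the kernel's IFNET_WLOCK locking and epoch-based reclamation of the old table are deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for one index table (e.g. one per vnet). */
struct idx_table {
	void	**slots;	/* slots[0] is never used; indices start at 1 */
	int	  lim;		/* allocated length of slots[] */
	int	  high;		/* highest index currently in use */
};

static void
idx_table_init(struct idx_table *t)
{
	t->lim = 8;		/* same initial size as if_indexlim in the diff */
	t->high = 0;
	t->slots = calloc(t->lim, sizeof(*t->slots));
	assert(t->slots != NULL);
}

/* Take the lowest free slot at or below `high`, else the next one up. */
static int
idx_alloc(struct idx_table *t, void *obj)
{
	int idx;

	for (idx = 1; idx <= t->high; idx++)
		if (t->slots[idx] == NULL)
			break;

	if (idx >= t->lim) {
		/* Table full: double it, keeping the existing entries. */
		int newlim = t->lim * 2;
		void **new = calloc(newlim, sizeof(*new));

		assert(new != NULL);
		memcpy(new, t->slots, t->lim * sizeof(*new));
		free(t->slots);
		t->slots = new;
		t->lim = newlim;
	}
	if (idx > t->high)
		t->high = idx;
	t->slots[idx] = obj;
	return (idx);
}

/* Clear a slot and shrink `high` past any trailing empty slots. */
static void
idx_free(struct idx_table *t, int idx)
{
	t->slots[idx] = NULL;
	while (t->high > 0 && t->slots[t->high] == NULL)
		t->high--;
}
```

Giving each vnet its own idx_table models the per-vnet numbering that the revert restores, where every table counts from 1 independently. Sharing a single table among all callers would model the global variant that the quoted commit message objects to.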
Marko,

There were ongoing discussions as to what to= do about the issues around VNETs when you made this unilateral commit. The= discussions stopped shortly after. The discussions had been productively h= eaded towards identification of the problems with the tree, what steps need= ed to be taken and how best to proceed. The original commits had followed t= he project's processes and gave everybody a chance to comment before th= ey were committed. Only after the fact did you raise an objection, and now = many months have passed, making this backout untimely.

In addition, = after this series of commits, the tree was so functionally broken that all = VNET tests had to be disabled. This represents, in our view, an unacceptabl= y large blast radius to our testing infrastructure. We believe that the cha= nges needed to make Gleb's work acceptable will be much smaller than re= doing that work as well. We don't think that Gleb's work is fundame= ntally flawed.

Given these circumstances, core views this series of = commits as premature =C2=A0and will be replaying Gleb's commits and re-= enabling tests. We expect Marko, Gleb and everybody else to work together t= owards the goal of identifying actionable issues that remain in the tree af= ter our restoration and to work in good faith to resolve them. In addition,= should people believe that Gleb's direction is wrong and they want to = start with the state prior to the commit, they are welcome to develop their= own solution to present to the community with arguments about why it is be= tter, and then follow the normal collaberative process to win people over t= o that method.

Warner and Kyle on behalf of the core= team

On Tue, May 3, 2022 at 11:29 AM Marko Zec <zec@freebsd.org> wrote:
The branch main has been updated by zec:
URL: https://cgit.= FreeBSD.org/src/commit/?id=3Dd461deeaa4a47ae71e1d8fda8b35c6faa8dabe85
commit d461deeaa4a47ae71e1d8fda8b35c6faa8dabe85
Author:=C2=A0 =C2=A0 =C2=A0Marko Zec <zec@FreeBSD.org>
AuthorDate: 2022-05-03 14:57:55 +0000
Commit:=C2=A0 =C2=A0 =C2=A0Marko Zec <zec@FreeBSD.org>
CommitDate: 2022-05-03 17:27:57 +0000

=C2=A0 =C2=A0 VNET: Revert "ifnet: make if_index global"

=C2=A0 =C2=A0 This reverts commit 91f44749c6feb50f39af8805dd803e860f0418f1.=

=C2=A0 =C2=A0 Devirtualization of V_if_index and V_ifindex_table was rushed= into
=C2=A0 =C2=A0 the tree lacking proper context, discussion, and declaration = of intent,
=C2=A0 =C2=A0 so I'm backing it out as harmful to VNET on the following= grounds:

=C2=A0 =C2=A0 1) The change repurposed the decades-old and stable if_index = KBI for
=C2=A0 =C2=A0 new, unclear goals which were omitted from the commit note.
=C2=A0 =C2=A0 2) The change opened up a new resource exhaustion vector wher= e any vnet
=C2=A0 =C2=A0 could starve the system of ifnet indices, including vnet0.
=C2=A0 =C2=A0 3) To circumvent the newly introduced problem of separating i= fnets
=C2=A0 =C2=A0 belonging to different vnets from the globalized ifindex_tabl= e, the
=C2=A0 =C2=A0 author introduced sysctl_ifcount() which does a linear traver= sal over
=C2=A0 =C2=A0 the (potentially huge) global ifnet list just to return a sim= ple upper
=C2=A0 =C2=A0 bound on existing ifnet indices.

=C2=A0 =C2=A0 4) The change effectively led to nonuniform ifnet index alloc= ation
=C2=A0 =C2=A0 among vnets.

=C2=A0 =C2=A0 5) The commit note clearly stated that the patch changed the = implicit
=C2=A0 =C2=A0 if_index ABI contract where ifnet indices were assumed to be = starting
=C2=A0 =C2=A0 from one.=C2=A0 The commit note also included a correct obser= vation that
=C2=A0 =C2=A0 holes in interface indices were always allowed, but failed to= declare
=C2=A0 =C2=A0 that the userland-observable ifindex tables could now include= huge
=C2=A0 =C2=A0 empty spans even under modest operating conditions.

=C2=A0 =C2=A0 6) The author had an earlier proposal in the works which did = not
=C2=A0 =C2=A0 affect per-vnet ifnet lists (D33265) but which he abandoned w= ithout
=C2=A0 =C2=A0 providing the rationale behind his decision to do so, at the = expense
=C2=A0 =C2=A0 of sacrificing the vnet isolation contract and if_index ABI /= KBI.

=C2=A0 =C2=A0 Furthermore, the author agreed to back out his changes himsel= f and
=C2=A0 =C2=A0 to follow up with a proposal for a less intrusive alternative= , but
=C2=A0 =C2=A0 later silently declined to act.=C2=A0 Therefore, I decided to= resolve the
=C2=A0 =C2=A0 status-quo by backing this out myself.=C2=A0 This in no way p= recludes a
=C2=A0 =C2=A0 future proposal aiming to mitigate ifnet-removal related syst= em
=C2=A0 =C2=A0 crashes or panics to be accepted, provided it would not unnec= essarily
=C2=A0 =C2=A0 compromise the goal of as strict as possible isolation betwee= n vnets.

=C2=A0 =C2=A0 Obtained from: github.com/gle= bius/FreeBSD/commits/backout-ifindex
---
=C2=A0sys/net/if.c | 214 +++++++++++++++++++++++++++++++++++---------------= ---------
=C2=A01 file changed, 129 insertions(+), 85 deletions(-)

diff --git a/sys/net/if.c b/sys/net/if.c
index 3b303fe42e99..de63b9366843 100644
--- a/sys/net/if.c
+++ b/sys/net/if.c
@@ -311,30 +311,19 @@ VNET_DEFINE(struct ifnethead, ifnet);=C2=A0 =C2=A0 = =C2=A0/* depend on static init XXX */
=C2=A0VNET_DEFINE(struct ifgrouphead, ifg_head);

=C2=A0/* Table of ifnet by index. */
-static int if_index;
-static int if_indexlim =3D 8;
-static struct ifnet **ifindex_table;
+VNET_DEFINE_STATIC(int, if_index);
+#define=C2=A0 =C2=A0 =C2=A0 =C2=A0 V_if_index=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 VNET(if_index)
+VNET_DEFINE_STATIC(int, if_indexlim) =3D 8;
+#define=C2=A0 =C2=A0 =C2=A0 =C2=A0 V_if_indexlim=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0VNET(if_indexlim)
+VNET_DEFINE_STATIC(struct ifnet **, ifindex_table);
+#define=C2=A0 =C2=A0 =C2=A0 =C2=A0 V_ifindex_table=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0VNET(ifindex_table)

=C2=A0SYSCTL_NODE(_net_link_generic, IFMIB_SYSTEM, system,
=C2=A0 =C2=A0 =C2=A0CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
=C2=A0 =C2=A0 =C2=A0"Variables global to all interfaces");
-static int
-sysctl_ifcount(SYSCTL_HANDLER_ARGS)
-{
-=C2=A0 =C2=A0 =C2=A0 =C2=A0int rv =3D 0;
-
-=C2=A0 =C2=A0 =C2=A0 =C2=A0IFNET_RLOCK();
-=C2=A0 =C2=A0 =C2=A0 =C2=A0for (int i =3D 1; i <=3D if_index; i++)
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (ifindex_table[i= ] !=3D NULL &&
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ifind= ex_table[i]->if_vnet =3D=3D curvnet)
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0rv =3D i;
-=C2=A0 =C2=A0 =C2=A0 =C2=A0IFNET_RUNLOCK();
-
-=C2=A0 =C2=A0 =C2=A0 =C2=A0return (sysctl_handle_int(oidp, &rv, 0, req= ));
-}
-SYSCTL_PROC(_net_link_generic_system, IFMIB_IFCOUNT, ifcount,
-=C2=A0 =C2=A0 CTLTYPE_INT | CTLFLAG_VNET | CTLFLAG_RD, NULL, 0, sysctl_ifc= ount, "I",
-=C2=A0 =C2=A0 "Maximum known interface index");
+SYSCTL_INT(_net_link_generic_system, IFMIB_IFCOUNT, ifcount,
+=C2=A0 =C2=A0 CTLFLAG_VNET | CTLFLAG_RD, &VNET_NAME(if_index), 0,
+=C2=A0 =C2=A0 "Number of configured interfaces");

=C2=A0/*
=C2=A0 * The global network interface list (V_ifnet) and related state (suc= h as
@@ -363,19 +352,13 @@ MALLOC_DEFINE(M_IFMADDR, "ether_multi", &qu= ot;link-level multicast address");
=C2=A0struct ifnet *
=C2=A0ifnet_byindex(u_int idx)
=C2=A0{
-=C2=A0 =C2=A0 =C2=A0 =C2=A0struct ifnet *ifp;

=C2=A0 =C2=A0 =C2=A0 =C2=A0 NET_EPOCH_ASSERT();

-=C2=A0 =C2=A0 =C2=A0 =C2=A0if (__predict_false(idx > if_index))
+=C2=A0 =C2=A0 =C2=A0 =C2=A0if (__predict_false(idx > V_if_index))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 return (NULL);

-=C2=A0 =C2=A0 =C2=A0 =C2=A0ifp =3D ck_pr_load_ptr(&ifindex_table[idx])= ;
-
-=C2=A0 =C2=A0 =C2=A0 =C2=A0if (curvnet !=3D NULL && ifp !=3D NULL = && ifp->if_vnet !=3D curvnet)
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ifp =3D NULL;
-
-=C2=A0 =C2=A0 =C2=A0 =C2=A0return (ifp);
+=C2=A0 =C2=A0 =C2=A0 =C2=A0return (ck_pr_load_ptr(&V_ifindex_table[idx= ]));
=C2=A0}

=C2=A0struct ifnet *
@@ -392,20 +375,63 @@ ifnet_byindex_ref(u_int idx)
=C2=A0}

=C2=A0/*
- * Network interface utility routines.
- *
- * Routines with ifa_ifwith* names take sockaddr *'s as
- * parameters.
+ * Allocate an ifindex array entry.
=C2=A0 */
+static void
+ifindex_alloc(struct ifnet *ifp)
+{
+=C2=A0 =C2=A0 =C2=A0 =C2=A0u_short idx;
+
+=C2=A0 =C2=A0 =C2=A0 =C2=A0IFNET_WLOCK();
+=C2=A0 =C2=A0 =C2=A0 =C2=A0/*
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 * Try to find an empty slot below V_if_index.= =C2=A0 If we fail, take the
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 * next slot.
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 */
+=C2=A0 =C2=A0 =C2=A0 =C2=A0for (idx =3D 1; idx <=3D V_if_index; idx++) = {
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (V_ifindex_table= [idx] =3D=3D NULL)
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0break;
+=C2=A0 =C2=A0 =C2=A0 =C2=A0}
+
+=C2=A0 =C2=A0 =C2=A0 =C2=A0/* Catch if_index overflow. */
+=C2=A0 =C2=A0 =C2=A0 =C2=A0if (idx >=3D V_if_indexlim) {
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0struct ifnet **new,= **old;
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0int newlim;
+
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0newlim =3D V_if_ind= exlim * 2;
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0new =3D malloc(newl= im * sizeof(*new), M_IFNET, M_WAITOK | M_ZERO);
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0memcpy(new, V_ifind= ex_table, V_if_indexlim * sizeof(*new));
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0old =3D V_ifindex_t= able;
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ck_pr_store_ptr(&am= p;V_ifindex_table, new);
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0V_if_indexlim =3D n= ewlim;
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0epoch_wait_preempt(= net_epoch_preempt);
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0free(old, M_IFNET);=
+=C2=A0 =C2=A0 =C2=A0 =C2=A0}
+=C2=A0 =C2=A0 =C2=A0 =C2=A0if (idx > V_if_index)
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0V_if_index =3D idx;=
+
+=C2=A0 =C2=A0 =C2=A0 =C2=A0ifp->if_index =3D idx;
+=C2=A0 =C2=A0 =C2=A0 =C2=A0ck_pr_store_ptr(&V_ifindex_table[idx], ifp)= ;
+=C2=A0 =C2=A0 =C2=A0 =C2=A0IFNET_WUNLOCK();
+}

=C2=A0static void
-if_init(void *arg __unused)
+ifindex_free(u_short idx)
=C2=A0{

-=C2=A0 =C2=A0 =C2=A0 =C2=A0ifindex_table =3D malloc(if_indexlim * sizeof(*= ifindex_table),
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0M_IFNET, M_WAITOK | M_ZERO);
+=C2=A0 =C2=A0 =C2=A0 =C2=A0IFNET_WLOCK_ASSERT();
+
+=C2=A0 =C2=A0 =C2=A0 =C2=A0ck_pr_store_ptr(&V_ifindex_table[idx], NULL= );
+=C2=A0 =C2=A0 =C2=A0 =C2=A0while (V_if_index > 0 &&
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0V_ifindex_table[V_if_index] =3D= =3D NULL)
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0V_if_index--;
=C2=A0}
-SYSINIT(if_init, SI_SUB_INIT_IF, SI_ORDER_SECOND, if_init, NULL);
+
+/*
+ * Network interface utility routines.
+ *
+ * Routines with ifa_ifwith* names take sockaddr *'s as
+ * parameters.
+ */

=C2=A0static void
=C2=A0vnet_if_init(const void *unused __unused)
@@ -413,11 +439,29 @@ vnet_if_init(const void *unused __unused)

=C2=A0 =C2=A0 =C2=A0 =C2=A0 CK_STAILQ_INIT(&V_ifnet);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 CK_STAILQ_INIT(&V_ifg_head);
+=C2=A0 =C2=A0 =C2=A0 =C2=A0V_ifindex_table =3D malloc(V_if_indexlim * size= of(*V_ifindex_table),
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0M_IFNET, M_WAITOK | M_ZERO);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 vnet_if_clone_init();
=C2=A0}
=C2=A0VNET_SYSINIT(vnet_if_init, SI_SUB_INIT_IF, SI_ORDER_SECOND, vnet_if_i= nit,
=C2=A0 =C2=A0 =C2=A0NULL);

+#ifdef VIMAGE
+static void
+vnet_if_uninit(const void *unused __unused)
+{
+
+=C2=A0 =C2=A0 =C2=A0 =C2=A0VNET_ASSERT(CK_STAILQ_EMPTY(&V_ifnet), (&qu= ot;%s:%d tailq &V_ifnet=3D%p "
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0"not empty", __func__, = __LINE__, &V_ifnet));
+=C2=A0 =C2=A0 =C2=A0 =C2=A0VNET_ASSERT(CK_STAILQ_EMPTY(&V_ifg_head), (= "%s:%d tailq &V_ifg_head=3D%p "
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0"not empty", __func__, = __LINE__, &V_ifg_head));
+
+=C2=A0 =C2=A0 =C2=A0 =C2=A0free((caddr_t)V_ifindex_table, M_IFNET);
+}
+VNET_SYSUNINIT(vnet_if_uninit, SI_SUB_INIT_IF, SI_ORDER_FIRST,
+=C2=A0 =C2=A0 vnet_if_uninit, NULL);
+#endif
+
=C2=A0static void
=C2=A0if_link_ifnet(struct ifnet *ifp)
=C2=A0{
@@ -510,7 +554,6 @@ static struct ifnet *
=C2=A0if_alloc_domain(u_char type, int numa_domain)
=C2=A0{
=C2=A0 =C2=A0 =C2=A0 =C2=A0 struct ifnet *ifp;
-=C2=A0 =C2=A0 =C2=A0 =C2=A0u_short idx;

=C2=A0 =C2=A0 =C2=A0 =C2=A0 KASSERT(numa_domain <=3D IF_NODOM, ("nu= ma_domain too large"));
=C2=A0 =C2=A0 =C2=A0 =C2=A0 if (numa_domain =3D=3D IF_NODOM)
@@ -550,37 +593,7 @@ if_alloc_domain(u_char type, int numa_domain)
        ifp->if_get_counter = if_get_counter_default;
        ifp->if_pcp = IFNET_PCP_NONE;

-       /* Allocate an ifindex array entry. */
-       IFNET_WLOCK();
-       /*
-        * Try to find an empty slot below if_index.  If we fail, take the
-        * next slot.
-        */
-       for (idx = 1; idx <= if_index; idx++) {
-               if (ifindex_table[idx] == NULL)
-                       break;
-       }
-
-       /* Catch if_index overflow. */
-       if (idx >= if_indexlim) {
-               struct ifnet **new, **old;
-               int newlim;
-
-               newlim = if_indexlim * 2;
-               new = malloc(newlim * sizeof(*new), M_IFNET, M_WAITOK | M_ZERO);
-               memcpy(new, ifindex_table, if_indexlim * sizeof(*new));
-               old = ifindex_table;
-               ck_pr_store_ptr(&ifindex_table, new);
-               if_indexlim = newlim;
-               epoch_wait_preempt(net_epoch_preempt);
-               free(old, M_IFNET);
-       }
-       if (idx > if_index)
-               if_index = idx;
-
-       ifp->if_index = idx;
-       ck_pr_store_ptr(&ifindex_table[idx], ifp);
-       IFNET_WUNLOCK();
+       ifindex_alloc(ifp);

        return (ifp);
 }
@@ -650,18 +663,23 @@ if_free(struct ifnet *ifp)
         * epoch and then dereferencing ifp while we perform if_free(),
         * and after if_free() finished, too.
         *
-        * This early index freeing was important back when ifindex was
-        * virtualized and interface would outlive the vnet.
+        * The reason is the VIMAGE.  For some reason it was designed
+        * to require all sockets drained before destroying, but not all
+        * ifnets.  A vnet destruction calls if_vmove() on ifnet, which
+        * causes ID change.  But ID change and a possible misidentification
+        * of an ifnet later is a lesser problem, as it doesn't crash kernel.
+        * A worse problem is that removed interface may outlive the vnet it
+        * belongs too!  The if_free_deferred() would see ifp->if_vnet freed.
         */
+       CURVNET_SET_QUIET(ifp->if_vnet);
        IFNET_WLOCK();
-       MPASS(ifindex_table[ifp->if_index] == ifp);
-       ck_pr_store_ptr(&ifindex_table[ifp->if_index], NULL);
-       while (if_index > 0 && ifindex_table[if_index] == NULL)
-               if_index--;
+       MPASS(V_ifindex_table[ifp->if_index] == ifp);
+       ifindex_free(ifp->if_index);
        IFNET_WUNLOCK();

        if (refcount_release(&ifp->if_refcount))
                NET_EPOCH_CALL(if_free_deferred, &ifp->if_epoch_ctx);
+       CURVNET_RESTORE();
 }

 /*
@@ -805,7 +823,7 @@ if_attach_internal(struct ifnet *ifp, bool vmove)
        struct sockaddr_dl *sdl;
        struct ifaddr *ifa;

-       MPASS(ifindex_table[ifp->if_index] == ifp);
+       MPASS(V_ifindex_table[ifp->if_index] == ifp);

 #ifdef VIMAGE
        ifp->if_vnet = curvnet;
@@ -1255,6 +1273,17 @@ if_vmove(struct ifnet *ifp, struct vnet *new_vnet)
        if (rc != 0)
                return (rc);

+       /*
+        * Unlink the ifnet from ifindex_table[] in current vnet, and shrink
+        * the if_index for that vnet if possible.
+        *
+        * NOTE: IFNET_WLOCK/IFNET_WUNLOCK() are assumed to be unvirtualized,
+        * or we'd lock on one vnet and unlock on another.
+        */
+       IFNET_WLOCK();
+       ifindex_free(ifp->if_index);
+       IFNET_WUNLOCK();
+
        /*
         * Perform interface-specific reassignment tasks, if provided by
         * the driver.
@@ -1266,6 +1295,7 @@ if_vmove(struct ifnet *ifp, struct vnet *new_vnet)
         * Switch to the context of the target vnet.
         */
        CURVNET_SET_QUIET(new_vnet);
+       ifindex_alloc(ifp);
        if_attach_internal(ifp, true);

 #ifdef DEV_BPF
@@ -1901,6 +1931,7 @@ ifa_ifwithnet(const struct sockaddr *addr, int ignore_ptp, int fibnum)
        struct ifaddr *ifa_maybe = NULL;
        u_int af = addr->sa_family;
        const char *addr_data = addr->sa_data, *cplim;
+       const struct sockaddr_dl *sdl;

        NET_EPOCH_ASSERT();
        /*
@@ -1908,9 +1939,14 @@ ifa_ifwithnet(const struct sockaddr *addr, int ignore_ptp, int fibnum)
         * so do that if we can.
         */
        if (af == AF_LINK) {
-               ifp = ifnet_byindex(
-                   ((const struct sockaddr_dl *)addr)->sdl_index);
-               return (ifp ? ifp->if_addr : NULL);
+               sdl = (const struct sockaddr_dl *)addr;
+               if (sdl->sdl_index && sdl->sdl_index <= V_if_index) {
+                       ifp = ifnet_byindex(sdl->sdl_index);
+                       if (ifp == NULL)
+                               return (NULL);
+
+                       return (ifp->if_addr);
+               }
        }

        /*
@@ -4546,16 +4582,24 @@ DB_SHOW_COMMAND(ifnet, db_show_ifnet)

 DB_SHOW_ALL_COMMAND(ifnets, db_show_all_ifnets)
 {
+       VNET_ITERATOR_DECL(vnet_iter);
        struct ifnet *ifp;
        u_short idx;

-       for (idx = 1; idx <= if_index; idx++) {
-               ifp = ifindex_table[idx];
-               if (ifp == NULL)
-                       continue;
-               db_printf( "%20s ifp=%p\n", ifp->if_xname, ifp);
-               if (db_pager_quit)
-                       break;
+       VNET_FOREACH(vnet_iter) {
+               CURVNET_SET_QUIET(vnet_iter);
+#ifdef VIMAGE
+               db_printf("vnet=%p\n", curvnet);
+#endif
+               for (idx = 1; idx <= V_if_index; idx++) {
+                       ifp = V_ifindex_table[idx];
+                       if (ifp == NULL)
+                               continue;
+                       db_printf( "%20s ifp=%p\n", ifp->if_xname, ifp);
+                       if (db_pager_quit)
+                               break;
+               }
+               CURVNET_RESTORE();
        }
 }
 #endif /* DDB */
--000000000000d78ade05de4579d1--