Re: per-FIB socket binding

From: Paul Vixie <paul_at_redbarn.org>
Date: Sun, 12 Jan 2025 07:17:48 UTC
On Saturday, January 11, 2025 4:51:07 PM UTC Mark Johnston wrote:
> On Sat, Jan 11, 2025 at 06:25:22AM +0000, Paul Vixie wrote:
> > On Monday, January 6, 2025 3:56:55 PM UTC Mark Johnston wrote:
> > > On Fri, Dec 27, 2024 at 08:48:48AM +0000, Paul Vixie wrote:
> > ...
> > 
> > 	x = y || z;
> > 
> > so if this is forbidden by today's freebsd kernel rules, please educate
> > me. i know that GCC and Clang will optimize down to the same instruction
> > sequence for either, but i prefer the shorter form since in this rare
> > case it is clearer.
> 
> We are not super consistent about it, but style(9) does prescribe
> explicit comparisons, i.e., "if (count != 0)" rather than "if (count)".
> In any case, I'd add a comment, since that assignment is a bit subtle.

i'll add this comment:

/* ref. ISO/IEC 9899:2023 § 6.5.15 */

...and let the reviewers sort it out.
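
for anyone who wants the semantics spelled out, here is a minimal sketch. the function names are hypothetical and not from the patch; the only point is that the logical OR operator yields an int that is exactly 0 or 1, so the short form and the style(9)-flavoured explicit form compute the same value.

```c
/*
 * Hypothetical helpers, for illustration only (not from the patch).
 * Both return 1 if either argument is nonzero, and 0 otherwise.
 */

/* Short form: the result of || is an int that is exactly 0 or 1. */
static int
either_short(int y, int z)
{
	int x;

	x = y || z;
	return (x);
}

/* Explicit form, with the comparisons written out as style(9) prefers. */
static int
either_explicit(int y, int z)
{
	int x;

	x = (y != 0 || z != 0) ? 1 : 0;
	return (x);
}
```

either_short(5, 0) and either_explicit(5, 0) both return 1, not 5, which is the subtle part the comment is meant to flag.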

> > ... the SYN|ACK will always use the FIB from the interface where
> > the SYN arrived (this is in tcp_syncache.c).
> 
> This isn't clear to me.  The initial SYN will create a syncache entry in the
> 
>   if (tp->t_state == TCPS_LISTEN && SOLISTENING(so))
> 
> case in tcp_input_with_port().  In this case we define inc.inc_fibnum =
> so->so_fibnum, i.e., the FIB of the listening socket.  Then,
> syncache_add() copies inc to sc_inc, so sc_inc.inc_fibnum for the
> syncache entry comes from the listening socket, and syncache_respond()
> sets the SYN|ACK mbuf FIB with M_SETFIB(m, sc->sc_inc.inc_fibnum).  What
> am I missing?  (Yes, I should actually do an experiment to check the
> behaviour.)

this is exactly my understanding of that code. the FIB for the SYN|ACK will be 
the one from the interface that received the SYN. it's only later after we 
receive the ACK that completes the 3-way handshake that we will begin to use 
the FIB of the socket. in terms of your earlier question, this means if we're 
going to drop a SYN because it came in on the wrong interface (some form of 
uRPF) then the firewall module (PF or IPFW) will be doing that to the SYN well 
before a syncache entry is ever created. this seems to imply that the changes 
i'm proposing won't be able to permit connections to complete that would 
not otherwise have completed -- which you correctly predicted could be surprising.

the kernel implementations of uRPF-like features are unaffected.

                /*
                 * net.inet.ip.rfc1122_strong_es: the address matches, verify
                 * that the packet arrived via the correct interface.
                 */
                if (__predict_false(strong_es && ia->ia_ifp != ifp)) {
                        IPSTAT_INC(ips_badaddr);
                        goto bad;
                }

we're not changing the inbound packet's FIB.

                /*
                 * net.inet.ip.source_address_validation: drop incoming
                 * packets that pretend to be ours.
                 */
                if (V_ip_sav && !(ifp->if_flags & IFF_LOOPBACK) &&
                    __predict_false(in_localip_fib(ip->ip_src, ifp->if_fib))) {
                        IPSTAT_INC(ips_badaddr);
                        goto bad;
                }

unless i misunderstand, this only rejects packets if the source address is the 
same as one of the addresses of the interface it arrives on, whereas the 
documentation made me expect that a source address which was any address of 
any interface would be the trigger. i think my proposal doesn't change this.
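
to make the distinction concrete, here is a toy userland model, not the kernel's implementation. every name in it (toy_ifaddr, toy_localip_fib, toy_localip_any) is hypothetical, and it assumes the FIB-scoped check only matches addresses whose owning interface is in the FIB passed in, which is how i read the snippet above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: one local address plus the FIB of the interface owning it. */
struct toy_ifaddr {
	uint32_t addr;	/* IPv4 address, host byte order */
	int	 fib;	/* FIB of the owning interface */
};

/* FIB-scoped check: addr must be local AND belong to the given FIB. */
static bool
toy_localip_fib(const struct toy_ifaddr *tbl, size_t n, uint32_t addr, int fib)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].addr == addr && tbl[i].fib == fib)
			return (true);
	}
	return (false);
}

/* What the documentation led me to expect: addr is local in any FIB. */
static bool
toy_localip_any(const struct toy_ifaddr *tbl, size_t n, uint32_t addr)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].addr == addr)
			return (true);
	}
	return (false);
}
```

with one address in FIB 0 and another in FIB 1, a packet arriving on the FIB 0 interface and claiming the FIB 1 address as its source passes the scoped check but would fail the "any interface" check.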

therefore my focus was on the verrevpath, versrcreach, and antispoof features 
of IPFW; and the antispoof feature of PF. in those tools, the inbound SYN 
would be stopped before it ever reached the SYN cache.

what i then observed is that the only change resulting from my proposal would 
be to the transmission of subsequent segments, which would use the interface's 
FIB (if nonzero) rather than the socket's FIB (if zero). if upstream uRPF 
would have dropped these segments because they're coming from the wrong place, 
then indeed failure would result, and even FIN and RST could not be 
transmitted, which would mean the socket would die a slow death by eventual 
timeout. my proposal would prevent this, but i think any surprise in this case 
would be a welcome one. if actual segments could not be transmitted because 
the socket's FIB did not contain a route back to the connection's source, then 
failure would be more immediate but the resulting surprise (if any) no less 
positive.

i hope that if that analysis is correct no-one will demand a socket option or 
sysctl. to be effective at fixing path symmetry for multihoming, this logic 
has to be the default.

---

having finished with accept(), i'm now testing the bind() and bindat() 
changes, which ought to be effective for any UDP responder that uses a separate 
socket for each interface address it speaks through, which certainly includes 
all NTP and DNS servers i'm aware of, and may also help QUIC.
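
for reference, the socket-per-address pattern looks roughly like the sketch below. open_udp_responder() is a hypothetical helper, not code from any of those servers. it shows the explicit route available today, setting SO_SETFIB by hand before bind(); the proposal is about making the FIB follow from the bound address without that step. the #ifdef is only there so the sketch also compiles on systems without SO_SETFIB.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a UDP socket bound to one local IPv4 address.
 * If fib >= 0 and SO_SETFIB is available, pin the socket to that FIB
 * before bind().  Returns the descriptor, or -1 on any error.
 */
static int
open_udp_responder(const char *addr, uint16_t port, int fib)
{
	struct sockaddr_in sin;
	int s;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	if (inet_pton(AF_INET, addr, &sin.sin_addr) != 1)
		return (-1);

	s = socket(AF_INET, SOCK_DGRAM, 0);
	if (s < 0)
		return (-1);
#ifdef SO_SETFIB
	/* Explicit per-socket FIB selection, as done by hand today. */
	if (fib >= 0 &&
	    setsockopt(s, SOL_SOCKET, SO_SETFIB, &fib, sizeof(fib)) != 0) {
		close(s);
		return (-1);
	}
#else
	(void)fib;	/* no per-socket FIB option on this system */
#endif
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) != 0) {
		close(s);
		return (-1);
	}
	return (s);
}
```

a responder would call this once per interface address and reply on the same descriptor the request arrived on, so each reply leaves via the FIB associated with that socket.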

i've begun to wish for a setfib_fd(2) which would let sshd promote the FIB 
from an inbound connection to be process-wide (after fork() before exec().) 
that way "netstat -rn" would show the routing table actually being used by the 
stdin and stdout of that shell. or perhaps we could just rename SO_SETFIB to 
be SO_FIB and allow getsockopt() to return the FIB? (i'm not sure why it was 
"set only" in the current implementation.) opinions welcomed, as before. 

-- 
Paul Vixie