possible regression handling packet fragmentation in 14.0 with tftp/pxe

From: Gerrit Kühn <gerrit.kuehn_at_aei.mpg.de>
Date: Fri, 19 Apr 2024 13:39:51 UTC
Hello,

I have found something that looks like a regression to me (but it may also
be a bugfix, and I was just relying on the bug earlier :-). Anyway, I
don't fully understand what is going on, maybe someone here has more
insight than I do.

I have various router appliances based on FreeBSD. They act as
NAT-routers, dns/dhcp-servers and vpn-servers (using tinc in switch mode as
vpn solution). I have used these in various incarnations for many years now
(since 8.something afaicr), and the systems work fine up to 13.3. With 14.0 I
hit a strange issue:
Some of the LANs that FreeBSD is acting as NAT-gateway for (using pf for
nat, including scrubbing) contain diskless machines that need to boot off an
NFS-server that is located outside the LAN. To make this possible, the
router and the NFS-server are connected via tinc. On the router, tinc's
virtual TAP-interface is bridged with the physical interface of the LAN:

---
bridge0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP>
metric 0 mtu 1500 options=0
        ether 58:9c:fc:10:ff:ed
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: tap0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 7 priority 128 path cost 2000000
        member: ix3 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 4 priority 128 path cost 2000
        groups: bridge
        nd6 options=9<PERFORMNUD,IFDISABLED>
---

The remote server runs both nfsd for the diskless root and tftpd for
PXE-booting. This was working fine up to 13.3. However, with the router
under 14.0, the first step of the tftp-part (delivering pxelinux.0 from
the syslinux package) fails and ends up in timeouts.

For the following:
192.168.130.3 is the diskless client trying to boot (Linux)
192.168.130.253 is the server for nfsroot and tftp (FreeBSD)
192.168.130.254 is the router and dhcp-server (FreeBSD 13.3/14.0)

The tftpd-server logs the following events in /var/log/xferlog
when the client tries to boot via pxe:

---
Apr 19 11:37:40 192.168.130.253 tftpd[49562]: Filename: 'pxelinux.0'
Apr 19 11:37:40 192.168.130.253 tftpd[49562]: Mode: 'octet'
Apr 19 11:37:40 192.168.130.253 tftpd[49564]: Filename: 'pxelinux.0'
Apr 19 11:37:40 192.168.130.253 tftpd[49564]: Mode: 'octet'
Apr 19 11:37:40 192.168.130.253 tftpd[49564]: 192.168.130.3: read request for //pxelinux.0: success
Apr 19 11:37:45 192.168.130.253 tftpd[49564]: receive_packet: timeout
Apr 19 11:37:45 192.168.130.253 tftpd[49564]: Timeout #0 on ACK 1
Apr 19 11:37:50 192.168.130.253 tftpd[49564]: receive_packet: timeout
Apr 19 11:37:50 192.168.130.253 tftpd[49564]: Timeout #1 on ACK 1
Apr 19 11:37:55 192.168.130.253 tftpd[49564]: receive_packet: timeout
Apr 19 11:37:55 192.168.130.253 tftpd[49564]: Timeout #2 on ACK 1
Apr 19 11:38:00 192.168.130.253 tftpd[49564]: receive_packet: timeout
Apr 19 11:38:00 192.168.130.253 tftpd[49564]: Timeout #3 on ACK 1
Apr 19 11:38:05 192.168.130.253 tftpd[49564]: receive_packet: timeout
Apr 19 11:38:05 192.168.130.253 tftpd[49564]: Timeout #4 on ACK 1
Apr 19 11:38:10 192.168.130.253 tftpd[49564]: receive_packet: timeout
Apr 19 11:38:10 192.168.130.253 tftpd[49564]: Timeout #5 send ACK 1 giving up
---
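
The log above shows the lock-step nature of TFTP: the server sends DATA
block 1 and will not continue until it sees ACK 1. A minimal sketch of that
retry loop follows. The 5-second timeout and the six attempts are read off
the timestamps in the log, not taken from tftpd's source, and the function
name and the `deliver` callback are hypothetical.

```python
def send_block_with_retries(deliver, timeout_s=5, max_timeouts=6):
    """Simulate sending one DATA block and waiting for its ACK.

    deliver(attempt) returns True if the ACK for that attempt arrives.
    Returns (acked, elapsed_seconds, log_lines).
    """
    elapsed = 0
    log = []
    for attempt in range(max_timeouts):
        if deliver(attempt):
            return True, elapsed, log
        elapsed += timeout_s  # no ACK arrived within the timeout window
        log.append(f"t+{elapsed:02d}s Timeout #{attempt} on ACK 1")
    log.append("giving up")
    return False, elapsed, log


# A path that drops every copy of DATA block 1 (assumed for illustration).
acked, elapsed, log = send_block_with_retries(lambda attempt: False)
print(acked, elapsed)
for line in log:
    print(line)
```

With every copy of the block lost, the sketch gives up after 30 seconds,
which matches the span from 11:37:40 to 11:38:10 in the log.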

A tcpdump for the MAC of the pxe client taken on the physical interface of
the router looks like this:

---
11:37:36.843770 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:25:90:69:bf:ae, length 548
11:37:36.844639 IP 192.168.130.254.67 > 255.255.255.255.68: BOOTP/DHCP, Reply, length 357
11:37:40.853302 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:25:90:69:bf:ae, length 548
11:37:40.855024 IP 192.168.130.254.67 > 255.255.255.255.68: BOOTP/DHCP, Reply, length 357
11:37:40.855653 ARP, Request who-has 192.168.130.253 tell 192.168.130.3, length 46
11:37:40.856543 ARP, Reply 192.168.130.253 is-at 00:bd:df:ce:fa:03, length 28
11:37:40.856584 IP 192.168.130.3.2070 > 192.168.130.253.69: TFTP, length 27, RRQ "pxelinux.0" octet tsize 0
11:37:40.860701 IP 192.168.130.253.38476 > 192.168.130.3.2070: UDP, length 14
11:37:40.860737 IP 192.168.130.3.2070 > 192.168.130.253.38476: UDP, length 17
11:37:40.860908 IP 192.168.130.3.2071 > 192.168.130.253.69: TFTP, length 32, RRQ "pxelinux.0" octet blksize 1456
11:37:40.891419 IP 192.168.130.253.31448 > 192.168.130.3.2071: UDP, length 15
11:37:40.891455 IP 192.168.130.3.2071 > 192.168.130.253.31448: UDP, length 4
11:37:40.910020 IP 192.168.130.253.31448 > 192.168.130.3.2071: UDP, length 1460
11:37:40.910037 IP 192.168.130.253 > 192.168.130.3: ip-proto-17
11:37:45.910310 IP 192.168.130.253.31448 > 192.168.130.3.2071: UDP, length 1460
11:37:45.910327 IP 192.168.130.253 > 192.168.130.3: ip-proto-17
11:37:50.915422 IP 192.168.130.253.31448 > 192.168.130.3.2071: UDP, length 1460
11:37:50.915439 IP 192.168.130.253 > 192.168.130.3: ip-proto-17
11:37:55.919340 IP 192.168.130.253.31448 > 192.168.130.3.2071: UDP, length 1460
11:37:55.919359 IP 192.168.130.253 > 192.168.130.3: ip-proto-17
11:38:00.934017 IP 192.168.130.253.31448 > 192.168.130.3.2071: UDP, length 1460
11:38:00.934033 IP 192.168.130.253 > 192.168.130.3: ip-proto-17
11:38:05.943631 IP 192.168.130.253.31448 > 192.168.130.3.2071: UDP, length 1460
11:38:05.943651 IP 192.168.130.253 > 192.168.130.3: ip-proto-17
---
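
The UDP payload lengths in the capture above are consistent with a standard
TFTP option negotiation (RFC 1350 opcodes, RFC 2347 options, RFC 2348
blksize). The sketch below builds the packets by hand and compares their
sizes with the capture. The helper names are hypothetical, and the mapping
of the 15-byte, 4-byte and 1460-byte packets to OACK, ACK 0 and DATA 1 is an
inference from the lengths, not something tcpdump decoded.

```python
import struct

# Opcodes from RFC 1350 (RRQ, DATA, ACK) and RFC 2347 (OACK).
OP_RRQ, OP_DATA, OP_ACK, OP_OACK = 1, 3, 4, 6


def _pairs(options):
    """Encode option/value pairs as NUL-terminated ASCII strings."""
    out = b""
    for name, value in options.items():
        out += name.encode("ascii") + b"\0" + str(value).encode("ascii") + b"\0"
    return out


def rrq(filename, mode="octet", **options):
    """Read request, optionally carrying RFC 2347 options."""
    head = struct.pack("!H", OP_RRQ)
    head += filename.encode("ascii") + b"\0" + mode.encode("ascii") + b"\0"
    return head + _pairs(options)


def oack(**options):
    """Option acknowledgment sent by the server."""
    return struct.pack("!H", OP_OACK) + _pairs(options)


def ack(block):
    return struct.pack("!HH", OP_ACK, block)


def data(block, payload):
    return struct.pack("!HH", OP_DATA, block) + payload


print(len(rrq("pxelinux.0", tsize=0)))       # capture: TFTP, length 27
print(len(rrq("pxelinux.0", blksize=1456)))  # capture: TFTP, length 32
print(len(oack(blksize=1456)))               # capture: UDP, length 15
print(len(ack(0)))                           # capture: UDP, length 4
print(len(data(1, bytes(1456))))             # capture: UDP, length 1460
```

Read this way, the client acknowledges the negotiated block size, the server
sends the first full 1456-byte block, and that block is the packet that is
retransmitted every five seconds.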

It looks like there are tftp packets transmitted that are somehow never
picked up by the client. As 13.3 was running fine in this setup, I compared
the tcpdump output to what is happening there:

---
13:34:34.112855 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:69:bf:ae (oui Unknown), length 548
13:34:36.145073 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:69:bf:ae (oui Unknown), length 548
13:34:40.154596 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:69:bf:ae (oui Unknown), length 548
13:34:40.155930 ARP, Request who-has 192.168.130.253 tell 192.168.130.3, length 46
13:34:40.156176 ARP, Reply 192.168.130.253 is-at 00:bd:7b:2d:f7:05 (oui Unknown), length 28
13:34:40.156239 IP 192.168.130.3.2070 > 192.168.130.253.tftp:  27 RRQ "pxelinux.0" octet tsize 0
13:34:40.159338 IP 192.168.130.253.16697 > 192.168.130.3.2070: UDP, length 14
13:34:40.159406 IP 192.168.130.3.2070 > 192.168.130.253.16697: UDP, length 17
13:34:40.159574 IP 192.168.130.3.2071 > 192.168.130.253.tftp:  32 RRQ "pxelinux.0" octet blksize 1456
13:34:40.162327 IP 192.168.130.253.33393 > 192.168.130.3.2071: UDP, length 15
13:34:40.162388 IP 192.168.130.3.2071 > 192.168.130.253.33393: UDP, length 4
13:34:40.162708 IP 192.168.130.253.33393 > 192.168.130.3.2071: UDP, bad length 1460 > 1392
13:34:40.162758 IP 192.168.130.253 > 192.168.130.3: udp
13:34:40.162837 IP 192.168.130.3.2071 > 192.168.130.253.33393: UDP, length 4
13:34:40.163089 IP 192.168.130.253.33393 > 192.168.130.3.2071: UDP, bad length 1460 > 1392
13:34:40.163124 IP 192.168.130.253 > 192.168.130.3: udp
13:34:40.163670 IP 192.168.130.3.2071 > 192.168.130.253.33393: UDP, length 4
13:34:40.163920 IP 192.168.130.253.33393 > 192.168.130.3.2071: UDP, bad length 1460 > 1392
13:34:40.163956 IP 192.168.130.253 > 192.168.130.3: udp
13:34:40.164515 IP 192.168.130.3.2071 > 192.168.130.253.33393: UDP, length 4
13:34:40.164765 IP 192.168.130.253.33393 > 192.168.130.3.2071: UDP, bad length 1460 > 1392
[...]
---


Although this reports "bad length" all the time (whatever this means), it
works and transfers bootloader, initramfs, kernel etc. for diskless Linux
machines in the LAN.
But this looked suspiciously like an MTU problem. The VPN only offers an MTU
of 1425 by default, while tftp appears to use 1460-byte packets. After some
searching and reading I found that the original tftp default was 512-byte
blocks, and that the client evidently requests larger blocks for speed
reasons, explicitly, with the "blksize 1456" option. Unfortunately, I found
no way to configure the PXE firmware to use smaller packets.
However, adding the "-o" option to FreeBSD's tftpd disables all extra
options and forces both the server and the client to use smaller packets.
TFTP and PXE-booting were working fine again after that change.
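
The sizes involved can be checked with a little arithmetic. The sketch below
assumes plain IPv4 headers without options (20 bytes), an 8-byte UDP header
and the 4-byte TFTP DATA header, and it assumes that fragmentation happens
on a link with the VPN's MTU of 1425. The function names are hypothetical.

```python
IP_HEADER = 20    # IPv4 header without options (assumption)
UDP_HEADER = 8
TFTP_HEADER = 4   # opcode + block number of a DATA packet


def max_blksize(mtu):
    """Largest TFTP block size that fits into one unfragmented packet."""
    return mtu - IP_HEADER - UDP_HEADER - TFTP_HEADER


def fragment_payloads(blksize, mtu):
    """IP payload size of each fragment of one full TFTP DATA datagram."""
    remaining = UDP_HEADER + TFTP_HEADER + blksize
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    sizes = []
    while remaining > per_fragment:
        sizes.append(per_fragment)
        remaining -= per_fragment
    sizes.append(remaining)
    return sizes


frags = fragment_payloads(1456, 1425)
print(frags)                  # two fragments
print(frags[0] - UDP_HEADER)  # UDP data visible in the first fragment
print(max_blksize(1425))      # largest blksize without fragmentation
print(fragment_payloads(512, 1425))
```

Under these assumptions a 1456-byte block needs two fragments at MTU 1425,
and the first one carries 1392 bytes of UDP data, which is the figure in
tcpdump's "bad length 1460 > 1392" on 13.3. The default 512-byte block fits
into a single packet, which would explain why "-o" avoids the problem.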

On the other hand, this feels like a workaround. What is the actual
problem here, and why did the very same setup "just work" up to FreeBSD
13.3 on the router? The pf.conf is quite minimal; the packet
normalization part is just
---
set block-policy return
set optimization aggressive
scrub in all
---

Is this some kind of regression or rather the fix of a bug I was relying
upon earlier? Any hints and insight would be greatly appreciated.


cu
  Gerrit