Re: Kernel DHCP unpredictable/fails (PXE boot), userspace DHCP works just fine

From: Attila Nagy <nagy.attila_at_gmail.com>
Date: Thu, 16 Mar 2023 21:26:40 UTC
Hey,

Sure. We're talking about 30 machines, all behave the same (either bad or
good). I'm pretty sure it's not a cabling issue. :)

Yves Guérin <yvesguerin@yahoo.ca> wrote (on Thu, 16 Mar 2023, 22:14):

> Dear Attila,
>
> Maybe I will add some noise to your thread, sorry in advance. I am just a
> sysadmin, and I faced the same problem with one of my old HP G7 machines: the
> network card was malfunctioning. Sometimes it worked and sometimes it did not
> when I used PXE and dhcpd (it took too long to answer DHCP, so the
> motherboard decided to reboot, and so on in an infinite loop). The card works
> perfectly when it's set up by an OS.
>
> Maybe it's a stupid question or two: did you check the network cable? (I
> have faced defective cables and they ruined my day...) In the same way, what
> about the hub/router attached to this server (configuration, etc.)? Did you
> swap a good one with a bad one (same network cable, hub/router, etc.)?
>
> I have spent too many nights in the lab...
>
> Regards,
>
> Yves Guerin
>
>
> On Thursday, 16 March 2023 at 16:44:49 UTC−4, Attila Nagy <nagy.attila@gmail.com>
> wrote:
>
>
> Hi,
>
> As this is super annoying, I'm willing to pay a $500 bounty for solving
> this issue (whoever is first, however I don't anticipate much competition
> :) Having an invoice would be best, but I'm willing to accept individuals
> as well).
> I can't give remote access, but I can run debug builds with a serial console.
> The machines run the stable/13 branch.
>
> I have a bunch of netbooted machines, one set in a cluster is older (HP
> DL80 G9, 2x8C, Intel I350 -igb- NICs), the other set is newer (HP XL225n
> G10, AMD EPYC2x16C, BCM57412 -bnxt- NICs).
> All of these boot from the network, which is basically:
> - get IP and options with DHCP with the help of the NIC's PXE stack
> - get the loader and kernel, start it
> - do another round of DHCP from the kernel (bootp_subr.c)
> - mount the root via NFS and let everything work as usual
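
For reference, the request that kicks off each DHCP round in the steps above is small and fixed-format. Below is a rough Python sketch of a minimal DHCPDISCOVER payload following the RFC 2131 layout. It is not taken from bootp_subr.c: the function name, the example MAC and the choice to set the broadcast flag are assumptions made only for illustration.

```python
import struct

DHCP_MAGIC_COOKIE = bytes([0x63, 0x82, 0x53, 0x63])  # RFC 2131 magic cookie
DHCPDISCOVER = 1  # value of option 53 (DHCP message type) for a DISCOVER


def build_dhcpdiscover(mac: bytes, xid: int) -> bytes:
    """Build a minimal DHCPDISCOVER UDP payload (no IP/UDP headers).

    Hypothetical helper for illustration; a real client adds more options.
    """
    if len(mac) != 6:
        raise ValueError("expected a 6-byte Ethernet MAC address")
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,        # op: BOOTREQUEST
        1,        # htype: Ethernet
        6,        # hlen: MAC address length
        0,        # hops
        xid,      # transaction ID, echoed back by the server
        0,        # secs
        0x8000,   # flags: broadcast bit set (assumption for this sketch)
        bytes(4), # ciaddr: client has no address yet
        bytes(4), # yiaddr: filled in by the server
        bytes(4), # siaddr
        bytes(4), # giaddr
        mac,      # chaddr: struct pads the 6-byte MAC to 16 bytes with NULs
        b"",      # sname: 64 NUL bytes
        b"",      # file: 128 NUL bytes
    )
    # Option 53 (message type) = DISCOVER, then option 255 (end).
    options = bytes([53, 1, DHCPDISCOVER, 255])
    return header + DHCP_MAGIC_COOKIE + options


if __name__ == "__main__":
    pkt = build_dhcpdiscover(bytes.fromhex("020000000001"), 0x12345678)
    print(len(pkt), pkt[:8].hex())
```

The xid is the part worth watching when debugging: a client is expected to ignore replies whose xid does not match the request it has outstanding.
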
>
> The problem is that the newer machines take an unpredictable amount of time to boot.
> The older ones (with igb NIC) work reliably, they always boot fast.
> The process of getting an IP address via DHCP (bootpc_call from
> bootp_subr.c) either succeeds normally (in a few seconds), or takes a lot
> of time.
> Measured boot times commonly range from tens of minutes to a few hours
> (1-6).
> Sometimes it just gets stuck and can't get past bootpc_call (getting
> the DHCP lease).
>
> What I've already tried:
> - we have a redundant set of DHCP servers which offer static leases (so
> there are two DHCPOFFERs), so I tried to turn off one of them, nothing has
> changed
> - tried to disable SMP, the effect is the same
> - tried to see whether it's a network issue. The NIC's PXE stack always
> gets the lease quickly and booting FreeBSD from an ISO and issuing dhclient
> on the same interface is also fast. After the machines have booted, there
> are no network issues, they work reliably (for more than a year on 20+
> machines, so not just a few hours)
>
> This issue wasn't so bad previously (only a few mins to tens of minutes
> delay), but recently it got pretty unbearable, even making some machines
> unbootable for days...
>
> First I thought it might be packet loss (or, more exactly, a failure to
> deliver packets from the DHCP server to the receiving socket), either in the
> network or in the NIC/kernel itself, so I placed a few random printfs into
> bootp_subr.c and udp_usrreq.c.
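
A complementary check from outside the kernel is to decode the DHCP payloads seen on the wire (for example, pulled out of a packet capture) and compare them with what the printfs report. The following is a minimal Python sketch of such a decoder, based only on the RFC 2131/2132 layout; the function name and the returned dictionary keys are hypothetical, and it handles only the fields relevant here.

```python
DHCP_MAGIC_COOKIE = bytes([0x63, 0x82, 0x53, 0x63])
MESSAGE_TYPES = {
    1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
    5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM",
}


def dotted(addr: bytes) -> str:
    """Render 4 raw bytes as a dotted-quad IPv4 address."""
    return ".".join(str(b) for b in addr)


def parse_dhcp_reply(payload: bytes) -> dict:
    """Decode xid, offered address, message type and server ID from a DHCP payload."""
    if len(payload) < 240 or payload[236:240] != DHCP_MAGIC_COOKIE:
        raise ValueError("not a DHCP payload")
    options = {}
    i = 240
    while i < len(payload):
        code = payload[i]
        if code == 255:          # end option
            break
        if code == 0:            # pad option, single byte
            i += 1
            continue
        if i + 1 >= len(payload):
            break                # truncated option, stop parsing
        length = payload[i + 1]
        options[code] = payload[i + 2:i + 2 + length]
        i += 2 + length
    msg_type = options.get(53)   # option 53: DHCP message type
    server_id = options.get(54)  # option 54: server identifier
    return {
        "xid": int.from_bytes(payload[4:8], "big"),
        "yiaddr": dotted(payload[16:20]),
        "type": MESSAGE_TYPES.get(msg_type[0]) if msg_type else None,
        "server_id": dotted(server_id) if server_id and len(server_id) == 4 else None,
    }
```

With two DHCP servers answering, printing xid, type and server_id for every reply makes it easier to see whether a reply that the kernel appears to drop actually carried the transaction ID the kernel was waiting for.
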
>
> After spending some time trying to understand the problem, it feels like a
> race condition in bootpc_call, but I don't know the code well enough to
> effectively verify that.
>
> Here are the modified bootp_subr.c and udp_usrreq.c:
>
> https://gist.githubusercontent.com/bra-fsn/128ae9a3bbc0dbdbb2f6f4b3e2c5157a/raw/a8ade8af252f618c84a46da2452d557ebc5078ac/bootp_subr.c
>
> https://gist.github.com/bra-fsn/128ae9a3bbc0dbdbb2f6f4b3e2c5157a/raw/a8ade8af252f618c84a46da2452d557ebc5078ac/udp_usrreq.c
> (modified from stable/13 branch from a few weeks earlier)
>
> This is the output with the always working DL80 (igb) machine:
>
> https://gist.github.com/bra-fsn/128ae9a3bbc0dbdbb2f6f4b3e2c5157a/raw/a8ade8af252f618c84a46da2452d557ebc5078ac/DL80%2520igb%2520good.txt
>
> This is the console output from a working boot for the XL225n (bnxt)
> machine:
>
> https://gist.github.com/bra-fsn/128ae9a3bbc0dbdbb2f6f4b3e2c5157a/raw/a8ade8af252f618c84a46da2452d557ebc5078ac/XL225n%2520bnxt%2520good.txt
> as you can see, it's much slower than the DL80 (which also isn't that
> fast...)
>
> And this one is a longer output, without success to that point (2 minutes
> without completing the DHCP flow):
>
> https://gist.github.com/bra-fsn/128ae9a3bbc0dbdbb2f6f4b3e2c5157a/raw/a8ade8af252f618c84a46da2452d557ebc5078ac/XL225n%2520bnxt%2520long.txt
>
> For the latter, here's an excerpt from the DHCP log:
>
> https://gist.githubusercontent.com/bra-fsn/128ae9a3bbc0dbdbb2f6f4b3e2c5157a/raw/a8ade8af252f618c84a46da2452d557ebc5078ac/dhcp_log.txt
>
> It seems the DHCP state always gets reset to IF_DHCP_UNRESOLVED even when
> there are answers from the DHCP server.
>
> Here's another, longer console log, which succeeded after spending 236
> seconds in the loop:
>
> https://gist.github.com/bra-fsn/128ae9a3bbc0dbdbb2f6f4b3e2c5157a/raw/a77f52f5e83c699b38a7c2d3acdc52d26ceeba71/XL225n%2520bnxt%2520long%2520good.txt
>
> Any ideas about this?
>
>