Re: Kernel DHCP unpredictable/fails (PXE boot), userspace DHCP works just fine
- In reply to: Yves Guérin: "Re: Kernel DHCP unpredictable/fails (PXE boot), userspace DHCP works just fine"
Date: Thu, 16 Mar 2023 21:26:40 UTC
Hey,

Sure. We're talking about 30 machines, and they all behave the same (either bad or good). I'm pretty sure it's not a cabling issue. :)

Yves Guérin <yvesguerin@yahoo.ca> wrote (on Thu, 16 Mar 2023, 22:14):

> Dear Attila,
>
> Maybe I will add some noise to your thread, sorry in advance. I am just a
> sysadmin, and I faced the same problem with one of my old HP G7s: the
> network card was broken (malfunctioning). Sometimes it worked and sometimes
> not when I used PXE and dhcpd (it took too much time to answer the DHCP
> server, so the motherboard decided to reboot, etc., in an infinite loop).
> The card works perfectly when it's set up by an OS.
>
> Maybe it's a stupid question or two: did you check the network cable? (I
> have faced some defective cables and it ruined my day...) In the same way,
> what about the hub/router attached to this server (configuration, etc.)?
> Did you swap a good one for a bad one? (Same network cable, hub/router,
> etc.)
>
> I spend too many nights in the lab...
>
> Regards,
>
> Yves Guerin
>
>
> On Thursday, 16 March 2023 at 16:44:49 UTC−4, Attila Nagy
> <nagy.attila@gmail.com> wrote:
>
>
> Hi,
>
> As this is super annoying, I'm willing to pay a $500 bounty for solving
> this issue (whoever is first, however I don't anticipate a big competition
> :) Having an invoice would be best, but I'm willing to accept individuals
> as well).
> I can't give remote access, but can run debug builds with serial console.
> stable/13 branch.
>
> I have a bunch of netbooted machines. One set in a cluster is older (HP
> DL80 G9, 2x8C, Intel I350 -igb- NICs), the other set is newer (HP XL225n
> G10, AMD EPYC, 2x16C, BCM57412 -bnxt- NICs).
> All of these boot from the network, which is basically:
> - get IP and options with DHCP with the help of the NIC's PXE stack
> - get the loader and kernel, start it
> - do another round of DHCP from the kernel (bootp_subr.c)
> - mount the root via NFS and let everything work as usual
>
> The problem is that the newer machines take an unpredictable amount of
> time to boot.
> The older ones (with igb NIC) work reliably, they always boot fast.
> The process of getting an IP address via DHCP (bootpc_call from
> bootp_subr.c) either succeeds normally (in a few seconds), or takes a lot
> of time.
> Common (measured) times to boot range from tens of minutes to several
> hours (1-6).
> Sometimes it just gets stuck and can't get past bootpc_call (getting
> the DHCP lease).
>
> What I've already tried:
> - we have a redundant set of DHCP servers which offer static leases (so
> there are two DHCPOFFERs), so I tried to turn off one of them, nothing has
> changed
> - tried to disable SMP, the effect is the same
> - tried to see whether it's a network issue. The NIC's PXE stack always
> gets the lease quickly, and booting FreeBSD from an ISO and issuing dhclient
> on the same interface is also fast. After the machines have booted, there
> are no network issues, they work reliably (for more than a year on 20+
> machines, so not just a few hours)
>
> This issue wasn't so bad previously (only a few minutes to tens of minutes
> of delay), but recently it got pretty unbearable, even making some machines
> unbootable for days...
>
> First I thought it might be packet loss (or more exactly, a problem with
> packet delivery from the DHCP server to the receiving socket), either in
> the network or in the NIC/kernel itself, so I placed a few random printfs
> into bootp_subr.c and udp_usrreq.c.
>
> After spending some time trying to understand the problem, it feels like a
> race condition in bootpc_call, but I don't know the code well enough to
> effectively verify that.
>
> Here are the modified bootp_subr.c and udp_usrreq.c:
> https://gist.githubusercontent.com/bra-fsn/128ae9a3bbc0dbdbb2f6f4b3e2c5157a/raw/a8ade8af252f618c84a46da2452d557ebc5078ac/bootp_subr.c
> https://gist.github.com/bra-fsn/128ae9a3bbc0dbdbb2f6f4b3e2c5157a/raw/a8ade8af252f618c84a46da2452d557ebc5078ac/udp_usrreq.c
> (modified from the stable/13 branch from a few weeks earlier)
>
> This is the output with the always working DL80 (igb) machine:
> https://gist.github.com/bra-fsn/128ae9a3bbc0dbdbb2f6f4b3e2c5157a/raw/a8ade8af252f618c84a46da2452d557ebc5078ac/DL80%2520igb%2520good.txt
>
> This is the console output from a working boot for the XL225n (bnxt)
> machine:
> https://gist.github.com/bra-fsn/128ae9a3bbc0dbdbb2f6f4b3e2c5157a/raw/a8ade8af252f618c84a46da2452d557ebc5078ac/XL225n%2520bnxt%2520good.txt
> As you can see, it's much slower than the DL80 (which also isn't that
> fast...)
>
> And this one is a longer output, without success to that point (2 minutes
> without completing the DHCP flow):
> https://gist.github.com/bra-fsn/128ae9a3bbc0dbdbb2f6f4b3e2c5157a/raw/a8ade8af252f618c84a46da2452d557ebc5078ac/XL225n%2520bnxt%2520long.txt
>
> For the latter, here's an excerpt from the DHCP log:
> https://gist.githubusercontent.com/bra-fsn/128ae9a3bbc0dbdbb2f6f4b3e2c5157a/raw/a8ade8af252f618c84a46da2452d557ebc5078ac/dhcp_log.txt
>
> It seems the DHCP state always gets reset to IF_DHCP_UNRESOLVED even when
> there are answers from the DHCP server.
>
> Here's another, longer console log, which succeeded after spending 236
> seconds in the loop:
> https://gist.github.com/bra-fsn/128ae9a3bbc0dbdbb2f6f4b3e2c5157a/raw/a77f52f5e83c699b38a7c2d3acdc52d26ceeba71/XL225n%2520bnxt%2520long%2520good.txt
>
> Any ideas about this?
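
When reading the console logs linked above, a toy model of the client side of the DHCP handshake may help. The sketch below is not the code in bootp_subr.c. The state names only loosely mirror IF_DHCP_UNRESOLVED, and the transitions (in particular the fallback to the unresolved state on a timeout or NAK) are assumptions based on the standard DISCOVER/OFFER/REQUEST/ACK exchange. `State`, `step` and `count_resets` are hypothetical names introduced for illustration.

```python
from enum import Enum


class State(Enum):
    UNRESOLVED = "unresolved"  # loosely analogous to IF_DHCP_UNRESOLVED
    OFFERED = "offered"        # an OFFER was accepted, REQUEST is outstanding
    RESOLVED = "resolved"      # an ACK arrived, lease obtained


def step(state, event):
    """Return the next state for one event.

    event is "OFFER", "ACK", "NAK" or "TIMEOUT". The fallback rules are an
    assumption for illustration, not taken from the kernel source.
    """
    if state is State.UNRESOLVED and event == "OFFER":
        return State.OFFERED
    if state is State.OFFERED and event == "ACK":
        return State.RESOLVED
    if state is State.OFFERED and event in ("NAK", "TIMEOUT"):
        return State.UNRESOLVED  # the handshake starts over
    return state  # anything else (e.g. a duplicate OFFER) is ignored


def count_resets(events):
    """Replay a list of events; return (final_state, number_of_fallbacks)."""
    state, resets = State.UNRESOLVED, 0
    for event in events:
        new_state = step(state, event)
        if state is State.OFFERED and new_state is State.UNRESOLVED:
            resets += 1
        state = new_state
    return state, resets
```

In this model a second OFFER from a redundant server is simply ignored, so two DHCPOFFERs alone do not cause a fallback. A boot in which the state keeps dropping back despite server answers would show up as a nonzero fallback count when the events from the log are replayed.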