[Bug 279951] dhclient unable to reuse recorded lease after timeout, since 12.1

From: <bugzilla-noreply_at_freebsd.org>
Date: Mon, 24 Jun 2024 14:16:44 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=279951

            Bug ID: 279951
           Summary: dhclient unable to reuse recorded lease after timeout,
                    since 12.1
           Product: Base System
           Version: 15.0-CURRENT
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Many People
          Priority: ---
         Component: bin
          Assignee: bugs@FreeBSD.org
          Reporter: viktor.stujber+freebsd-bugs_v4CCPfay@gmail.com

At boot time, dhclient is supposed to fall back to /var/db/dhclient.leases when
it reaches 'timeout' (60s) without getting a DHCPOFFER response.

For 3 years I've been observing that my freebsd router no longer started its
services properly after a power outage, when previously it all worked fine and
there were no relevant configuration changes, aside from gradual OS upgrades.
The core reason was that dhclient's behavior changed in 12.1-RELEASE from "No
DHCPOFFERS received. Trying recorded lease X. bound: renewal in Y seconds." to
"No working leases in persistent database - sleeping." and leaving the WAN
interface with no ip address.
The supporting reason is that this only happens when the ISP's network is slow
to recover, and when the server has no battery backup. If there is no reboot,
the already running dhclient most likely just retains the active lease and ip
binding.

I have reproduced this issue in a VM by letting it boot once to get the lease,
then firewalling off the VM on the hypervisor level and rebooting (cannot
simply disconnect link; cannot use on-host pf because it starts too late). I
have tested 15.0-CURRENT-20240620, 14.0-RELEASE, 12.4-RELEASE, 12.1-RELEASE and
all trivially reproduce it. 12.0-RELEASE doesn't. 

Via tedious buildworld bisection, I have identified the cause as b6c2f6eb (base
r344237) from 2019-02-17. In main branch these were 95f237c2 (base r343896)
followed by 3b08e0fc (base r343922) from 2019-02-08.
https://cgit.freebsd.org/src/commit/sbin/dhclient/dhclient.c?id=b6c2f6eb29f4ca183a8877ccedd97ee98d583df3
https://cgit.freebsd.org/src/commit/sbin/dhclient/dhclient.c?id=95f237c2f65130b6567e69df06c393586e3969a3
https://cgit.freebsd.org/src/commit/sbin/dhclient/dhclient.c?id=3b08e0fcf357c1a905c5e59731930528fb94a0b1

MFC r343896,r343922: dhclient: Pass through exit status from script
@@ -2353,2 +2353,3 @@ priv_script_go(void)
-       return (wstatus & 0xff);
+       return (WIFEXITED(wstatus) ?
+           WEXITSTATUS(wstatus) : 128 + WTERMSIG(wstatus));
 }

/usr/include/sys/wait.h:#define>_WSTATUS(x) (_W_INT(x) & 0177)
/usr/include/sys/wait.h:#define WTERMSIG(x)     (_WSTATUS(x))
/usr/include/sys/wait.h:#define WIFEXITED(x)    (_WSTATUS(x) == 0)
/usr/include/sys/wait.h:#define WEXITSTATUS(x)  (_W_INT(x) >> 8)

A bunch of internal dhclient logic isn't handled directly by C code; instead,
the code sets up parameters for a 'script' and then runs it via script_go(). I
found 12 such places in dhclient.c. At least 5 places treat the return value as
a boolean 0=success, !0=error. Crucially, the lease reuse logic in
dhclient.c::state_panic() does this, invoking dhclient-script 'TIMEOUT' (add
new address) operation.

            note("Trying recorded lease %s",
piaddr(ip->client->active->address));
            /* Run the client script with the existing parameters. */
            script_init("TIMEOUT", ip->client->active->medium);
            script_write_params("new_", ip->client->active);
            if (ip->client->alias)
                script_write_params("alias_", ip->client->alias);

            /* If the old lease is still good and doesn't
               yet need renewal, go into BOUND state and
               timeout at the renewal time. */
            if (!script_go()) {
                if (cur_time < ip->client->active->renewal) {
                    ip->client->state = S_BOUND;
                    note("bound: renewal in %d seconds.",

Adding debug logging code reveals that wstatus is 256, meaning that originally
the return value was 0, but after the change it became 1. This is because
instead of just returning the wait() outcome of the forked subprocess (exited /
signaled / coredumped), it now instead provides the script's programmatic exit
code. And the way dhclient-script's TIMEOUT is structured, it exits with 0 in
some specific cases, and 1 in the default path.

The abovementioned change thus shows a fundamental lack of understanding how
dhclient's code operates, resulting in a 'return type confusion' programming
mistake. The immediate fix would be to review each call to script_go() that
checks the result, and ensure that the check aligns with the exit codes in
dhclient-script.

Curiously, one of the commits notes that it's identical to openbsd's 1.117 from
2008 ( https://cvsweb.openbsd.org/src/sbin/dhclient/dhclient.c ). The code
structure is nearly identical, so I presume that change also introduced this
same issue to openbsd. From the revision log I see that it seems to also have
went unnoticed, for 4 years, until 1.159 'nuked' the entire script integration
framework and thus unintentionally fixed it.

-- 
You are receiving this mail because:
You are the assignee for the bug.