GEOM probes fail on aac with EARLY_AP_STARTUP

john hood cgull at glup.org
Thu Sep 7 23:19:50 UTC 2017


I've got a devel machine here which was failing to boot on our vendored
FreeBSD 11.1, because GEOM was unable to find the partitions on the boot
drive and so the root mount failed.  This started happening on many but
not all boots after I upgraded the machine from 9.3.

The machine is an Intel S5520UR motherboard with 2x Xeon E5620 CPUs
(Hyperthreading enabled, so hw.ncpu=16) and an Adaptec 5805, with 2 RAID
volumes configured on 6 SATA drives.

When booting, the kernel sees the aac0 controller and the aacd0 volume,
but GEOM does not find any of the partitions on that volume, and the
initial mount of root on /dev/aacd0p2 fails.  aacd0 is available and
readable, but the expected aacd0p{1,2,3} devices do not exist.
(However, aacd1 and its partitions/devices are configured normally.)

I think it's a race condition between the aac driver and GEOM probing,
probably newly triggered/exposed by EARLY_AP_STARTUP.  I've reproduced
the problem on upstream FreeBSD 11.1 and -current.  Disabling
EARLY_AP_STARTUP, or setting kern.smp.disabled=1, causes the kernel to
start correctly. 'boot -v' also causes the kernel to start correctly.

The kernel calls aac_attach(), which uses
config_intrhook_establish() to run aac_startup() later.  When that
runs, it adds devices via
aac_add_container()/device_add_child()/bus_generic_attach().

However, at the beginning of aac_attach(), an AAC_STATE_SUSPEND flag
is set.  It is cleared at the end of aac_startup().  It appears that
GEOM probes call aac_disk_open(), which checks the flag and returns
an error if it is set.  On my system the race is that the GEOM probes
happen before that flag is cleared, possibly because GEOM is tasting
aacd0 while the aac driver is still attaching aacd1.  So the GEOM probes
fail and the geom nodes never get created.  If I boot with the -v flag,
the kernel boots successfully, I think because the message printing
takes long enough to delay GEOM probing past aac_startup() completion.
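
To make the ordering problem concrete, here is a minimal user-space
sketch of it.  It is not driver code: struct model_softc and every
model_* function are hypothetical stand-ins, the flag value is
illustrative, ENXIO is only an assumed error code, and the taste is
placed at the worst possible moment rather than actually raced.

```c
#include <errno.h>

/* Illustrative value only; the real definition lives in the aac driver. */
#define AAC_STATE_SUSPEND 0x01

/* Hypothetical stand-in for the parts of struct aac_softc that matter here. */
struct model_softc {
	int aac_state;        /* controller state flags */
	int partitions_found; /* set once a taste succeeds */
};

/*
 * Models the check described for aac_disk_open(): refuse to open the
 * disk while the controller is still marked suspended.
 * ENXIO is an assumed error code, not taken from the driver.
 */
static int
model_disk_open(const struct model_softc *sc)
{
	if (sc->aac_state & AAC_STATE_SUSPEND)
		return (ENXIO);
	return (0);
}

/* Models a single GEOM taste; as observed, a failed taste is not retried. */
static void
model_geom_taste(struct model_softc *sc)
{
	if (model_disk_open(sc) == 0)
		sc->partitions_found = 1;
}

/* Original ordering: children attach (and can be tasted) before the flag clears. */
static int
model_startup_original(void)
{
	struct model_softc sc = { AAC_STATE_SUSPEND, 0 };

	model_geom_taste(&sc);              /* worst case: the taste wins the race */
	sc.aac_state &= ~AAC_STATE_SUSPEND; /* too late for that taste */
	return (sc.partitions_found);
}

/* Patched ordering: clear the flag first, then let the children attach. */
static int
model_startup_patched(void)
{
	struct model_softc sc = { AAC_STATE_SUSPEND, 0 };

	sc.aac_state &= ~AAC_STATE_SUSPEND;
	model_geom_taste(&sc);              /* the open now succeeds */
	return (sc.partitions_found);
}
```

In this model, model_startup_original() returns 0 (no partitions found)
and model_startup_patched() returns 1.  The attached patch makes the same
reordering in aac_startup().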

I've attached a patch which resolves the problem on FreeBSD-current (and
11.1).  Would anybody care to adopt it and shepherd it into SVN?

regards,

  --John Hood

-------------- next part --------------
Only in sys/amd64/compile: AACPROBE
Only in sys/amd64/conf: AACPROBE
Only in sys/amd64/conf: AACPROBE~
diff -u -r sys.orig/dev/aac/aac.c sys/dev/aac/aac.c
--- sys.orig/dev/aac/aac.c	2017-09-05 09:06:26.000000000 -0400
+++ sys/dev/aac/aac.c	2017-09-07 14:27:32.461528000 -0400
@@ -418,9 +418,6 @@
 	sc = (struct aac_softc *)arg;
 	fwprintf(sc, HBA_FLAGS_DBG_FUNCTION_ENTRY_B, "");
 
-	/* disconnect ourselves from the intrhook chain */
-	config_intrhook_disestablish(&sc->aac_ich);
-
 	mtx_lock(&sc->aac_io_lock);
 	aac_alloc_sync_fib(sc, &fib);
 
@@ -437,12 +434,15 @@
 	aac_release_sync_fib(sc);
 	mtx_unlock(&sc->aac_io_lock);
 
+	/* mark the controller up */
+	sc->aac_state &= ~AAC_STATE_SUSPEND;
+
 	/* poke the bus to actually attach the child devices */
 	if (bus_generic_attach(sc->aac_dev))
 		device_printf(sc->aac_dev, "bus_generic_attach failed\n");
 
-	/* mark the controller up */
-	sc->aac_state &= ~AAC_STATE_SUSPEND;
+	/* disconnect ourselves from the intrhook chain */
+	config_intrhook_disestablish(&sc->aac_ich);
 
 	/* enable interrupts now */
 	AAC_UNMASK_INTERRUPTS(sc);
Only in sys/dev/aac: aac.c.orig
Only in sys/dev/aac: aac.c~
Only in sys/dev/aac: aac_disk.c.orig
Only in sys/dev/aac: aac_disk.c~
Only in sys/geom: geom_disk.c.orig
Only in sys/geom: geom_disk.c~

