14-STABLE crashes when using geli from automountd executable maps

From: Andre Albsmeier <Andre.Albsmeier_at_siemens.com>
Date: Sun, 05 May 2024 17:10:58 UTC
Before opening a PR, let's see if I'm doing something bad here (shouldn't
be that bad as it worked on FreeBSD-12 :-)).

I can reliably crash 14-STABLE by running "geli attach" from an automountd
executable map if 

a) there is no other geli device in use
b) the mount is done r/w.

To reproduce:

0. Prerequisite:
----------------

We assume there is a working autofs environment and /etc/auto_master is the
master map. There must be no geli device in use.
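
The "no geli device in use" condition can be checked mechanically. This is a minimal sketch, assuming attached geli providers show up as *.eli nodes under /dev (as md0.eli does in the steps below). count_eli_nodes is a hypothetical helper, and its directory argument exists only so it can be tried against a scratch directory:

```sh
# count_eli_nodes [DIR]: print how many *.eli nodes exist in DIR (default: /dev).
# Assumption: every attached geli provider appears as a <name>.eli node there.
count_eli_nodes() {
    dir=${1:-/dev}
    n=0
    for node in "$dir"/*.eli; do
        # An unmatched glob stays literal, so test that the node really exists.
        [ -e "$node" ] && n=$((n + 1))
    done
    echo "$n"
}
```

count_eli_nodes should print 0 before step 2 is started. "geli status" shows the same information in more detail.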


1. Do this once:
----------------

cat << EOF > /etc/autocrash
#!/bin/sh

case "x\$1" in

  xno_geli)
    exec echo "-fstype=ufs,noatime,async :/dev/md0.eli"
    ;;

  xgeli_ro)
    geli attach -k /bin/ls -p /dev/md0
    exec echo "-fstype=ufs,noatime,async,ro :/dev/md0.eli"
    ;;

  xgeli_rw)
    geli attach -k /bin/ls -p /dev/md0
    exec echo "-fstype=ufs,noatime,async :/dev/md0.eli"
    ;;

esac
EOF

chmod 755 /etc/autocrash

echo "/autocrash autocrash" >> /etc/auto_master

automount
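
Since automountd runs an executable map with the key as its first argument and uses whatever the map prints as the entry, the map can also be exercised by hand before autofs gets involved. A minimal sketch with a hypothetical helper check_map (not part of autofs):

```sh
# check_map MAP KEY: run an executable map by hand and print the entry
# automountd would receive for KEY. Fails if the map fails or prints nothing.
check_map() {
    map=$1 key=$2
    entry=$("$map" "$key") || return 1
    if [ -z "$entry" ]; then
        echo "no entry for key '$key'" >&2
        return 1
    fi
    printf '%s\n' "$entry"
}
```

"check_map /etc/autocrash no_geli" only prints the entry. Note that the geli_ro and geli_rw keys really execute "geli attach" when the map is run this way.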


2. Do this each time you want to crash the box:
-----------------------------------------------

kldload geom_eli.ko

dd if=/dev/zero of=/tmp/testcrash bs=64k count=160
mdconfig -a -t vnode -f /tmp/testcrash
geli init -P -K /bin/ls /dev/md0
geli attach -k /bin/ls -p /dev/md0
newfs /dev/md0.eli

# this works w/o crashing (as md0.eli is still attached)
cd /autocrash/no_geli ; ls -la
cd / && umount /autocrash/no_geli
geli detach /dev/md0

# crash it:
cd /autocrash/geli_rw
ls -la

# this also works w/o crashing (as we mount it readonly)
cd /autocrash/geli_ro ; ls -la
cd / && umount /autocrash/geli_ro
geli detach /dev/md0



For some reason, the box crashes when "geli attach" is executed from within
the /etc/autocrash script. It does not crash when we attach it beforehand and
do only the mount from /etc/autocrash. It also does not crash if there is at
least one more geli device in use.

Debugging the crash gives us this:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x218
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff8071a20a
stack pointer           = 0x28:0xfffffe00743d5da0
frame pointer           = 0x28:0xfffffe00743d5dd0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1246 (g_eli[1] md0)
rdi: 0000000000000000 rsi: 0000000000000200 rdx: 0000000000000001
rcx: 0000000000000080  r8: 0000000000000001  r9: 0000000000010000
rax: fffff80006da5000 rbx: fffff800063904b0 rbp: fffffe00743d5dd0
r10: 0000000000000001 r11: 0000000000010000 r12: 0000000000000200
r13: 0000000000000008 r14: 0000000000000000 r15: 0000000000000000
trap number             = 12
panic: page fault
cpuid = 1
time = 1714666291
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00743d5b80
vpanic() at vpanic+0xfa/frame 0xfffffe00743d5bb0
panic() at panic+0x43/frame 0xfffffe00743d5c10
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00743d5c70
trap_pfault() at trap_pfault+0xab/frame 0xfffffe00743d5cd0
calltrap() at calltrap+0x8/frame 0xfffffe00743d5cd0
--- trap 0xc, rip = 0xffffffff8071a20a, rsp = 0xfffffe00743d5da0, rbp = 0xfffffe00743d5dd0 ---
uma_zalloc_arg() at uma_zalloc_arg+0x3a/frame 0xfffffe00743d5dd0
g_eli_alloc_data() at g_eli_alloc_data+0x49/frame 0xfffffe00743d5df0
g_eli_crypto_run() at g_eli_crypto_run+0x97/frame 0xfffffe00743d5e90
g_eli_worker() at g_eli_worker+0x369/frame 0xfffffe00743d5ef0
fork_exit() at fork_exit+0x82/frame 0xfffffe00743d5f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00743d5f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 31s
Dumping 212 out of 3438 MB:..8%..16%..23%..31%..46%..53%..61%..76%..83%..91
__curthread () at /src/src-14/sys/amd64/include/pcpu_aux.h:57
57              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) where
#0  __curthread () at /src/src-14/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=textdump@entry=1) at /src/src-14/sys/kern/kern_shutdown.c:405
#2  0xffffffff804f4e40 in kern_reboot (howto=260) at /src/src-14/sys/kern/kern_shutdown.c:523
#3  0xffffffff804f5307 in vpanic (fmt=0xffffffff8081968d "%s", ap=ap@entry=0xfffffe00743d5bf0) at /src/src-14/sys/kern/kern_shutdown.c:967
#4  0xffffffff804f50e3 in panic (fmt=<unavailable>) at /src/src-14/sys/kern/kern_shutdown.c:891
#5  0xffffffff807b0d7c in trap_fatal (frame=0xfffffe00743d5ce0, eva=536) at /src/src-14/sys/amd64/amd64/trap.c:952
#6  0xffffffff807b0e2b in trap_pfault (frame=0xfffffe00743d5ce0, usermode=false, signo=<optimized out>, ucode=0x0)
    at /src/src-14/sys/amd64/amd64/trap.c:760
#7  <signal handler called>
#8  uma_zalloc_arg (zone=0x0, udata=udata@entry=0x0, flags=1) at /src/src-14/sys/vm/uma_core.c:3738
#9  0xffffffff8045a7b9 in uma_zalloc (zone=0x0, zone@entry=0xfffff800063904b0, flags=1) at /src/src-14/sys/vm/uma.h:367
#10 g_eli_alloc_data (bp=bp@entry=0xfffff800063904b0, sz=4096) at /src/src-14/sys/geom/eli/g_eli.c:958
#11 0xffffffff80463c27 in g_eli_crypto_run (wr=wr@entry=0xfffff800068c4880, bp=bp@entry=0xfffff800063904b0)
    at /src/src-14/sys/geom/eli/g_eli_privacy.c:282
#12 0xffffffff8045c6a9 in g_eli_worker (arg=arg@entry=0xfffff800068c4880) at /src/src-14/sys/geom/eli/g_eli.c:752
#13 0xffffffff804b6b02 in fork_exit (callout=0xffffffff8045c340 <g_eli_worker>, arg=0xfffff800068c4880, frame=0xfffffe00743d5f40)
    at /src/src-14/sys/kern/kern_fork.c:1159
#14 <signal handler called>
(kgdb) up 10
#10 g_eli_alloc_data (bp=bp@entry=0xfffff800063904b0, sz=4096) at /src/src-14/sys/geom/eli/g_eli.c:958
958                     bp->bio_driver2 = uma_zalloc(g_eli_uma, M_NOWAIT |
(kgdb) p g_eli_uma
$1 = (uma_zone_t) 0x0
(kgdb)

g_eli_uma being NULL is probably wrong. Probably it got freed when it
shouldn't have been, or some initialisation is missing. But why is it NULL
when called from /etc/autocrash and not when run manually or if another geli
device already exists?
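
A small cross-check on the trap output: the fault virtual address and the eva argument of trap_fatal() in frame #5 are the same number, and with zone=0x0 in frame #8 that address is simply an offset from a NULL pointer. The arithmetic can be sketched as below. That 0x218 is the offset of the first field uma_zalloc_arg() reads from the zone is an assumption, not something the dump shows, and both helper names are hypothetical:

```sh
# hex_to_dec HEX: print a hexadecimal constant (with 0x prefix) in decimal.
hex_to_dec() {
    printf '%d\n' "$1"
}

# fault_offset BASE ADDR: print ADDR - BASE in hex, i.e. the offset of the
# faulting access relative to the pointer that was dereferenced.
fault_offset() {
    printf '0x%x\n' $(( $2 - $1 ))
}
```

"hex_to_dec 0x218" prints 536, matching eva=536, and "fault_offset 0x0 0x218" prints 0x218. This fits a read through the NULL g_eli_uma pointer and not a corrupted zone.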