For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot

From: Mark Millard <marklmi_at_yahoo.com>
Date: Mon, 26 Jun 2023 00:16:09 UTC
Using the likes of:

FreeBSD-14.0-CURRENT-arm64-aarch64-ROCK64-20230622-b95d2237af40-263748.img
and:
FreeBSD-14.0-CURRENT-arm-armv7-GENERICSD-20230622-b95d2237af40-263748.img

I have reproduced the following behavior after setting up storage
media based on them. (This was a check that the issue is not
specific to my own builds.)

Boot the aarch64 media and log in. (Note: I logged in
as root.)

Mount the armv7 media (-onoatime is just my habit)
and then put it to use:

# mount -onoatime /dev/da1s2a /mnt

# chroot /mnt/

# kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin
sys/kern/kern_copyin:kern_copyin  ->  

On the serial console:

# ps -xu
USER  PID   %CPU %MEM   VSZ  RSS TT  STAT STARTED      TIME COMMAND
root   11 1498.4  0.0     0  256  -  RNL  23:24   542:52.92 [idle]
root 1174  100.0  0.0     0   16  -  Rs   23:37     0:00.00 /usr/tests/sys/kern/kern_copyin -vunprivileged-user=tests -r/tmp/kyua.9YUttj/2/result.atf kern_copyin
root    0    0.0  0.0     0 1616  -  DLs  23:24     0:00.50 [kernel]
root    1    0.0  0.0 11704 1288  -  ILs  23:24     0:00.02 /sbin/init
root    2    0.0  0.0     0  256  -  WL   23:24     0:00.26 [clock]
root    3    0.0  0.0     0  272  -  DL   23:24     0:00.00 [crypto]
root    4    0.0  0.0     0   80  -  DL   23:24     0:00.95 [cam]
root    5    0.0  0.0     0   16  -  DL   23:24     0:00.00 [busdma]
root    6    0.0  0.0     0   16  -  DL   23:24     0:00.03 [rand_harvestq]
root    7    0.0  0.0     0   48  -  DL   23:24     0:00.06 [pagedaemon]
root    8    0.0  0.0     0   16  -  DL   23:24     0:00.00 [vmdaemon]
root    9    0.0  0.0     0  160  -  DL   23:24     0:00.38 [bufdaemon]
root   10    0.0  0.0     0   16  -  DL   23:24     0:00.00 [audit]
root   12    0.0  0.0     0  880  -  WL   23:24     0:11.81 [intr]
root   13    0.0  0.0     0   48  -  DL   23:24     0:00.04 [geom]
root   14    0.0  0.0     0   16  -  DL   23:24     0:00.00 [sequencer 00]
root   15    0.0  0.0     0  160  -  DL   23:24     0:06.42 [usb]
root   16    0.0  0.0     0   16  -  DL   23:24     0:00.10 [acpi_thermal]
root   17    0.0  0.0     0   16  -  DL   23:24     0:00.00 [acpi_cooling0]
root   18    0.0  0.0     0   16  -  DL   23:24     0:00.04 [syncer]
root   19    0.0  0.0     0   16  -  DL   23:24     0:00.00 [vnlru]
root  671    0.0  0.0 13260 2600  -  Is   23:25     0:00.00 dhclient: system.syslog (dhclient)
root  674    0.0  0.0 13260 2752  -  Is   23:25     0:00.00 dhclient: dpni0 [priv] (dhclient)
root  761    0.0  0.0 14572 3972  -  Ss   23:25     0:00.02 /sbin/devd
root  964    0.0  0.0 12832 2764  -  Is   23:25     0:00.02 /usr/sbin/syslogd -s
root 1033    0.0  0.0 13012 2604  -  Ss   23:25     0:00.01 /usr/sbin/cron -s
root 1058    0.0  0.0 21052 8308  -  Is   23:25     0:00.01 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups (sshd)
root 1078    0.0  0.0 21288 9304  -  Is   23:26     0:00.09 sshd: root@pts/0 (sshd)
root 1175    0.0  0.0 21288 9496  -  Is   23:37     0:00.04 sshd: root@pts/1 (sshd)
root 1074    0.0  0.0 13380 3008 u0  Is   23:25     0:00.01 login [pam] (login)
root 1075    0.0  0.0 13460 3292 u0  S    23:25     0:00.02 -sh (sh)
root 1233    0.0  0.0 13588 3016 u0  R+   00:00     0:00.00 ps -xu
root 1081    0.0  0.0 13460 3328  0  Is   23:26     0:00.02 -sh (sh)
root 1170    0.0  0.0  5788 2884  0  I    23:36     0:00.02 /bin/sh -i
root 1172    0.0  0.0 10408 7192  0  I+   23:37     0:00.30 kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin
root 1178    0.0  0.0 13460 3320  1  Is+  23:38     0:00.01 -sh (sh)

Process 1174 stays stuck, even if one waits for 30+ minutes.
Neither kill nor kill -9 terminates 1174.
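
As a rough sketch of the kill check described above: the helper
name, its arguments and the delay are illustrative assumptions,
not what was actually typed during the test.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original report): send SIGTERM, then
# SIGKILL, and report whether the process still exists afterwards.
# usage: still_alive_after_kill PID [DELAY_SECONDS]
still_alive_after_kill() {
    pid=$1
    delay=${2:-1}    # assumed grace period between steps, in seconds

    # Errors (for example "no such process") are deliberately ignored here.
    kill -TERM "$pid" 2>/dev/null || true
    sleep "$delay"
    kill -KILL "$pid" 2>/dev/null || true
    sleep "$delay"

    # kill -0 delivers no signal; it only tests whether the pid exists.
    # Note that it also succeeds for a zombie that has not been reaped yet.
    if kill -0 "$pid" 2>/dev/null; then
        echo "pid $pid survived SIGTERM and SIGKILL"
        return 0
    fi
    echo "pid $pid is gone"
    return 1
}
```

For the stuck process described here, something like
"still_alive_after_kill 1174" would be expected to print the
"survived" line. The sketch has not been run on the affected system.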

"shutdown -r now" hangs before the reboot happens
and reports: "some processes would not die".

An interesting property is that ps and top disagree
about 1174's CPU usage: ps shows 100%, top shows 0%.
But top also always shows CPU0 as the "STATE" for 1174.
(Across tests the n in CPUn varies, but within a test
n is fixed.)

I have also seen ps report a "STAT" of RXs.

The following is from my earlier activity with my own
builds, where the stuck process was 1119, not the 1174
from above. The last thing truss reports for the stuck
process is "getpid()".

. . .
1119: 0.588983953 fstatat(AT_FDCWD,"/usr/tests/sys/kern/kern_copyin",{ mode=-r-xr-xr-x ,inode=111756,size=9776,blksize=10240 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)
1119: 0.589065030 mmap(0x0,20480,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON|MAP_ALIGNED(12),-1,0x0) = 1074188288 (0x4006d000)
1119: 0.589227544 openat(AT_FDCWD,"/tmp/kyua.aBQv6E/2/result.atf",O_WRONLY|O_CREAT|O_TRUNC,0644) = 3 (0x3)
1119: 0.589276503 getpid()                      = 1119 (0x45f)



For reference, from inside an armv7 chroot session
before doing such a test:

# uname -apKU
FreeBSD generic 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n263748-b95d2237af40: Thu Jun 22 11:10:50 UTC 2023     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC arm armv7 1400090 1400090

===
Mark Millard
marklmi at yahoo.com