[Bug 279742] 14.1-RELEASE hangs compiling pspp requiring reboot

From: <bugzilla-noreply_at_freebsd.org>
Date: Sat, 15 Jun 2024 02:23:53 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=279742

            Bug ID: 279742
           Summary: 14.1-RELEASE hangs compiling pspp requiring reboot
           Product: Base System
           Version: 14.0-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Many People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: dgilbert@eicat.ca

Created attachment 251458
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=251458&action=edit
core.txt of crash.

I clicked on 14.0-STABLE because 14.1-RELEASE was not yet a choice.

I upgraded my poudriere box to 14.1, created a new jail for 14.1, and launched
into a "-a" build pretty much immediately after returning from BSDCan.  The
build machine is a Threadripper 1900X with 128G of RAM and 140TB of disk in
RAID-Z2.  It has run poudriere builds stably and almost constantly since I
upgraded it to its current state, about 3 years ago.

After the first poudriere hang, I instrumented things like temperatures.  None
of these spiked, but the hang happened again and again.  After a while, it was
clear that compiling pspp was the trigger.  Note that pspp would have compiled
under 14.0 less than a week before (i.e., just before BSDCan).

I had to build debugging into my kernel and learn how to make it drop into
the debugger.  That took a couple of tries, with the machine repeatedly
crashing while pspp was building.  Top was running in a window I keep open,
and this was the last top output on display.

last pid: 31372;  load averages: 21.72, 32.46, 41.6670    up 0+04:34:59  20:36:48
220 processes: 12 running, 192 sleeping, 2 zombie, 14 waiting
CPU: 21.7% user,  0.0% nice, 40.4% system,  0.0% interrupt, 37.8% idle
Mem: 32M Active, 264K Inact, 124G Wired, 604M Free
ARC: 16G Total, 230M MFU, 334M MRU, 22M Anon, 15G Header, 191M Other
     107M Compressed, 460M Uncompressed, 4.28:1 Ratio
Swap: 256G Total, 98G Used, 158G Free, 38% Inuse, 2868K In, 3612K Out

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
61759 root          2 166  i10    60M  2616K vofflo   3  40:43  75.20% pspp-output
 6367 root          2 166  i10    88M  2604K vofflo  15  36:06  73.63% pspp-output
15409 root          2 166  i10    92M  2608K vofflo   2  33:53  72.64% pspp-output
81893 root          2 166  i10    86M  2600K CPU12   12  34:05  72.04% pspp-output
78622 root          2 166  i10    57M  2588K CPU11   11  28:42  69.19% pspp-output
25531 root          2 166  i10    95M  2616K CPU5     5  27:00  68.84% pspp-output
81789 root          2 166  i10    42M  2584K CPU6     6  23:16  65.11% pspp-output
87988 root          2 166  i10   102M  2596K CPU7     7  20:57  64.28% pspp-output
11364 root          2 166  i10    57M  2612K CPU10   10  19:50  64.14% pspp-output
23538 root          2 166  i10    66M  2604K CPU11   11  21:09  63.94% pspp-output
61379 root          2 166  i10    93M  2624K tmpfs    4  21:10  63.46% pspp-output
85836 root          2 166  i10    74M  2608K CPU14   14  19:19  62.69% pspp-output
58400 root          2 166  i10    76M  2440K RUN      5  13:26  56.27% pspp-output
58294 root          2 166  i10    72M  2444K CPU1     1  14:44  56.15% pspp-output
70050 root          2 166  i10    48M  2440K RUN      1  12:46  56.10% pspp-output
 2561 root          1  20    0   303M  1728K select  12   1:09   0.40% smbd
 2502 postgres      1  20    0   173M  1012K select   4   0:13   0.16% postgres
65067 root          1  20    0    17M  1452K CPU9     9   0:21   0.14% top
 2577 root          1  20    0    17M  1216K select   9   0:40   0.13% tmux
72517 root          6 166  i10  2310M   452K uwait    1   9:40   0.07% ghc-9.6.4
 8903 root         45 166  i10    34G  4716K uwait    0  12:30   0.06% java
 2503 postgres      1  20    0    31M   684K select   7   0:08   0.05% postgres
37351 root          1  20    0    22M   328K select   9   0:03   0.05% sshd
 2190 root          1  20    0    14M   172K select   6   0:00   0.03% syslogd
72294 root         11 166  i10   345M  1664K kqread   4   0:02   0.01% node
 2294 root          1  20    0   280M   228K select  11   0:00   0.01% httpd
 1192 root          1  20    0    18M   340K select   0   0:27   0.01% mountd
 1162 ntpd          1  20    0    23M   520K select  12   0:01   0.01% ntpd
95259 root          1  20    0    12M   328K ttyin    4   0:03   0.01% cu
 1749 uwsgi         1  20    0    57M   412K kqread  12   0:01   0.00% uwsgi-3.8
36420 root          1  20    0    19M   544K select   5   0:01   0.00% minicom
 1307 root          1  20    0   164M   460K kqread   8   0:00   0.00% php-fpm
 1253 root        128  68    0    12M  2316K rpcsvc  11   0:13   0.00% nfsd
91926 root          2 166  i10    74M  2908K pfault  15 123:07   0.00% pspp-output
72530 root         11 166  i10  7498M   836K pfault   5  99:32   0.00% node
46100 root         18 166  i10   261G   932K uwait    4  18:33   0.00% dotnet
73028 root          1 166  i10   165M  4096B WAIT    11   3:56   0.00% <pkg-static>
 2955 root          1 166  i10    15M  4096B wait    13   3:24   0.00% <sh>
93083 root          1 166  i10   195M  4096B WAIT    13   3:14   0.00% <pkg-static>
22537 root          6 166  i10   298M    17M uwait   10   1:22   0.00% ld.lld
 2588 root          1 166  i10    22M   224K select   6   1:02   0.00% sh
24257 root          1 166  i10   145M  4096B WAIT    12   0:42   0.00% <pkg-static>
 1301 www           1  20    0    27M  4096B WAIT     5   0:32   0.00% <nginx>
90301 root         14 166  i10   261G   260K uwait    4   0:24   0.00% dotnet
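
For anyone reading the memory summary above, here is a rough sketch of the
arithmetic on the "Mem:" and "ARC:" lines.  It is not part of the original
capture.  The helper names (parse_size, parse_summary) are made up for
illustration, and it assumes top's K/M/G suffixes are powers of 1024 and
that ARC memory is counted inside the Wired figure.

```python
import re

# Assumption: top(1) size suffixes are binary (powers of 1024).
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
GIB = 1024**3


def parse_size(text):
    """Convert a top-style size such as '124G' or '264K' to bytes."""
    match = re.fullmatch(r"(\d+)([BKMGT])", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def parse_summary(line):
    """Turn 'Mem: 32M Active, 264K Inact' into {'Active': bytes, ...}."""
    _, _, rest = line.partition(":")
    fields = {}
    for item in rest.split(","):
        parts = item.split()
        if len(parts) >= 2:
            fields[parts[1]] = parse_size(parts[0])
    return fields


# The two lines are copied verbatim from the top output in this report.
mem = parse_summary("Mem: 32M Active, 264K Inact, 124G Wired, 604M Free")
arc = parse_summary(
    "ARC: 16G Total, 230M MFU, 334M MRU, 22M Anon, 15G Header, 191M Other"
)

# Wired memory that the ARC total does not account for.
wired_outside_arc = mem["Wired"] - arc["Total"]
print(f"wired: {mem['Wired'] // GIB}G, ARC: {arc['Total'] // GIB}G, "
      f"wired outside ARC: {wired_outside_arc // GIB}G")
```

Under those assumptions, roughly 108G of the 124G wired is not explained by
the ARC total, and 15G of the 16G ARC is reported as "Header".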

It's worth noting here that virtual terminal switching (Alt-F<n>) still works
after this happens, but no other input is recognized (hitting return in a
window does nothing, and shells going through the machine to others don't
continue their output).

When it happened this time, I dropped to KDB and dumped.  core.txt attached.

NOTE: this is repeatable.  I have been through the cycle 6 times so far.
