Re: More swap trouble with armv7, was Re: -current on armv7 stuck with flashing disk light

Reply: Mark Millard : "Re: More swap trouble with armv7, was Re: -current on armv7 stuck with flashing disk light"
In reply to: bob prohaska : "More swap trouble with armv7, was Re: -current on armv7 stuck with flashing disk light"
From: Mark Millard <marklmi_at_yahoo.com>
Date: Tue, 04 Jul 2023 21:22:18 UTC
On Jul 4, 2023, at 12:07, bob prohaska <fbsd@www.zefox.net> wrote:

> On Tue, Jun 27, 2023 at 10:16:57AM -0700, bob prohaska wrote:
>> On Tue, Jun 27, 2023 at 09:59:40AM -0700, Mark Millard wrote:
>>>> 
>>>> If you want to identify system hangs, please
>>>> put back:
>>>> 
>>>> vm.swap_enabled=0
>>>> vm.swap_idle_enabled=0
>>>> 
>> 
>> They're reinstated now, but I don't want to disturb the system
>> while it seems to be building world acceptably. 
>> 
> Reinstating 
> vm.swap_enabled=0
> vm.swap_idle_enabled=0
> 
> and limiting buildworld to -j3 allows buildworld to complete successfully in 1 GB of swap.
> 
> Meanwhile, attempts to compile sysutils/usbtop using poudriere still cause swap exhaustion
> while compiling /devel/llvm15 even with 2 GB of swap allocated. 

What parallelism settings did poudriere use for the
devel/llvm15 build attempt? Have you tried allowing
less parallelism (if what you have tried leaves room
for less)?

What options are enabled vs. disabled for devel/llvm15 ?

BE_STANDARD vs. BE_FREEBSD vs. BE_NATIVE ?

BE_NATIVE probably helps limit resource use the most, if
it happens to be sufficient. BE_FREEBSD would be in the
middle of the 3 options for this issue.

Is MLIR enabled? If having it disabled is sufficient,
disabling it should avoid the resource use that building
it needs. Similarly for FLANG. (Building FLANG requires
MLIR, so having MLIR disabled implies FLANG also needs to
be disabled.)

> The messages are
> Jul  4 11:18:48 www kernel: pid 1074 (getty), jid 0, uid 0, was killed: out of swap space

In my view the "out of swap space" is still a misleading
misnomer for this context, but at least the following
messages are more specific to the actual internal
data-structure(s) problem(s). My understanding is that
the data structures can have fragmentation issues.

For fragmentation issues, prior history since booting
might contribute, and building just after a reboot may
end up with less fragmentation. (Unknown if sufficiently
less.)

Also, over-allocating the swap partition (by not having
kern.maxswzone appropriately matching) probably makes
"swap blk zone exhausted" more likely. It is one of the
reasons I avoid swap partitioning whose total size
generates the message about possible mistuning.

> swap blk zone exhausted, increase kern.maxswzone

Have you ever gotten the above line before? I was
unaware of any examples of it showing up.

> swblk zone ok

I'll note that there is another potential message
pair for "swap pctrie zone exhausted"/"swpctrie zone ok"
that you have not reported getting.

Have you ever seen the "swap pctrie zone exhausted"
notice? (Just curiosity on my part.)

> IIRC the "increase kern.maxswzone" is unhelpful, if not impossible. The
> "swblk zone ok" seems new. 

Are you using the default kern.maxswzone for your context?
What is its value?

Did you get the notice about possible mistuning for your
combination of swap partition sizing and kern.maxswzone
value? Or did "swap blk zone" happen even without that
notice happening?

> From the gstat output near peak swap use the system wasn't I/O bound,

The "swap blk zone" contains an in-kernel-RAM data
structure that is involved in managing the swap space
usage.

> the disk was less than 25% busy at the time of the first OOMA kill.

"swap blk zone" can end up with fragmentation issues, where
the total available is only made up of a bunch of tiny chunks
and nothing large can be handled as a unit any more. (A general
description of "fragmented".)
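
As a rough illustration of that general description, here is a
minimal Python sketch. It is not the kernel's swblk zone logic:
the 16-slot pool, the alternating usage pattern, and the request
size of 4 are all hypothetical, chosen only to show how a pool
can have plenty free in total yet nothing large enough to hand
out as one unit.

```python
def free_runs(slots):
    """Return the length of each contiguous run of free slots.

    slots is a list of booleans: True = in use, False = free.
    """
    runs, current = [], 0
    for used in slots:
        if used:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


# Hypothetical pool: every other slot is in use, so half is free.
slots = [True, False] * 8

runs = free_runs(slots)
total_free = sum(runs)   # free slots overall
largest = max(runs)      # biggest contiguous free chunk
request = 4              # hypothetical request for 4 adjacent slots

print("total free:", total_free)
print("largest contiguous chunk:", largest)
print("request fits:", largest >= request)
```

Half of the pool is free, but the request still cannot be
satisfied because no free chunk is longer than one slot.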

> Eventually it was possible to log in on the serial console and run top:
> 
> 33 processes:  1 running, 29 sleeping, 3 zombie
> CPU:  0.0% user,  0.0% nice, 10.6% system,  0.2% interrupt, 89.2% idle
> Mem: 139M Active, 8256K Inact, 252M Laundry, 221M Wired, 98M Buf, 292M Free
> Swap: 2048M Total, 1291M Used, 756M Free, 63% Inuse
> 
>  PID   JID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
> 40719     0 root          1  20  -20     0B  8192B swzonx   0   0:12   9.15% cron
> 40717     0 root          1  20  -20     0B  8192B swzonx   0   0:34   9.08% sh
> 40709     0 root          1  20  -20     0B  8192B swzonx   0   0:38   9.01% sshd
> 40720     0 root          1  20  -20     0B  8192B swzonx   3   0:13   7.47% sh

Unfortunately the swzonx text is truncated. There is
actually:

pause("swzonxb", 10); for swblk zone
and:
pause("swzonxp", 10); for swap pctrie zone

top's display leaves it unclear which was involved.
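
A small Python sketch of why the two cannot be told apart. The
6-character width is an assumption read off the "swzonx" in the
top output above, not taken from top's source.

```python
# Assumed display width, inferred from the "swzonx" shown by top.
STATE_WIDTH = 6


def shown_state(wmesg, width=STATE_WIDTH):
    """Return the wait message as a fixed-width column would show it."""
    return wmesg[:width]


for wmesg in ("swzonxb", "swzonxp"):
    print(wmesg, "->", shown_state(wmesg))
```

Both messages come out as the same text, so the column alone
cannot say whether the swblk zone or the swap pctrie zone was
involved.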

> 40721     0 bob           1  20    0  6608K  2600K CPU1     1   0:00   0.32% top
> 25761     0 bob           1  20    0    14M  6136K select   0   0:02   0.03% sshd
> 25852     0 root          1  20    0  4668K  1648K ttyin    1   0:01   0.03% tip
> 1237     0 root          1  20    0  5820K  1540K wait     1   0:12   0.00% sh
> 25381     0 root          1  23    0    14M  5868K select   1   0:01   0.00% sshd
> 1030     0 root          1  24    0    13M  2416K vmbckw   1   0:00   0.00% sshd
> 12715     0 root          1  68    0  5820K  1660K wait     0   0:00   0.00% sh
> 12710     0 root          1  20    0  5820K  1556K piperd   1   0:00   0.00% sh
>  929     0 root          1  20    0  5356K  1256K select   3   0:00   0.00% syslogd
> 1014     0 root          1  20    0  5124K  1356K nanslp   2   0:00   0.00% cron
> 25770     0 bob           1  36    0  6844K  3116K pause    1   0:00   0.00% tcsh
> 25794     0 bob           1  24    0  5380K  2188K wait     2   0:00   0.00% su
> 39626     0 root          1  20    0  5424K  2404K wait     2   0:00   0.00% login
> 40635     0 bob           1  20    0  6824K  3272K pause    1   0:00   0.00% tcsh
> 25820     0 root          1  21    0  5608K  2204K wait     0   0:00   0.00% sh
> 25851     0 root          1  20    0  4668K  1656K ttyin    3   0:00   0.00% tip
> 40454     0 root          1  24    0  4636K  1780K ttyin    3   0:00   0.00% getty
> 
> I'll let it go for a while to see if poudriere notices it's failed and cleans up.
> 
> At the moment /boot/loader.conf contains
> 
> # Configure USB OTG; see usb_template(4).
> hw.usb.template=3
> umodem_load="YES"
> # Disable the beastie menu and color
> beastie_disable="YES"
> loader_color="NO"
> vm.pageout_oom_seq="4096"
> vm.pfault_oom_attempts="3"
> vm.pfault_oom_attempts="120"

2 assignments to the same thing in a row?
The 2nd ends up controlling the value.

> vm.pfault_oom_wait="20"

So you are allowing it 120 * 20 sec == 2400 sec
(in other words, 40 minutes of retrying every 20
seconds) to handle a page fault.
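
The arithmetic, and the "2nd assignment wins" point, can be
sketched in Python. This is only an illustration: parse_tunables
is a hypothetical, simplified stand-in for how name="value" lines
are read, not the loader's actual parser. The values are the
ones from the quoted /boot/loader.conf.

```python
def parse_tunables(text):
    """Parse simple name="value" lines; a later assignment to the
    same name overrides an earlier one."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip().strip('"')
    return values


LOADER_CONF = '''
vm.pfault_oom_attempts="3"
vm.pfault_oom_attempts="120"
vm.pfault_oom_wait="20"
'''

tunables = parse_tunables(LOADER_CONF)
attempts = int(tunables["vm.pfault_oom_attempts"])   # second line wins
wait_seconds = int(tunables["vm.pfault_oom_wait"])
total_seconds = attempts * wait_seconds

print("attempts:", attempts)
print("wait per attempt (sec):", wait_seconds)
print("total (sec):", total_seconds)
print("total (min):", total_seconds // 60)
```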

That time scale may have contributed to why it
failed first for "swap blk zone exhausted"
instead of more usual types of OOM cause:
How many page faults had active 40 minute
intervals at the time?

You may just be moving around where a problem
shows up, rather than avoiding a failure
overall.

> kern.cam.boot_delay="20000"
> vfs.ffs.dotrimcons="1"
> vfs.root_mount_always_wait="1"
> filemon_load="YES"
> 
> /usr/local/etc/poudriere.conf contains
> USE_TMPFS=no
> NOHANG_TIME=28800
> MAX_EXECUTION_TIME_EXTRACT=14400
> MAX_EXECUTION_TIME_INSTALL=14400
> MAX_EXECUTION_TIME_PACKAGE=432000
> ALLOW_MAKE_JOBS=yes
> MAX_JOBS_NUMBER=2

I do not remember there being a MAX_JOBS_NUMBER in
the infrastructure. So I will ignore that line. It
probably should be deleted.

> MAKE_JOBS_NUMBER=2
> 
> Do these settings look reasonable?

ALLOW_MAKE_JOBS/MAKE_JOBS_NUMBER is not independent
of what is being built. There is no global, single
answer to "looks reasonable" for them.

However, MAKE_JOBS_NUMBER is in the wrong file.
It is from/for make, not from/for poudriere
directly. (But there is a way for poudriere
to contribute such to make.)

For example (from a grep):

/usr/local/etc/poudriere.d/make.conf:MAKE_JOBS_NUMBER=2

(MAKE_JOBS_NUMBER_LIMIT goes in the same
place.)

You might need to use MAKE_JOBS_NUMBER=1, or
to not assign ALLOW_MAKE_JOBS at all, to have
a chance of the devel/llvm15 build fitting,
if you have already turned off the options
for what you do not need (so that no resources
go to building it).

===
Mark Millard
marklmi at yahoo.com