Re: Troubles building world on stable/13

From: bob prohaska <fbsd_at_www.zefox.net>
Date: Tue, 25 Jan 2022 22:17:53 UTC

On Tue, Jan 25, 2022 at 12:49:02PM -0800, Mark Millard wrote:
> On 2022-Jan-25, at 10:08, bob prohaska <fbsd@www.zefox.net> wrote:
> 
> > On Tue, Jan 25, 2022 at 09:13:08AM -0800, Mark Millard wrote:
> >> 
> >> -DBATCH ? I'm not aware of there being any use of that symbol.
> >> Do you have a documentation reference for it so that I could
> >> read about it?
> >> 
> > It's a switch to turn off dialog4ports. I can't find the reference
> > now. Perhaps it's been deprecated? A name like -DUSE_DEFAULTS would
> > be easier to understand anyway. 
> 
> I've never had buildworld buildkernel or the like try to use
> dialog4ports. I've only had port building use it. buildworld
> and buildkernel can be done with no ports installed at all.
> dialog4ports is a port.
> 

The attempt to build devel/llvm13 under stable/13 was done under ports.
Thus the -DBATCH, to avoid manual intervention.

> I think -DBATCH was ignored for the activity at hand.
> 
> > On a whim, I tried building devel/llvm13 on a Pi4 running -current with 
> > 8 GB of RAM and 8 GB of swap. To my surprise, that stopped with:
> > nemesis.zefox.com kernel log messages:
> > +FreeBSD 14.0-CURRENT #26 main-5025e85013: Sun Jan 23 17:25:31 PST 2022
> > +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1873450, size: 4096
> > +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 521393, size: 4096
> > +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 209826, size: 12288
> > +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1717218, size: 24576
> > +pid 56508 (c++), jid 0, uid 0, was killed: failed to reclaim memory
> > 
> > On an 8GB machine, that seems strange. 
> 
> -j<What?> build? -j4 ?
> 
Since this too was a port build, I let ports decide. It settled on 4.

> Were you watching the swap usage in top (or some such)?
>

Top was running but the failure happened overnight. Not expecting 
it to fail, I didn't keep a log of swapping activity. The message
above was in the next morning's log email.
 
> Note: The "was killed" related notices have been improved
> in main, but there is a misnomer case about "out of swap"
> (last I checked).
>
 
> An environment that gets "swap_pager: indefinite wait buffer"
> notices is problematical and the I/O delays for the virtual
> memory subsystem can lead to kills, if I understand right.
> 
> But, if I remember right, the actual message for a directly
> I/O related kill is now different.
> 

In this case the message was "failed to reclaim memory", a 
message I've not seen before. 

> I think that being able to reproduce this case could be
> important. I probably can not because I'd not get the
> "swap_pager: indefinite wait buffer" in my hardware
> context.
>

If it's relevant, the case of /usr/ports/devel/llvm13 seems like
the most expedient test, since it did fail with realistic amounts
of memory and swap. I gather that there's a certain amount of 
self-recompilation in buildworld, is that true of the port version?
Does it matter?

> > Per the failure message I restarted the build of devel/llvm13 with 
> > make -DBATCH MAKE_JOBS_UNSAFE=YES > make.log &
> 
> Just like -DBATCH is for ports, not buildworld buildkernel,
> MAKE_JOBS_UNSAFE= is for ports, not buildworld buildkernel,
> at least if I understand right.
>
This was a ports build on the Pi4. The restart is running single-threaded
and is quite slow; I'm tempted to stop it unless a failure would be useful.
 
> 
> >>> However, restarting buildworld using -j1 appears to have worked past
> >>> the former point of failure.
> >>
[this on stable/13 pi3] 
> >> Hmm. That usually means one (or both) of two things was involved
> >> in the failure:
> >> 
> >> A) a build race where something is not (fully) ready when
> >>   it is used
> >> 
> >> B) running out of resources, such as RAM+SWAP
> >> 
> > 
> > The stable/13 machine is short of swap; it has only 2 GB, which
> > used to be enough.
> 
> So RAM+SWAP is 1 GiByte + 2 GiByte, so 3 GiByte on that
> RPi3*? (That would have been good to know earlier, such
> as for my attempts at reproduction.)
>
Correct, 3GB RAM+swap. Didn't realize it would turn out to 
be important, sorry!

> -j<What?> for the RPi3* when it was failing?
>
-j4, but I think it also failed at -j2.

> Did you have failures with the .cpp and .sh (so no
> make use involved) in the RAM+SWAP context?
> 
Using the .cpp and .sh files on a Pi3 with 2 GB swap 
running stable/13 there was a consistent failure.

Using the .cpp and .sh files on a Pi3 with 7GB swap
there was no failure. 

Using a build of /usr/ports/devel/llvm13 as a test the
build failed even with 8 GB of RAM and 8 GB of swap.

> > Maybe that's the problem, but having an error 
> > report that says it's a segfault is a confusing diagnostic. 
> > 
> >> But, as I understand, you were able to use a .cpp and
> >> .sh file pair that had been produced to repeat the
> >> problem on the RPi3B --and that would not have been a
> >> parallel-activity context.
> >> 
> > 
> > To be clear, the reproduction was on the same stable/13 that
> > reported the original failure. An attempt at reproduction
> > on a different Pi3 running -current ran without any errors.
> > Come to think of it, that machine had more swap, too.
> 
> How much swap?
> 
Two swap partitions, 3.6 GB and 4 GB, both in use.

> 
> At this point, I expect that the failure was tied to the
> RAM+SWAP totaling to 3 GiBytes.
>

That seems likely, or at least a reasonable suspicion. 

> Knowing that context we might have a reproducible report
> that can be made based on the .cpp and .sh files, where
> restricting the RAM+SWAP use allowed is part of the
> report.
>
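One rough way to approximate a restricted RAM+SWAP context on a larger
machine might be to cap the compiler's address space with the shell's
ulimit -v before running the .sh file. This is only a sketch: the helper
name run_capped is made up for illustration, and a virtual-address-space
limit is not the same thing as a RAM+SWAP limit, so the compiler may fail
with an allocation error rather than being killed by the kernel.

```shell
#!/bin/sh
# run_capped KIB CMD [ARGS...]
# Hypothetical helper: run CMD in a subshell whose virtual address space
# is limited to KIB kibibytes. The limit applies only inside the subshell,
# so the calling shell keeps its own limits.
run_capped() {
    limit_kib=$1
    shift
    (
        # Exit with 125 if the limit itself cannot be set.
        ulimit -v "$limit_kib" || exit 125
        "$@"
    )
}
```

For a 3 GiB cap (3 * 1024 * 1024 = 3145728 KiB) the call would look like
run_capped 3145728 sh ./name-of-the-reproducer.sh, with the real .sh file
name substituted.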
 
There seem to be some other reports of clang using unreasonable
amounts of memory, for example 
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261341

A much older report that looks vaguely similar (out of memory
reported as segfault)
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=172576
It's not arm-related and dates from 2012 but is still open.

I'll try to repeat some of the tests using the logging script
used previously. Right now it contains:

#!/bin/sh
# Loop until interrupted; gstat's 10 s sampling interval paces each pass.
while true ; do
    sysctl hw.regulator.5v0.min_uvolt ; vmstat ; gstat -abd -I 10s
    date ; swapinfo ; tail -n 2 /var/log/messages
    netstat -m | grep "mbuf clusters" ; ps -auxd -w -w
done

Changes to the script are welcome; the output is voluminous.
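
One untested idea for cutting the volume is to reduce the swapinfo output
to a single line per pass. The sketch below assumes swapinfo -k keeps its
usual column layout (device, 1K-blocks, Used, ...) and that device lines
start with "/"; the function name swap_summary is made up for illustration.

```shell
#!/bin/sh
# swap_summary: read "swapinfo -k" style output on stdin and print one
# line with total swap used, total swap size, and percent used.
# Only lines whose first field starts with "/" (device paths) are summed,
# so the header and any "Total" line are ignored.
swap_summary() {
    awk '
        $1 ~ /^\// { total += $2; used += $3 }
        END {
            if (total > 0)
                printf "swap used %d of %d KiB (%d%%)\n", used, total, used * 100 / total
            else
                print "no swap devices found"
        }
    '
}
```

In the logging loop it could replace the bare swapinfo call, for example:
date ; swapinfo -k | swap_summary ; tail -n 2 /var/log/messages ; sleep 10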

Thanks for reading!

bob prohaska