Re: devel/llvm13 failed to reclaim memory on 8 GB Pi4 running -current

From: Mark Millard <marklmi_at_yahoo.com>
Date: Thu, 27 Jan 2022 21:35:04 UTC
On 2022-Jan-27, at 12:12, Mark Millard <marklmi@yahoo.com> wrote:

> On 2022-Jan-27, at 11:31, Mark Millard <marklmi@yahoo.com> wrote:
> 
>> On 2022-Jan-27, at 08:45, bob prohaska <fbsd@www.zefox.net> wrote:
>> 
>>> Attempts to compile devel/llvm13 on a Pi4 running -current (updated
>>> on 20220126) with 8 GB of RAM and 8 GB of swap has failed on two occasions using 
>>> make -DBATCH > make.log & 
>>> in /usr/ports/devel/llvm13 using the system compiler. The system is
>>> self-hosted. 
> 
> Context question: ZFS? UFS?
> 
> (In things involving memory usage issues, knowing which is
> always appropriate because of differences in memory use
> patterns.)
> 
>>> The first failure reported clang error 139, but the second
>>> was different, reporting only:
>>> FAILED: tools/flang/lib/Evaluate/CMakeFiles/obj.FortranEvaluate.dir/check-expression.cpp.o
>>> along with a console report of
>>> +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1258432, size: 4096
>>> +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 627221, size: 8192
>>> +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 240419, size: 4096
>>> +swap_pager: out of swap space
>> 
>> In recent builds, such as yours, the above "out of swap" is a
>> misnomer but is very interesting for what it is actually about.
>> 
>> Mark Johnston later wrote on 2022-Jan-15 about his "git:
>> 4a864f624a70 - main - vm_pageout: Print a more accurate message
>> to the console before an OOM kill" that produced the above report
>> of "out of swap space":
>> 
>> QUOTE
>> Hmm, those cases should likely be changed from "out of swap space" to
>> "failed to allocate swap metadata" or something like that.
>> END QUOTE
>> 
>> Your context proves the metadata problem really happens, so
>> the messaging should be fixed to not be misleading.
>> 
>> In my builds I've code that is more explicit:
>> 
>> diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c
>> index 01cf9233329f..280621ca51be 100644
>> --- a/sys/vm/swap_pager.c
>> +++ b/sys/vm/swap_pager.c
>> @@ -2091,6 +2091,7 @@ swp_pager_meta_build(vm_object_t object, vm_pindex_t pindex, daddr_t swapblk)
>>                                  0, 1))
>>                                      printf("swap blk zone exhausted, "
>>                                          "increase kern.maxswzone\n");
>> +                               printf("swp_pager_meta_build: swap blk uma zone exhausted\n");
>>                              vm_pageout_oom(VM_OOM_SWAPZ);
>>                              pause("swzonxb", 10);
>>                      } else
>> @@ -2121,6 +2122,7 @@ swp_pager_meta_build(vm_object_t object, vm_pindex_t pindex, daddr_t swapblk)
>>                                  0, 1))
>>                                      printf("swap pctrie zone exhausted, "
>>                                          "increase kern.maxswzone\n");
>> +                               printf("swp_pager_meta_build: swap pctrie uma zone exhausted\n");
>>                              vm_pageout_oom(VM_OOM_SWAPZ);
>>                              pause("swzonxp", 10);
>>                      } else
>> 
>> The "metadata" is the "swap blk uma zone" and "swap pctrie
>> uma zone". Unfortunately, which one got the failure is still
>> not indicated in the standard builds.
>> 
>>> +swp_pager_getswapspace(12): failed
>>> +pid 61012 (c++), jid 0, uid 0, was killed: failed to reclaim memory
>> 
>> Absent being able to swap, it tries to reclaim --and that
>> too failed. That finally leads to the kills.
>> 
>>> Swap use peaked a little over 50%.
>> 
>> So at around 50% "swap blk uma zone" and/or "swap pctrie uma zone"
>> had problems, probably fragmentation related problems.
>> 
>>> After the first failure a restart
>>> of make using MAKE_JOBS_UNSAFE=yes ran to completion with one thread.
>>> 
>>> A copy of the build log, logging script and other notes is at
>>> http://www.zefox.net/~fbsd/rpi4/20220127/
>>> 
>>> Clang error 139 has been seen several times during make buildworld on a Pi3 running
>>> stable/13 with 2 GB of swap as well. Perhaps the two failures are related. The Pi3 
>>> failures didn't report out of swap, all were clang error 139 with "failed to reclaim 
>>> memory". Even with only 1 thread (j1) the failure reproduced.

So far as I know, stable/13 does not yet have the changes
to the messaging about kills for failures to reclaim
memory: it is still like it used to be for so long. Only
main has the changes.

This makes unmodified stable/13 messages not
nearly so interesting when they are produced. It will
be this way until something based on "git: 4a864f624a70
- main - vm_pageout: Print a more accurate message to
the console before an OOM kill" is in place in
stable/13 (or somewhat analogous local changes are in
place).



I'm updating the media for my 8 GiByte RPi4B
configuration so that its bectl environment for
main is nearly a copy of (lines split for
readability):

# uname -apKU
FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #37
main-n252475-e76c0108990b-dirty: Sat Jan 15 21:53:08 PST 2022
root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72
arm64 aarch64 1400047 1400047

That has my variant of Mark Johnston's new messaging and
my additional messages as well. So on failure, it should
report which metadata got the problem.

The update also includes adding an 8 GiByte swap partition
as an alternative. I'll temporarily have it configured to
boot using just that swap partition. Another thing is
that I'll remove my usual options for devel/llvm13 so that
just defaults are used, including building of flang.

So I hope to reproduce the problem in my context and to
be able to report which of the two kinds of metadata
caused the metadata-driven messaging.
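
For anyone wanting to watch the two metadata zones while such a
build runs, here is a minimal sketch, not something taken from the
builds above. It assumes the zones appear in "vmstat -z" output
under the names "swblk" and "swpctrie" and that each line has the
form "name: size, limit, used, free, req, fail, sleep, xdomain".
The function name swap_zone_report is hypothetical.

```sh
#!/bin/sh
# Reduce "vmstat -z" style text (read from stdin) to the two swap
# metadata zones, printing their USED, FREE and FAIL columns.
# Assumptions: zone names "swblk" and "swpctrie"; column order
#   name: size, limit, used, free, req, fail, sleep, xdomain
swap_zone_report() {
    awk -F'[:,]' '
        {
            # Trim blanks around the zone name before comparing.
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)
        }
        name == "swblk" || name == "swpctrie" {
            # $4 = used, $5 = free, $7 = fail (allocation failures)
            printf "%s used=%d free=%d fail=%d\n", name, $4, $5, $7
        }'
}

# Intended use on the machine doing the build:
#   vmstat -z | swap_zone_report
```

If the assumptions hold, a FAIL count that grows for one of the two
zones around the time of the console messages would suggest which
zone was the one exhausted.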

It does take a while to synchronize the media involved
to be based on the CA72_16Gp_ZFS media, and building
devel/llvm13 on an RPi4B also takes a while. But I'll report
once I have the console messages (or whatever happens).


>> Note in your report above: obj.FortranEvaluate.dir
>> 
>> If you use the options to disable building flang (a.k.a.,
>> the Fortran compiler build), your builds on the RPi4B
>> will likely work in the current configuration.
>> 
>> But it looks like you have identified a test context
>> for the "swap blk uma zone" and "swap pctrie uma zone"
>> handling.


===
Mark Millard
marklmi at yahoo.com