Re: Troubles building world on stable/13
Date: Wed, 26 Jan 2022 00:08:15 UTC
On 2022-Jan-25, at 14:17, bob prohaska <fbsd@www.zefox.net> wrote:

> On Tue, Jan 25, 2022 at 12:49:02PM -0800, Mark Millard wrote:
>> On 2022-Jan-25, at 10:08, bob prohaska <fbsd@www.zefox.net> wrote:
>>
>>> On Tue, Jan 25, 2022 at 09:13:08AM -0800, Mark Millard wrote:
>>>>
>>>> -DBATCH ? I'm not aware of there being any use of that symbol.
>>>> Do you have a documentation reference for it so that I could
>>>> read about it?
>>>>
>>> It's a switch to turn off dialog4ports. I can't find the reference
>>> now. Perhaps it's been deprecated? A name like -DUSE_DEFAULTS would
>>> be easier to understand anyway.
>>
>> I've never had buildworld buildkernel or the like try to use
>> dialog4ports. I've only had port building use it. buildworld
>> and buildkernel can be done with no ports installed at all.
>> dialog4ports is a port.
>>
>
> The attempt to build devel/llvm13 under stable/13 was done under ports.
> Thus the -DBATCH, to avoid manual intervention.

I missed that the later reference to devel/llvm13 applied to the
above, and then I confused the contexts, effectively ignoring
devel/llvm13 completely. Sorry.

>> I think -DBATCH was ignored for the activity at hand.
>>
>>> On a whim, I tried building devel/llvm13 on a Pi4 running -current with
>>> 8 GB of RAM and 8 GB of swap. To my surprise, that stopped with:
>>>
>>> nemesis.zefox.com kernel log messages:
>>> +FreeBSD 14.0-CURRENT #26 main-5025e85013: Sun Jan 23 17:25:31 PST 2022
>>> +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1873450, size: 4096
>>> +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 521393, size: 4096
>>> +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 209826, size: 12288
>>> +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1717218, size: 24576
>>> +pid 56508 (c++), jid 0, uid 0, was killed: failed to reclaim memory
>>>
>>> On an 8GB machine, that seems strange.
>>
>> -j<What?> build? -j4 ?
>>
> Since this too was a port build, I let ports decide. It settled on 4.
>
>> Were you watching the swap usage in top (or some such)?
>>
>
> Top was running but the failure happened overnight. Not expecting
> it to fail, I didn't keep a log of swapping activity. The message
> above was in the next morning's log email.
>
>> Note: The "was killed" related notices have been improved
>> in main, but there is a misnomer case about "out of swap"
>> (last I checked).
>>
>
>> An environment that gets "swap_pager: indefinite wait buffer"
>> notices is problematical and the I/O delays for the virtual
>> memory subsystem can lead to kills, if I understand right.
>>
>> But, if I remember right, the actual message for a directly
>> I/O related kill is now different.
>>
>
> In this case the message was "failed to reclaim memory", a
> message I've not seen before.

Yea, it is a more accurate wording of the old out-of-swap
notices --probably covering most occurrences.

>> I think that being able to reproduce this case could be
>> important. I probably can not because I'd not get the
>> "swap_pager: indefinite wait buffer" in my hardware
>> context.

I was thinking buildworld buildkernel here. I got the context
wrong. I'll eventually do a devel/llvm13 build on the 8 GiByte
RPi4B with my patched top monitoring various "maximum observed"
figures.

> If it's relevant, the case of /usr/ports/devel/llvm13 seems like
> the most expedient test, since it did fail with realistic amounts
> of memory and swap. I gather that there's a certain amount of
> self-recompilation in buildworld, is that true of the port version?
> Does it matter?
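Going back to the point about not having a log of the overnight
swapping activity: something minimal left running in the background
can be enough to show how close RAM+SWAP got before a kill. A sketch
only --the 30 s interval and the log path are just example choices:

#!/bin/sh
# Minimal sketch: append timestamped memory/swap snapshots so an
# overnight kill still leaves a record. The 30 s interval and the
# log path are example choices only.
LOG=/var/log/swap_watch.log
while true
do
    { date ; swapinfo ; vmstat ; } >> "$LOG"
    sleep 30
done

swapinfo's "Used" column plus the vmstat figures would at least show
whether usage was approaching the total before the kill happened.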
>
>>> Per the failure message I restarted the build of devel/llvm13 with
>>> make -DBATCH MAKE_JOBS_UNSAFE=YES > make.log &
>>
>> Just like -DBATCH is for ports, not buildworld buildkernel,
>> MAKE_JOBS_UNSAFE= is for ports, not buildworld buildkernel,
>> at least if I understand right.
>>
> This was a ports build on the Pi4. The restart is running single-thread
> and quite slow; I'm tempted to stop it unless a failure would be useful.

Again an example of my not switching context correctly. Sorry.

>>>>> However, restarting buildworld using -j1 appears to have worked past
>>>>> the former point of failure.
>>>>
> [this on stable/13 pi3]
>>>> Hmm. That usually means one (or both) of two things was involved
>>>> in the failure:
>>>>
>>>> A) a build race where something is not (fully) ready when
>>>> it is used
>>>>
>>>> B) running out of resources, such as RAM+SWAP
>>>>
>>>
>>> The stable/13 machine is short of swap; it has only 2 GB, which
>>> used to be enough.
>>
>> So RAM+SWAP is 1 GiByte + 2 GiByte, so 3 GiByte on that
>> RPi3*? (That would have been good to know earlier, such
>> as for my attempts at reproduction.)
>>
> Correct, 3GB RAM+swap. Didn't realize it would turn out to
> be important, sorry!

I do not know yet whether it would have helped reproduction of
the problem. But I now know that I should try for something that
would give evidence about getting near or over 3 GiBytes.

>> -j<What?> for the RPi3* when it was failing?
>>
> -j4, but I think it also failed at -j2.

>> Did you have failures with the .cpp and .sh (so no
>> make use involved) in the RAM+SWAP context?
>>
> Using the .cpp and .sh files on a Pi3 with 2 GB swap
> running stable/13 there was a consistent failure.

Ahh, a simpler, quicker test context/case. So that is likely
what I'd look into.

> Using the .cpp and .sh files on a Pi3 with 7GB swap
> there was no failure.
>
> Using a build of /usr/ports/devel/llvm13 as a test the
> build failed even with 8 GB of RAM and 8 GB of swap.
>
>>> Maybe that's the problem, but having an error
>>> report that says it's a segfault is a confusing diagnostic.
>>>
>>>> But, as I understand, you were able to use a .cpp and
>>>> .sh file pair that had been produced to repeat the
>>>> problem on the RPi3B --and that would not have been a
>>>> parallel-activity context.
>>>>
>>>
>>> To be clear, the reproduction was on the same stable/13 that
>>> reported the original failure. An attempt at reproduction
>>> on a different Pi3 running -current ran without any errors.
>>> Come to think of it, that machine had more swap, too.
>>
>> How much swap?
>>
> Two swap partitions, 3.6 GB and 4 GB, both in use.

So that is the devel/llvm13 example, not buildworld buildkernel,
not the .cpp and .sh combination.

>>
>> At this point, I expect that the failure was tied to the
>> RAM+SWAP totaling to 3 GiBytes.
>>
>
> That seems likely, or at least a reasonable suspicion.
>
>> Knowing that context we might have a reproducible report
>> that can be made based on the .cpp and .sh files, where
>> restricting the RAM+SWAP use allowed is part of the
>> report.
>>
>
> There seem to be some other reports of clang using unreasonable
> amounts of memory, for example:
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261341
>
> A much older report that looks vaguely similar (out of memory
> reported as segfault):
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=172576
> It's not arm-related and dates from 2012 but is still open.
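On restricting the RAM+SWAP use allowed as part of such a report:
a per-process cap is not an exact match for system-wide RAM+SWAP
exhaustion, but it might be enough to trigger the same kill path
without repartitioning. A sketch only, assuming limits(1); the 3g
figure just mirrors the 3 GiByte total above, and run_cpp_test.sh
is a hypothetical name for whatever drives the c++ command:

#!/bin/sh
# Sketch: run the .cpp/.sh reproducer under a virtual memory cap.
# "3g" mirrors the 3 GiByte RAM+SWAP figure discussed above;
# "run_cpp_test.sh" is a placeholder for the actual test driver.
limits -v 3g sh ./run_cpp_test.sh

Where the partitioning allows it, swapoff(8) on one of the swap
devices would be the more direct way to shrink the real total.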
>
> I'll try to repeat some of the tests using the logging script
> used previously. Right now it contains:
>
> #!/bin/sh
> while true
> sysctl hw.regulator.5v0.min_uvolt ; do vmstat ; gstat -abd -I 10s ; date ; swapinfo ; tail \
> -n 2 /var/log/messages ; netstat -m | grep "mbuf clusters" ; ps -auxd -w -w
> done
>
> Changes to the script are welcome, the output is voluminous.
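One possible rework, as a sketch only: the same commands, but with
an explicit loop body, the timestamp first, and everything appended
to one file so nothing is lost overnight. The log path is just an
example choice; the 10 s gstat interval is what paces each pass:

#!/bin/sh
# Same commands, restructured as a sketch: explicit loop body,
# timestamp first, all output appended to one file. The log path
# is only an example; the 10 s gstat interval paces each pass.
while true
do
    date
    sysctl hw.regulator.5v0.min_uvolt
    vmstat
    swapinfo
    gstat -abd -I 10s
    tail -n 2 /var/log/messages
    netstat -m | grep "mbuf clusters"
    ps -auxd -w -w
done >> /var/log/build_watch.log 2>&1

I'll probably not get to experimenting with this for some time.

===
Mark Millard
marklmi at yahoo.com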