Silent hang in buildworld, was Re: Invoking -v for clang during buildworld
bob prohaska
fbsd at www.zefox.net
Mon Jan 18 01:50:09 UTC 2021
On Sun, Jan 17, 2021 at 12:30:51PM -0800, Mark Millard wrote:
>
>
> On 2021-Jan-17, at 09:40, bob prohaska <fbsd at www.zefox.net> wrote:
>
> > On Sat, Jan 16, 2021 at 03:04:04PM -0800, Mark Millard wrote:
> >>
> >> Other than -j1 style builds (or equivalent), one pretty much
> >> always needs to go looking around for a non-panic failure. It
> >> is uncommon for all the material to be together in the build
> >> log in such contexts.
> >
> > Running make cleandir twice and restarting -j4 buildworld brought
> > the process full circle: A silent hang, no debugger response, no
> > console warnings. That's what sent me down the rabbit hole of make
> > without clean, which worked at least once...
>
> Unfortunately, such a hang tends to mean that log files and such
> were not completely written out to media. We do not get to see
> evidence of the actual failure time frame, just somewhat before.
> (compiler/linker output and such can have the same issues of
> ending up with incomplete updates.)
>
> So, pretty much my notes are unlikely to be strongly tied to
> any solid evidence: more like alternatives to possibly explore
> that could be far off the mark.
>
> It is not clear if you were using:
>
> LDFLAGS.lld+= -Wl,--threads=1
>
> or some such to limit the multi-thread linking and its memory.
No, I wasn't trying to limit ld.lld thread number.
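For the record, a sketch of what that limit would look like in /etc/make.conf (the LDFLAGS.lld knob is as Mark quoted it; whether llvm11's lld still accepts --threads=1 is an assumption to verify against its documentation):

```shell
# /etc/make.conf (sketch): cap each lld invocation at one thread so a
# -j4 buildworld does not stack ~5 link threads per job on a 4-core box
LDFLAGS.lld+=	-Wl,--threads=1
```

With that in place, a plain `make -j4 buildworld` would at most run 4 single-threaded links in parallel instead of ~20 threads.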
> I'll note that if -j4 gets 4 links running in parallel it used
> to be each could have something like 5 threads active on a 4
> core machine, so 20 or so threads. (I've not checked llvm11's
> lld behavior. It might avoid such for defaults.)
>
> You have not reported any testing of -j2 or -j3 so far, just
> -j4 . (Another way of limiting memory use, power use, temperature,

> etc.)
>
Not recently, simply because it's so slow to build. On my "production"
armv7 machines running stable/12 I do use -j2. But, they get updated
only a couple times per year, when there's a security issue.
> You have not reported if your boot complained about the swap
> space size or if you have adjusted related settings to make
> non-default tradeoffs for swap management for these specific
> tests. I recommend not tailoring and using a swap size total
> that is somewhat under what starts to complain when there is
> no tailoring.
>
Both Pi2 and Pi3 have been complaining about too much swap
since I first got them. As near as I can tell, it's never been
a demonstrated problem thus far. Now, as things like LLVM
get bigger and bigger, it seems possible excess swap might
cause, or obscure, other problems. For the Pi2 I picked 2
GB from the old "2x physical RAM" rule.
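The arithmetic behind that choice, as a sketch (the 1 GiB figure is the Pi2's actual RAM; the 2x multiplier is just the old rule of thumb, not a recommendation):

```shell
# Sketch of the old "2x physical RAM" swap rule applied to a Pi2
ram_mb=1024                  # Raspberry Pi 2 has 1 GiB of RAM
swap_mb=$((ram_mb * 2))      # old rule of thumb: twice physical RAM
echo "swap: ${swap_mb} MB"   # matches the 2 GB partition in use
```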
>
> > The residue of the top screen shows
> >
> > last pid: 63377; load averages: 4.29, 4.18, 4.15 up 1+07:11:07 04:46:46
> > 60 processes: 5 running, 55 sleeping
> > CPU: 70.7% user, 0.0% nice, 26.5% system, 2.8% interrupt, 0.0% idle
> > Mem: 631M Active, 4932K Inact, 92M Laundry, 166M Wired, 98M Buf, 18M Free
> > Swap: 2048M Total, 119M Used, 1928M Free, 5% Inuse, 16K In, 3180K Out
> > packet_write_wait: Connection to 50.1.20.26 port 22: Broken pipe
> > bob at raspberrypi:~ $ ssh www.zefox.com RES STATE C TIME WCPU COMMAND
> > ssh: connect to host www.zefox.com port 22: Connection timed out86.17% c++
> > bob at raspberrypi:~ $ 1 99 0 277M 231M RUN 0 3:26 75.00% c++
> > 63245 bob 1 99 0 219M 173M CPU0 0 2:10 73.12% c++
> > 62690 bob 1 98 0 354M 234M RUN 3 9:42 47.06% c++
> > 63377 bob 1 30 0 5856K 2808K nanslp 0 0:00 3.13% gstat
> > 38283 bob 1 24 0 5208K 608K wait 2 2:00 0.61% sh
> > 995 bob 1 20 0 6668K 1184K CPU3 3 8:46 0.47% top
> > 990 bob 1 20 0 12M 1060K select 2 0:48 0.05% sshd
> > ....
>
> This does not look like ld was in use as of the last top
> display update's content. But the time between reasonable
> display updates is fairly long relative to CPU activity
> so it is only suggestive.
>
> > [apologies for typing over the remnants]
> >
> > I've put copies of the build and swap logs at
> >
> > http://www.zefox.net/~fbsd/rpi2/buildworld/
> >
> > The last vmstat entry (10 second repeat time) reports:
> > procs memory page disks faults cpu
> > r b w avm fre flt re pi po fr sr da0 sd0 in sy cs us sy id
> > 4 0 14 969160 91960 685 2 2 1 707 304 0 0 11418 692 1273 45 5 50
> >
> > Does that point to the memory exhaustion suggested earlier in the thread?
> > At this point /boot/loader.conf contains vm.pfault_oom_attempts="-1", but
> > that's a relic of long-ago attempts to use USB flash for root and swap.
> > Might removing it stimulate more warning messages?
> >
>
> vm.pfault_oom_attempts="-1" should only be used in contexts where
> running out of swap will not happen. Otherwise a deadlocked system
> can result if it does run out of swap. (Run-out has more senses than
> just the swap partition being fully used: other internal resources
> for keeping track of the swap can run into its limits.) I've no
> evidence that the -1 was actually a problem.
>
> I do not find any 1000+ ms/w or ms/r figures in swapscript.log .
> I found 3 examples a little under 405 (da0*), 3 between
> 340 and 345 (da0*), 4 in the 200s (da0*), and just under 60 in the 100s
> (da0*). It does not look to me like the recorded part had problems
> with the long latencies that you used to have happen.
>
> So I've not found any specific evidence about what led to the
> hangup. So my earlier questions/suggestions are basically
> arbitrary and I would not know what to do with any answers
> to the questions.
>
> The only notes that are fairly solid are about the hangup leading
> to there being some files that were likely incompletely updated
> (logs, compiler output files, etc.).
>
The notion that log files might be truncated didn't register
until you brought it up.
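Relatedly, the by-hand scan of swapscript.log for long latencies could be automated for future runs. A sketch, assuming the log holds gstat-style lines with ms/r in field 5, ms/w in field 8, and the device name in field 10 (those positions are a guess and would need checking against the actual log):

```shell
# Sketch: flag gstat samples on da0 with read or write latency over
# 200 ms (threshold and column positions are assumptions)
awk '$10 ~ /^da0/ && ($5 > 200 || $8 > 200) {
    print "high latency:", $0
}' swapscript.log
```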
The obvious things to try seem to be:
Disable vm.pfault_oom_attempts="-1"
Decrease swap partition size
Try using WITH_META_MODE
WITH_META_MODE seems relatively easy: just add -DWITH_META_MODE to the make
command line for buildworld and add filemon_load="YES" to /boot/loader.conf,
but I wonder about the burden it'll impose on CPU and disk space/throughput.
There's nothing to lose at this stage, but I'm all ears if there's a better
approach.
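Putting the steps above together as a sketch (paths and flags as discussed in the thread; worth verifying against build(7) before relying on it):

```shell
# 1. Load filemon so META_MODE can record per-target dependencies
echo 'filemon_load="YES"' >> /boot/loader.conf  # persists across reboots
kldload filemon                                 # or load it right now

# 2. Remove (or comment out) the old OOM setting in /boot/loader.conf:
#    vm.pfault_oom_attempts="-1"

# 3. Rebuild with META_MODE enabled
cd /usr/src
make -j4 -DWITH_META_MODE buildworld
```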
Many thanks for your patient good counsel!
bob prohaska
More information about the freebsd-arm mailing list