Re: Any known way to build devel/llvm* ( such as devel/llvm19 ) with --threads=1 for its linker activity during the build?

From: Mark Millard <marklmi_at_yahoo.com>
Date: Mon, 05 Aug 2024 09:09:24 UTC
On Aug 5, 2024, at 00:44, meloun.michal@gmail.com wrote:

> On 05.08.2024 9:27, Mark Millard wrote:
>> On Aug 5, 2024, at 00:15, Mark Millard <marklmi@yahoo.com> wrote:
>>> On Aug 4, 2024, at 22:53, Michal Meloun <meloun.michal@gmail.com> wrote:
>>> 
>>>> On 04.08.2024 23:31, Mark Millard wrote:
>>>>> On Aug 3, 2024, at 23:07, Mark Millard <marklmi@yahoo.com> wrote:
>>>>>> My recent attempts to build devel/llvm18 and devel/llvm19 in an armv7 context (native or aarch64-as-armv7) have had /usr/bin/ld failures that stop the build and report as:
>>>>>> 
>>>>>> LLVM ERROR: out of memory
>>>>>> Allocation failed
>>>>>> 
>>>>>> (no system OOM activity or notices, so just a process size/fragmentation issue, or so I would expect).
>>>>>> 
>>>>>> On native armv7 I also had rust 1.79.0 fail that way --but aarch64-as-armv7 built it okay.
>>>>>> 
>>>>>> I'm curious if --threads=1 use for the linker might allow the devel/llvm* builds to complete at this point. Similarly for rust. (top showed that the ld activity was multi-threaded.)
>>>>>> 
>>>>>> Note: The structure of the poudriere-devel based native build attempts is historical and used to work. Similarly for the aarch64-as-armv7 based build attempts. For now I'd just be exploring changes that might allow much of my historical overall structure to keep working. But I expect that things are just growing to the point that building is becoming problematic with process address spaces bounded by a limit somewhat under 4 GiBytes.
>>>>>> 
>>>>>> 
>>>>>> Native armv7 was a 2 GiByte OrangePi+ 2ed (4 cores) that had
>>>>>> at boot time:
>>>>>> 
>>>>>> AVAIL_RAM+SWAP == 1958Mi+3685Mi == 5643Mi
>>>>>> 
>>>>>> and later had "Max(imum)Obs(erved)" figures:
>>>>>> 
>>>>>> Mem: . . .,
>>>>>> 1728Mi MaxObsActive, 275192Ki MaxObsWired, 1952Mi MaxObs(Act+Wir+Lndry)
>>>>>> 
>>>>>> Swap: 3685Mi Total, . . .,
>>>>>> 1535Mi MaxObsUsed, 3177Mi MaxObs(Act+Lndry+SwapUsed),
>>>>>> 3398Mi MaxObs(A+Wir+L+SU), 3449Mi (A+W+L+SU+InAct)
>>>>>> 
>>>>>> 
>>>>>> The aarch64-as-armv7 was a Win DevKit 2023 that has 8 cores and:
>>>>>> 
>>>>>> AVAIL_RAM+SWAP == 31311Mi+120831Mi == 152142Mi
>>>>>> 
>>>>>> So lots of 4 GiByte or smaller processes would fit.
>>>>>> 
>>>>> Absent finding a way to get --threads=1 to be what is used, I
>>>>> made the following crude change to test with, built it, installed
>>>>> it into the armv7 directory tree used for aarch64-as-armv7, and
>>>>> then started an aarch64-as-armv7 test build of devel/llvm19
>>>>> to see what the consequences are (leading whitespace details
>>>>> might not be preserved below):
>>>>> # git -C /usr/main-src/ diff contrib/llvm-project/
>>>>> diff --git a/contrib/llvm-project/lld/ELF/Driver.cpp b/contrib/llvm-project/lld/ELF/Driver.cpp
>>>>> index 8b2c32b15348..299daf7dd6fa 100644
>>>>> --- a/contrib/llvm-project/lld/ELF/Driver.cpp
>>>>> +++ b/contrib/llvm-project/lld/ELF/Driver.cpp
>>>>> @@ -1587,6 +1587,9 @@ static void readConfigs(opt::InputArgList &args) {
>>>>>              arg->getValue() + "'");
>>>>>      parallel::strategy = hardware_concurrency(threads);
>>>>>      config->thinLTOJobs = v;
>>>>> +  } else if (sizeof(void*) <= 4) {
>>>>> +    log("set maximum concurrency to 1, specify --threads= to change");
>>>>> +    parallel::strategy = hardware_concurrency(1);
>>>>>    } else if (parallel::strategy.compute_thread_count() > 16) {
>>>>>      log("set maximum concurrency to 16, specify --threads= to change");
>>>>>      parallel::strategy = hardware_concurrency(16);
>>>>> Basically, if the process address space has to be "small", avoid
>>>>> any default memory use tradeoffs that multi-threading the linker
>>>>> might involve --even if that means taking more time.
>>>>> We will see if:
>>>>> [00:00:33] [07] [00:00:00] Building   devel/llvm19@default | llvm19-19.1.0.r1
>>>>> still fails to build as armv7 vs. if the change leads it to
>>>>> manage to build as armv7.
>>>>> ===
>>>>> Mark Millard
>>>>> marklmi at yahoo.com
>>>> 
>>>> I can build llvm18 and rust 1.79 on native armv7 without problems - on a Tegra TK1, without poudriere, and on a UFS filesystem. IMHO poudriere is unusable on 32-bit systems.
>>> 
>>> On the Windows DevKit 2023 in an armv7 chroot I can build rust 1.79.0
>>> as well. I've not tried a recent devel/llvm18 in that context,
>>> just devel/llvm19 . An armv7 process in this context can use
>>> about 1 GiByte more memory space than on the OrangePi+ 2ed. (See
>>> later program example outputs.)
>>> 
>>> devel/llvm18-18.1.7 had built fine some time back. So I'm
>>> trying the current 18.1.8_1 now on the Windows DevKit 2023.
>>> But this is with --threads=1 forced for lld: the same context
>>> as the recent devel/llvm19 exploration.
>>> 
>>> Note: UFS context, not ZFS.
>>> 
>>> How does the Tegra TK1 context compare for the following
>>> program and the example command?
>>> 
>>> OrangePi+ 2ed (so: armv7 native with 2 GiBytes of RAM):
>>> 
>>> # more process_size.c
>>> // cc -std=c11 process_size.c
>>> // ./a.out 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 134217728 67108864 33554432 16777216 8388608 4194304 2097152 1048576
>>> 
>>> #include <malloc.h>
>>> #include <errno.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <limits.h>
>>> 
>>> int main(int argc, char *argv[])
>>> {
>>>   size_t totalsize = 0u;
>>>   for (int i = 1; i < argc; ++i) {
>>>     errno = 0;
>>>     size_t size = strtoul(argv[i], NULL, 0);
>>>     void *p = malloc(size);
>>>     if (p) totalsize += size;
>>>     printf("malloc(%zu) = %p [errno = %d]\n", size, p, errno);
>>>   }
>>>   printf("approx. total, a lower bound: %zu MiBytes\n", totalsize/1024u/1024u);
>>>   return 0;
>>> }
>>> # cc -std=c11 process_size.c
>>> # ./a.out 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 134217728 67108864 33554432 16777216 8388608 4194304 2097152 1048576
>>> malloc(268435456) = 0x20800180 [errno = 0]
>>> malloc(268435456) = 0x30801980 [errno = 0]
>>> malloc(268435456) = 0x40802640 [errno = 0]
>>> malloc(268435456) = 0x50803600 [errno = 0]
>>> malloc(268435456) = 0x608048c0 [errno = 0]
>>> malloc(268435456) = 0x70805140 [errno = 0]
>>> malloc(268435456) = 0x80806580 [errno = 0]
>>> malloc(268435456) = 0x90807780 [errno = 0]
>>> malloc(268435456) = 0xa0808700 [errno = 0]
>>> malloc(268435456) = 0x0 [errno = 12]
>>> malloc(268435456) = 0x0 [errno = 12]
>>> malloc(268435456) = 0x0 [errno = 12]
>>> malloc(268435456) = 0x0 [errno = 12]
>>> malloc(134217728) = 0xb0809a00 [errno = 0]
>>> malloc(67108864) = 0x0 [errno = 12]
>>> malloc(33554432) = 0xb880a5c0 [errno = 0]
>>> malloc(16777216) = 0xba80b0c0 [errno = 0]
>>> malloc(8388608) = 0x0 [errno = 12]
>>> malloc(4194304) = 0x0 [errno = 12]
>>> malloc(2097152) = 0xbb80c180 [errno = 0]
>>> malloc(1048576) = 0xbba0de80 [errno = 0]
>>> approx. total, a lower bound: 2483 MiBytes
>>> 
>>> 
>>> Same program with same command on Windows DevKit 2023 in
>>> armv7 chroot (aarch64-as-armv7 with 32 GiBytes of RAM):
>>> 
>>> # ./a.out 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 134217728 67108864 33554432 16777216 8388608 4194304 2097152 1048576
>>> malloc(268435456) = 0x20800b00 [errno = 0]
>>> malloc(268435456) = 0x30801600 [errno = 0]
>>> malloc(268435456) = 0x40802cc0 [errno = 0]
>>> malloc(268435456) = 0x50803c80 [errno = 0]
>>> malloc(268435456) = 0x608042c0 [errno = 0]
>>> malloc(268435456) = 0x70805b00 [errno = 0]
>>> malloc(268435456) = 0x808063c0 [errno = 0]
>>> malloc(268435456) = 0x90807580 [errno = 0]
>>> malloc(268435456) = 0xa0808b40 [errno = 0]
>>> malloc(268435456) = 0xb0809980 [errno = 0]
>>> malloc(268435456) = 0xc080abc0 [errno = 0]
>>> malloc(268435456) = 0xd080ba00 [errno = 0]
>>> malloc(268435456) = 0xe080cc80 [errno = 0]
>>> malloc(134217728) = 0xf080d700 [errno = 0]
>>> malloc(67108864) = 0x0 [errno = 12]
>>> malloc(33554432) = 0xf880eb40 [errno = 0]
>>> malloc(16777216) = 0xfa80fc00 [errno = 0]
>>> malloc(8388608) = 0x0 [errno = 12]
>>> malloc(4194304) = 0xfb810840 [errno = 0]
>>> malloc(2097152) = 0xfbc117c0 [errno = 0]
>>> malloc(1048576) = 0xfbe12940 [errno = 0]
>>> approx. total, a lower bound: 3511 MiBytes
>>> 
>>> 
>>> Note: If the Tegra TK1 in question has more than
>>> 4 GiBytes of RAM, the command line should explore
>>> more than the example that I used.
>>> 
>>> 
>>> Note: I've used the program for other patterns of
>>> allocations. That is why it is not just a fixed
>>> exploration algorithm.
>>> 
>>> 
>>> As for poudriere-devel, I find it useful, even on
>>> the OrangePi+ 2ed. But mostly that is a rare run
>>> that checks how well things hold up in the
>>> 2 GiByte-of-RAM context (with SWAP notably large
>>> for the size of RAM). In other words, it monitors
>>> growth in a context that will break sooner than
>>> my other contexts generally would. The tests take
>>> days overall, most of the time going to rust and
>>> an llvm* .
>>> 
>>> Historically I've been able to have 2 builders,
>>> each with MAKE_JOBS_NUMBER_LIMIT=2 , so all 4
>>> cores in use building lang/rust and a devel/llvm*
>>> at the same time successfully in poudriere-devel
>>> on the 2 GiByte OrangePi+ 2ed. (This was before
>>> the recent build failures and the resulting
>>> --threads=1 experiments.)
>> I should have noted that my normal devel/llvm* builds
>> on aarch64 and armv7 avoid building: BE_AMDGPU and
>> MLIR . They also target BE_NATIVE instead of
>> BE_STANDARD . (aarch64 BE_NATIVE includes armv7 as
>> well.)
>> ===
>> Mark Millard
>> marklmi at yahoo.com
> Tegra has 4 Cortex-A15 cores and 2 GB of RAM.

OrangePi+ 2ed: Cortex-A7 with 4 cores and 2 GiBytes of RAM.

I wonder if the 2483 MiBytes would end up being about the
same on the Tegra variation indicated.

> All ports are built with default options. The only non-standard item is the swap size -> I have 16GB of swap on a swap partition on the SSD.

Wow, 16 GiBytes of swap space for 2 GiBytes of RAM. I guess
that when the swap is added you get a pair of notices of the
form:

QUOTE
warning: total configured swap (. . . pages) exceeds maximum recommended amount (. . . pages).
warning: increase kern.maxswzone or reduce amount of swap.
END QUOTE

with a rather large difference between the two ". . ." figures.

Do you make other adjustments to deal with the reported potential
mistuning? If I understand right, such a configuration involves
tradeoffs in the kernel's internal memory handling.
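
For reference, and only as a sketch of where the knob lives (not
tuning advice): kern.maxswzone is a boot-time loader tunable rather
than something to change on a running system, so checking and
adjusting it would look roughly like:

# sysctl kern.maxswzone

to see the current swap-metadata reservation (in bytes, if I have
the units right), and then a line in /boot/loader.conf of the form:

kern.maxswzone=". . ."

with the ". . ." being a byte count sized for the configured swap.
I've not needed to adjust it myself.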

> But I guess that's not important in this case.

At least for my context, it appears that memory allocations
are failing to find a big enough free area inside the
process's address space --without running out of system
RAM+SWAP space overall.
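
To illustrate the distinction, here is a generic sketch of
address-space fragmentation (the file name and the constants are
made up for the example, and whether the final malloc fails depends
on the allocator and on how much address space the process actually
has; it is not a claim about lld's actual allocation pattern):

// cc -std=c11 fragmentation_demo.c
// Illustrative only: fill much of a 32-bit address space, free
// every other block, then ask for one region bigger than any
// single hole that the frees left behind.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (64u * 1024u * 1024u)  /* 64 MiBytes per block */
#define COUNT 48                     /* aims at about 3 GiBytes */

int main(void)
{
    void *blocks[COUNT] = { 0 };
    size_t held = 0u;

    /* Grab as many 64 MiByte blocks as the address space allows. */
    for (int i = 0; i < COUNT; ++i) {
        blocks[i] = malloc(CHUNK);
        if (blocks[i]) held += CHUNK;
    }

    /* Free every other block: the bytes come back, but only as
       scattered 64 MiByte holes, not as one contiguous region. */
    size_t freed = 0u;
    for (int i = 0; i < COUNT; i += 2) {
        if (blocks[i]) { free(blocks[i]); blocks[i] = NULL; freed += CHUNK; }
    }

    /* A single request for half of what was just freed can still
       fail in a small address space: no hole is big enough by
       itself, and little fresh address space is left. */
    errno = 0;
    void *big = malloc(freed / 2u);
    printf("held %zu MiBytes, freed %zu MiBytes, malloc(%zu MiBytes) = %p [errno = %d]\n",
           held / 1024u / 1024u, freed / 1024u / 1024u,
           freed / 2u / 1024u / 1024u, big, errno);

    free(big);
    return 0;
}

On a 64-bit system the final request would normally just succeed;
the point is only that in an under-4-GiByte address space a single
large request can fail even though the total free space would cover
it, which matches the "LLVM ERROR: out of memory" happening without
any system-wide RAM+SWAP shortage.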

For the OrangePi+ 2ed ( and devel/llvm18 18.1.7 ) it was
during the earlier linker run for:

FAILED: bin/lli-child-target 
. . .
LLVM ERROR: out of memory
Allocation failed

That much finished just fine on the Windows DevKit
2023 used via an armv7 jail ( devel/llvm18 18.1.8_1 ).
The failure point was in a later link ( matching what
I saw via devel/llvm19 ).

> I just started a build of llvm19 - but it takes a few hours to complete...

Probably fewer hours than on the OrangePi+ 2ed but
more than on the Windows DevKit 2023 (if they were
completing, anyway).

===
Mark Millard
marklmi at yahoo.com