Re: Any known way to build devel/llvm* ( such as devel/llvm19 ) with --threads=1 for its linker activity during the build?

From: <meloun.michal_at_gmail.com>
Date: Mon, 05 Aug 2024 15:57:12 UTC

On 05.08.2024 11:09, Mark Millard wrote:
> On Aug 5, 2024, at 00:44, meloun.michal@gmail.com wrote:
> 
>> On 05.08.2024 9:27, Mark Millard wrote:
>>> On Aug 5, 2024, at 00:15, Mark Millard <marklmi@yahoo.com> wrote:
>>>> On Aug 4, 2024, at 22:53, Michal Meloun <meloun.michal@gmail.com> wrote:
>>>>
>>>>> On 04.08.2024 23:31, Mark Millard wrote:
>>>>>> On Aug 3, 2024, at 23:07, Mark Millard <marklmi@yahoo.com> wrote:
>>>>>>> My recent attempts to build devel/llvm18 and devel/llvm19 in an armv7 context (native or aarch64-as-armv7) have had /usr/bin/ld failures that stop the build and report as:
>>>>>>>
>>>>>>> LLVM ERROR: out of memory
>>>>>>> Allocation failed
>>>>>>>
>>>>>>> (no system OOM activity or notices, so just a process size/fragmentation issue, or so I would expect).
>>>>>>>
>>>>>>> On native armv7 I also had rust 1.79.0 fail that way --but aarch64-as-armv7 built it okay.
>>>>>>>
>>>>>>> I'm curious if --threads=1 use for the linker might allow the devel/llvm* builds to complete at this point. Similarly for rust. (top showed that the ld activity was multi-threaded.)
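>>>>>>>
>>>>>>> (A sketch of one way that might force that for a ports build --untested on my side, and assuming /usr/bin/ld is lld and that LDFLAGS reaches the failing link steps-- would be an addition to the builder's /etc/make.conf:
>>>>>>>
>>>>>>> # lld spells the option --threads=N; -Wl, passes it through the cc/c++ driver.
>>>>>>> LDFLAGS+= -Wl,--threads=1
>>>>>>>
>>>>>>> Not every port's link steps honor LDFLAGS, though, so this may not cover everything.)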
>>>>>>>
>>>>>>> Note: The structure of the poudriere-devel based native build attempts is historical and it used to work. Similarly for the aarch64-as-armv7 based build attempts. For now I'd just be exploring changes that might allow much of my historical overall structure to still work. But I expect that things are just growing to the point where building is starting to be problematic with process address spaces that are bounded by a limit somewhat under 4 GiBytes.
>>>>>>>
>>>>>>>
>>>>>>> Native armv7 was a 2 GiByte OrangePi+ 2ed (4 cores) that had
>>>>>>> at boot time:
>>>>>>>
>>>>>>> AVAIL_RAM+SWAP == 1958Mi+3685Mi == 5643Mi
>>>>>>>
>>>>>>> and later had "Max(imum)Obs(erved)" figures:
>>>>>>>
>>>>>>> Mem: . . .,
>>>>>>> 1728Mi MaxObsActive, 275192Ki MaxObsWired, 1952Mi MaxObs(Act+Wir+Lndry)
>>>>>>>
>>>>>>> Swap: 3685Mi Total, . . .,
>>>>>>> 1535Mi MaxObsUsed, 3177Mi MaxObs(Act+Lndry+SwapUsed),
>>>>>>> 3398Mi MaxObs(A+Wir+L+SU), 3449Mi (A+W+L+SU+InAct)
>>>>>>>
>>>>>>>
>>>>>>> The aarch64-as-armv7 was a Windows DevKit 2023 that has 8 cores and:
>>>>>>>
>>>>>>> AVAIL_RAM+SWAP == 31311Mi+120831Mi == 152142Mi
>>>>>>>
>>>>>>> So lots of 4 GiByte or smaller processes would fit.
>>>>>>>
>>>>>> Absent finding a way to get --threads=1 to be what is used, I
>>>>>> made the following crude way to test, built it, installed it
>>>>>> in the armv7 directory tree used for aarch64-as-armv7, and
>>>>>> then started an aarch64-as-armv7 test of building devel/llvm19
>>>>>> to see what the consequences are (leading whitespace details
>>>>>> might not be preserved):
>>>>>> # git -C /usr/main-src/ diff contrib/llvm-project/
>>>>>> diff --git a/contrib/llvm-project/lld/ELF/Driver.cpp b/contrib/llvm-project/lld/ELF/Driver.cpp
>>>>>> index 8b2c32b15348..299daf7dd6fa 100644
>>>>>> --- a/contrib/llvm-project/lld/ELF/Driver.cpp
>>>>>> +++ b/contrib/llvm-project/lld/ELF/Driver.cpp
>>>>>> @@ -1587,6 +1587,9 @@ static void readConfigs(opt::InputArgList &args) {
>>>>>>              arg->getValue() + "'");
>>>>>>      parallel::strategy = hardware_concurrency(threads);
>>>>>>      config->thinLTOJobs = v;
>>>>>> +  } else if (sizeof(void*) <= 4) {
>>>>>> +    log("set maximum concurrency to 1, specify --threads= to change");
>>>>>> +    parallel::strategy = hardware_concurrency(1);
>>>>>>    } else if (parallel::strategy.compute_thread_count() > 16) {
>>>>>>      log("set maximum concurrency to 16, specify --threads= to change");
>>>>>>      parallel::strategy = hardware_concurrency(16);
>>>>>> Basically, if the process address space has to be "small", avoid
>>>>>> any default memory use tradeoffs that multi-threading the linker
>>>>>> might involve --even if that means taking more time.
>>>>>> We will see if:
>>>>>> [00:00:33] [07] [00:00:00] Building   devel/llvm19@default | llvm19-19.1.0.r1
>>>>>> still fails to build as armv7 vs. if the change leads it to
>>>>>> manage to build as armv7.
>>>>>> ===
>>>>>> Mark Millard
>>>>>> marklmi at yahoo.com
>>>>>
>>>>> I can build llvm18 and rust 1.79 on native armv7 without problems - on a Tegra TK1, without poudriere and on the UFS filesystem. IMHO poudriere is unusable on 32-bit systems.
>>>>
>>>> On Windows DevKit 2023 in a armv7 chroot I can build rust 1.79.0
>>>> as well. I've not tried a recent devel/llvm18 in that context,
>>>> just devel/llvm19 . An armv7 process in this context can use
>>>> about 1 GiByte more memory space than on the OrangePi+ 2ed. (See
>>>> later program example outputs.)
>>>>
>>>> Previously, devel/llvm18-18.1.7 had built fine some time back.
>>>> So I'm trying the modern 18.1.8_1 now on the Windows DevKit 2023.
>>>> But this is with forcing of --threads=1 for lld: same context as
>>>> the recent devel/llvm19 exploration.
>>>>
>>>> Note: UFS context, not ZFS.
>>>>
>>>> How does the Tegra TK1 context compare for the following
>>>> program and the example command?
>>>>
>>>> OrangePi+ 2ed (so: armv7 native with 2 GiBytes of RAM):
>>>>
>>>> # more process_size.c
>>>> // cc -std=c11 process_size.c
>>>> // ./a.out 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 134217728 67108864 33554432 16777216 8388608 4194304 2097152 1048576
>>>>
>>>> #include <malloc.h>
>>>> #include <errno.h>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <limits.h>
>>>>
>>>> int main(int argc, char *argv[])
>>>> {
>>>>     size_t totalsize = 0u;
>>>>     // Attempt each requested allocation size, in command-line order,
>>>>     // and report the resulting address (or the failure) for each.
>>>>     for (int i = 1; i < argc; ++i) {
>>>>         errno = 0;
>>>>         size_t size = strtoul(argv[i], NULL, 0);
>>>>         void *p = malloc(size);
>>>>         if (p) totalsize += size;
>>>>         printf("malloc(%zu) = %p [errno = %d]\n", size, p, errno);
>>>>     }
>>>>     // Only successful allocations are summed, hence "a lower bound".
>>>>     printf("approx. total, a lower bound: %zu MiBytes\n", totalsize/1024u/1024u);
>>>>     return 0;
>>>> }
>>>> # cc -std=c11 process_size.c
>>>> # ./a.out 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 134217728 67108864 33554432 16777216 8388608 4194304 2097152 1048576
>>>> malloc(268435456) = 0x20800180 [errno = 0]
>>>> malloc(268435456) = 0x30801980 [errno = 0]
>>>> malloc(268435456) = 0x40802640 [errno = 0]
>>>> malloc(268435456) = 0x50803600 [errno = 0]
>>>> malloc(268435456) = 0x608048c0 [errno = 0]
>>>> malloc(268435456) = 0x70805140 [errno = 0]
>>>> malloc(268435456) = 0x80806580 [errno = 0]
>>>> malloc(268435456) = 0x90807780 [errno = 0]
>>>> malloc(268435456) = 0xa0808700 [errno = 0]
>>>> malloc(268435456) = 0x0 [errno = 12]
>>>> malloc(268435456) = 0x0 [errno = 12]
>>>> malloc(268435456) = 0x0 [errno = 12]
>>>> malloc(268435456) = 0x0 [errno = 12]
>>>> malloc(134217728) = 0xb0809a00 [errno = 0]
>>>> malloc(67108864) = 0x0 [errno = 12]
>>>> malloc(33554432) = 0xb880a5c0 [errno = 0]
>>>> malloc(16777216) = 0xba80b0c0 [errno = 0]
>>>> malloc(8388608) = 0x0 [errno = 12]
>>>> malloc(4194304) = 0x0 [errno = 12]
>>>> malloc(2097152) = 0xbb80c180 [errno = 0]
>>>> malloc(1048576) = 0xbba0de80 [errno = 0]
>>>> approx. total, a lower bound: 2483 MiBytes
>>>>
>>>>
>>>> Same program with same command on Windows DevKit 2023 in
>>>> armv7 chroot (aarch64-as-armv7 with 32 GiBytes of RAM):
>>>>
>>>> # ./a.out 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 268435456 134217728 67108864 33554432 16777216 8388608 4194304 2097152 1048576
>>>> malloc(268435456) = 0x20800b00 [errno = 0]
>>>> malloc(268435456) = 0x30801600 [errno = 0]
>>>> malloc(268435456) = 0x40802cc0 [errno = 0]
>>>> malloc(268435456) = 0x50803c80 [errno = 0]
>>>> malloc(268435456) = 0x608042c0 [errno = 0]
>>>> malloc(268435456) = 0x70805b00 [errno = 0]
>>>> malloc(268435456) = 0x808063c0 [errno = 0]
>>>> malloc(268435456) = 0x90807580 [errno = 0]
>>>> malloc(268435456) = 0xa0808b40 [errno = 0]
>>>> malloc(268435456) = 0xb0809980 [errno = 0]
>>>> malloc(268435456) = 0xc080abc0 [errno = 0]
>>>> malloc(268435456) = 0xd080ba00 [errno = 0]
>>>> malloc(268435456) = 0xe080cc80 [errno = 0]
>>>> malloc(134217728) = 0xf080d700 [errno = 0]
>>>> malloc(67108864) = 0x0 [errno = 12]
>>>> malloc(33554432) = 0xf880eb40 [errno = 0]
>>>> malloc(16777216) = 0xfa80fc00 [errno = 0]
>>>> malloc(8388608) = 0x0 [errno = 12]
>>>> malloc(4194304) = 0xfb810840 [errno = 0]
>>>> malloc(2097152) = 0xfbc117c0 [errno = 0]
>>>> malloc(1048576) = 0xfbe12940 [errno = 0]
>>>> approx. total, a lower bound: 3511 MiBytes
>>>>
>>>>
>>>> Note: If the Tegra TK1 in question has more than
>>>> 4 GiBytes of RAM, the command line should explore
>>>> more than the example that I used.
>>>>
>>>>
>>>> Note: I've used the program for other patterns of
>>>> allocations. That is why it is not just a fixed
>>>> exploration algorithm.
>>>>
>>>>
>>>> As for poudriere-devel, I find it useful, even on
>>>> the OrangePi+ 2ed. But mostly that is a rare run
>>>> that is checking on how well the handling goes for
>>>> the 2 GiByte of RAM context (with notable SWAP for
>>>> the size of RAM). In other words, monitoring the
>>>> growth in a context that will break sooner than
>>>> my other contexts generally would. The tests take
>>>> days overall, most of the time being for rust and
>>>> a llvm* .
>>>>
>>>> Historically I've been able to have 2 builders,
>>>> each with MAKE_JOBS_NUMBER_LIMIT=2 , so all 4
>>>> cores in use building lang/rust and a devel/llvm*
>>>> at the same time successfully in poudriere-devel
>>>> on the 2 GiByte OrangePi+ 2ed. (This was before
>>>> recently imposing --threads=1 experiments,
>>>> given the recent build failures.)
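>>>>
>>>> (For reference, a sketch of where that knob lives: it is
>>>> the standard ports variable, set in the jail-wide
>>>> make.conf that poudriere reads, e.g.
>>>> /usr/local/etc/poudriere.d/make.conf :
>>>>
>>>> # Cap each port build at 2 parallel make jobs.
>>>> MAKE_JOBS_NUMBER_LIMIT=2
>>>>
>>>> With 2 builders that keeps all 4 cores busy without
>>>> overcommitting memory.)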
>>> I should have noted that my normal devel/llvm* builds
>>> on aarch64 and armv7 avoid building: BE_AMDGPU and
>>> MLIR . They also target BE_NATIVE instead of
>>> BE_STANDARD . (aarch64 BE_NATIVE includes armv7 as
>>> well.)
>>> ===
>>> Mark Millard
>>> marklmi at yahoo.com
>> Tegra has 4 Cortex-A15 cores and 2 GB of RAM.
> 
> OrangePi+ 2ed: Cortex-A7 with 4 cores and 2 GiBytes of RAM.
> 
> I wonder if the 2483 MiBytes would end up being about the
> same on the Tegra variation indicated.

Yep, it must be about the same. The 2 GB/2 GB split between userland 
and the kernel is defined by the HW. Only the size of the shared 
libraries may affect (lower) the usable user address space for a 
given program.
> 
>> All ports are built with default options. The only non-standard item is the swap size -> I have 16GB of swap on a swap partition on the SSD.
> 
> Wow, 16 GiBytes of swap space for 2 GiBytes of RAM. I guess
> that when the swap is added you get a notice pair of the
> form:
> 
> QUOTE
> warning: total configured swap (. . . pages) exceeds maximum recommended amount (. . . pages).
> warning: increase kern.maxswzone or reduce amount of swap.
> END QUOTE
> 
> with a rather large difference between the two ". . ." figures.
> 
> Do you make other adjustments to deal with the otherwise-reported
> potential mistuning? It appears to involve tradeoffs in the kernel's
> internal memory handling, if I understand right.
The above message should be interpreted as: warning, the kernel may, 
in a worst and rare case, need to allocate additional memory when 
swapping some object (memory) out. This may lead to a deadlock/panic. 
But again, even if this warning is valid, the resulting 
deadlock/panic is very rare. I have never seen it in the past many 
years...
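
(For completeness: kern.maxswzone is a loader tunable, so if one did 
want to follow the warning's advice it would be a /boot/loader.conf 
entry, e.g. --with a made-up example value--

# Hypothetical sketch only; size per the figures the warning reports.
kern.maxswzone="67108864"

But as said above, I have never needed to touch it.)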



>> But I guess that's not important in this case.
> 
> At least for my context, it appears that memory allocations
> are failing to find a big enough free area inside the
> process's address space --without running out of system
> RAM+SWAP space overall.
> 
> For the OrangePi+ 2ed ( and devel/llvm18 18.1.7 ) it was
> during the earlier linker run for:
> 
> FAILED: bin/lli-child-target
> . . .
> LLVM ERROR: out of memory
> Allocation failed
> 
> That much finished just fine on the Windows DevKit
> 2023 used via an armv7 jail ( devel/llvm18 18.1.8_1 ).
> The failure point was in a later link ( matching what
> I saw via devel/llvm19 ).
> 
>> I just started a build of llvm19 - but it takes a few hours to complete...
> 
> Probably fewer hours than on the OrangePi+ 2ed but
> more than on the Windows DevKit 2023 (if they were
> completing, anyway).
> 

The native build is still running (at 60%, in fact); the arm32 jail 
build has been stopped on my Honeycomb (killed by the OOM killer). 
Unfortunately this is an old problem and is common on all platforms. 
The current LLVM cannot be built without additional tricks on 
machines that have less than 2 GB of (RAM + swap) per core...
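
(As a worked example of that rule of thumb: a default -j4 build on a 
4-core machine wants at least 4 x 2 GB = 8 GB of RAM + swap. So 2 GB 
of RAM plus 16 GB of swap clears it easily, while 2 GB of RAM plus 
about 3.7 GB of swap, as on the OrangePi+ 2ed, falls short.)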

Michal