Re: Buildworld Taking Very Long Time

From: Edward Sanford Sutton, III <mirror176_at_hotmail.com>
Date: Sun, 30 Jun 2024 18:48:26 UTC
On 6/30/24 08:11, Tim Daneliuk wrote:
> We do a nightly pull of -STABLE and then a buildworld/buildkernel

Which branch: stable/14, stable/13, and/or an older stable branch that is no longer supported?

> The world and kernel build typically has been taking about 45-60min on
> one of our quad core i5 machines.

i5 narrows it down to 19(?) generations of CPUs; being quad core cuts it
down to about 9. CPU performance and features can vary a lot across those
generations and models.

> For no obvious reason, it's now taking dozens of hours.  Any insight on
> why this might be happening would be appreciated.

   My stable/14 system uses meta mode and ccache. If I rerun a build after
one completes, while the filesystem data is still cached in RAM, it
finishes within minutes on my i7-3820 using only 6 CPU cores, 32GB RAM,
and a single magnetic hard drive. Running an update after clang has been
updated takes hours (not tens of hours), and I recall a decent amount of
time going to openssl too. A full build after cleaning out the work
directory should still take under 10 hours. My last timing was with 4
cores on otherwise the same hardware, given -j16, and took less than 6
hours. It was long enough ago that I don't remember whether it was timed
on /13 or /12 (I delayed the 14 update for a while, so it may be recent
enough to fall in that window, but it was not a 14 build). It has been a
number of days since the last clang update in /14, but openssl did just
get updated; that still likely doesn't explain a build time going from
1 hour to more than a day.

   More details about the build hardware and software setup are likely
needed:
   The specific CPU, preferably total RAM and its speed, and what storage
media is used (magnetic/SSD, models, array configuration if RAID).
   What filesystem is on the drives. Any build customizations (ccache,
WITH_META_MODE, altered compiler flags, number of make jobs). What
version of the OS. If PORTS_MODULES is defined, it can add entire
additional compilers, among other things, from the ports tree to the
build process, depending on the state of the tree and of the currently
installed packages.
   Have you observed any unusual stats, like lower CPU usage or higher
disk I/O and % busy, compared to a typical run? If you don't have
specific stats, you could glance at top, systat, etc. to start getting
an idea.
   Do you know which steps of the world/kernel build are taking long? You
can split buildworld and buildkernel into separate commands and time them
separately. `make -s buildworld` suppresses a lot of output, which makes
the stage messages easier to see, and the entire build can be logged. I
don't know how offhand, but I imagine there is a way to log it with
timestamps throughout.
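One way to get those timestamps is a minimal sketch in POSIX sh, assuming nothing beyond date(1). It defines a small filter that prefixes each line it reads with the wall-clock time. The function name `stamp` is hypothetical, not an existing tool. Piping `make -s` output through it shows roughly when each stage started, and `time` gives the total for each step.

```sh
#!/bin/sh
# stamp: hypothetical helper that prefixes every line read from stdin
# with the current time (HH:MM:SS). It runs date(1) once per line, so it
# is meant for the reduced output of `make -s`, not a fully verbose build.
stamp() {
    # IFS= and -r keep each line byte-for-byte; the || test also handles
    # a final line that has no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
    done
}

# Example usage (log file paths are arbitrary):
#   cd /usr/src
#   time make -s buildworld 2>&1 | stamp | tee /tmp/buildworld.log
#   time make -s buildkernel 2>&1 | stamp | tee /tmp/buildkernel.log
```

With the two steps logged separately, the gaps between consecutive stage banners show where the extra hours are going.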
   Using magnetic media, ZFS with compression, ccache, and leaving
atime=on can lead to horrendous disk performance. I 'think' atime causes
fragmentation of file metadata (even listing large directory contents
takes forever), but even if not, you still get 1 write for every file
read. Disabling it, though, likely causes ccache to clear the cache as a
first-in first-out sequence instead of removing what hasn't been used in
the longest time. devel/ccache on a compressed dataset doesn't track
sizes properly: from what I read, ZFS reports new cache entries as 0
bytes instead of returning their uncompressed size (the compressed size
can't be returned until the compression algorithm has finished). As a
result, the cache size shown by `ccache -s` can exceed the maximum cache
size without triggering automatic cache cleanups. Manually running
`ccache -c` brings the cache back within limits, which can make for a
much smaller cache and massive performance improvements if the file
count was getting out of control. Very poorly performing ccache storage
even exposes questionable calls to ccache from ports tree operations, as
basic non-compiling operations become very slow because of ccache disk
I/O.
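For a rough sanity check, here is a minimal sketch in POSIX sh. The `ccache_census` function is hypothetical. It counts the files in a ccache directory and asks du(1) how much space they occupy, so the numbers can be compared with `ccache -s` before and after `ccache -c`. The cache location is an assumption you must supply, either as the argument or via $CCACHE_DIR. On a compressed ZFS dataset, du reports allocated (compressed) space, so treat the size as approximate. The file count is the more telling number.

```sh
#!/bin/sh
# ccache_census: hypothetical helper that reports how many files a ccache
# directory holds and how much space du(1) says they occupy. The directory
# comes from the first argument, else $CCACHE_DIR; no default is assumed.
ccache_census() {
    dir=${1:-${CCACHE_DIR:-}}
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "ccache_census: no cache directory given or found" >&2
        return 1
    fi
    # Count regular files; tr strips the padding that BSD wc(1) adds.
    files=$(find "$dir" -type f | wc -l | tr -d ' ')
    # Allocated size in KiB as reported by the filesystem.
    kbytes=$(du -sk "$dir" | cut -f1)
    printf 'files=%s allocated_kb=%s\n' "$files" "$kbytes"
}

# Example usage (the path is only an example):
#   ccache_census /path/to/ccache; ccache -s
#   ccache -c
#   ccache_census /path/to/ccache; ccache -s
```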
   I haven't had WITH_META_MODE cause a noticeable detriment to build
times, but I have had it break builds until I ran `chflags -R noschg
/usr/obj/usr;rm -rf /usr/obj/usr;cd /usr/src&&make cleandir&&make
cleandir`. If you are trying to diagnose this for yourself and others,
though, it would be helpful to move or back up the build directory
contents instead of removing them, so they can be analyzed further.
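A minimal sketch in POSIX sh of the move-instead-of-remove approach. The function `set_aside_objdir` and its `.bak.<timestamp>` naming are hypothetical choices, not part of the FreeBSD build system. It clears the system immutable flag the same way the rm-based sequence does (only where chflags(1) exists) and then renames the object directory. The old tree is kept for later analysis while the next build starts clean.

```sh
#!/bin/sh
# set_aside_objdir: hypothetical helper that moves a build object directory
# out of the way instead of deleting it, and prints the new location.
set_aside_objdir() {
    src=$1
    if [ ! -d "$src" ]; then
        echo "set_aside_objdir: $src is not a directory" >&2
        return 1
    fi
    dest="$src.bak.$(date '+%Y%m%d-%H%M%S')"
    if [ -e "$dest" ]; then
        echo "set_aside_objdir: $dest already exists" >&2
        return 1
    fi
    # Clear schg like the rm-based sequence; skipped where chflags is absent.
    if command -v chflags >/dev/null 2>&1; then
        chflags -R noschg "$src" || return 1
    fi
    mv "$src" "$dest" || return 1
    printf '%s\n' "$dest"
}

# Example usage, as root on FreeBSD:
#   set_aside_objdir /usr/obj/usr
#   cd /usr/src && make cleandir && make cleandir
```

Renaming within the same filesystem is cheap. The trade-off is disk space: the old tree stays until you delete it yourself.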
   Does this machine have any other work running during the build that
could be hogging CPU, RAM, or disk?
   Are CPU temperatures staying in the proper range, or could thermal
throttling be ruining CPU performance? Disk I/O taking longer than
expected on a filesystem with plenty of free space and a reasonable
file/directory count could indicate a drive issue. Running SMART tests,
reseating all drive cable connections (this helps with dirt/minor
corrosion; disconnect and reconnect several times), and making sure
drive temperatures are within adequate ranges are all good steps.
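To see whether load, clock speed, or temperature drift during a long build, here is a minimal sketch in POSIX sh. The `sample_every` helper is hypothetical. It runs a command a fixed number of times at a fixed interval and timestamps each line of output, so it can run in the background next to the build. The sysctl names in the usage comment are standard FreeBSD ones. dev.cpu.0.freq comes from cpufreq(4), and dev.cpu.0.temperature only appears once coretemp(4) (Intel) or amdtemp(4) is loaded.

```sh
#!/bin/sh
# sample_every INTERVAL COUNT CMD [ARGS...]: hypothetical helper that runs
# CMD COUNT times, INTERVAL seconds apart, prefixing each output line
# (stdout and stderr) with the time the sample was taken.
sample_every() {
    interval=$1
    count=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        now=$(date '+%H:%M:%S')
        "$@" 2>&1 | while IFS= read -r line || [ -n "$line" ]; do
            printf '%s %s\n' "$now" "$line"
        done
        i=$((i + 1))
        # No sleep after the final sample.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example usage on FreeBSD: one sample a minute for 24 hours, in the background:
#   sample_every 60 1440 sysctl vm.loadavg dev.cpu.0.freq \
#       dev.cpu.0.temperature >> /tmp/build-stats.log &
```

If the frequency drops while the temperature climbs as the build progresses, that points to thermal throttling rather than disk trouble.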