Re: llvm10 build failure on Rpi3
Date: Thu, 24 Jun 2021 16:01:09 UTC
[What about trying a new kernel? details at end]

On Wed, Jun 23, 2021 at 11:02:02PM -0700, Mark Millard wrote:
> On 2021-Jun-23, at 21:30, bob prohaska <fbsd at www.zefox.net> wrote:
>
> > On Wed, Jun 23, 2021 at 04:22:35PM -0700, Mark Millard wrote:
> >> On 2021-Jun-23, at 15:28, bob prohaska <fbsd at www.zefox.net> wrote:
> >> . . .
> >
> >>
> > [snipped for brevity]
> >>
> >>>> For example, 0xA5u byte values might be the value that newly
> >>>> allocated memory is initialized to. Looking . . . man jemalloc
> >>>> (the memory allocator implementation used by FreeBSD) reports:
> >>>>
> >>>> opt.junk (const char *) r- [--enable-fill]
> >>>> Junk filling. If set to "alloc", each byte of uninitialized
> >>>> allocated memory will be initialized to 0xa5. If set to "free", all
> >>>> deallocated memory will be initialized to 0x5a. If set to "true",
> >>>> both allocated and deallocated memory will be initialized, and if
> >>>> set to "false", junk filling will be disabled entirely. This is intended
> >>>> for debugging and will impact performance negatively. This option
> >>>> is "false" by default unless --enable-debug is specified during
> >>>> configuration, in which case it is "true" by default.
> >>>>
> >>>> So, if you have junk filling enabled, I expect that you ran
> >>>> into a legitimate defect in the llvm-tblgen in use. Having
> >>>> Junk Filling disabled might be a workaround.
> >>>>
> >>>> There is /etc/malloc.conf as a way of controlling the behavior:
> >>>>
> >>>> ln -s 'junk:false' /usr/local/poudriere/poudriere-system/etc/malloc.conf
> >>>>
> >>>> I suggest you retry building after getting the above in place.
> >>>> If it does not get the 0xA5A5A5A5u value, that would be
> >>>> more evidence of an uninitialized-memory defect in the llvm-tblgen
> >>>> involved.
> >>>>
> >>> Done and running now. In the interim I tried building llvm10 using
> >>> make in /usr/ports, but it failed with another python conflict.
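
A minimal sketch of how a malloc.conf-style link behaves, assuming a POSIX system with Python 3. The scratch directory and the helper name make_option_link are made up for illustration and are not part of the poudriere setup. jemalloc takes its options from the link's target string, so the target is not meant to name a real file:

```python
import os
import tempfile


def make_option_link(directory, options="junk:false"):
    """Create a malloc.conf-style dangling symlink and return its path."""
    link = os.path.join(directory, "malloc.conf")
    # Same effect as: ln -s 'junk:false' malloc.conf
    os.symlink(options, link)
    return link


# Hypothetical scratch location, removed again when the block exits.
with tempfile.TemporaryDirectory() as tmp:
    link = make_option_link(tmp)
    target = os.readlink(link)        # the option string stored in the link
    is_link = os.path.islink(link)    # the link itself exists
    resolves = os.path.exists(link)   # follows the link, finds no such file
    try:
        open(link).close()            # what cat or more would attempt
        open_error = None
    except FileNotFoundError as exc:
        open_error = type(exc).__name__

print(target, len(target), is_link, resolves, open_error)
```

Tools that inspect the link itself (find, ls -l, readlink) see it, while anything that tries to open it follows the link to a nonexistent file and fails.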
> >>
> > The poudriere session just ended, with a somewhat different error:
> >
> > In file included from /wrkdirs/usr/ports/devel/llvm10/work/llvm-10.0.1.src/lib/Target/AArch64/AArch64InstructionSelector.cpp:312:
> > lib/Target/AArch64/AArch64GenGlobalISel.inc:1900:41: error: expected expression
> > /*GIM_CheckRegBankForClass: @0*/, /*MI*/1, /*Op*/2, /*RC*//*AArch64::FPR64RegClassID: @0*/,
> > ^
> > lib/Target/AArch64/AArch64GenGlobalISel.inc:1900:99: error: expected expression
> > /*GIM_CheckRegBankForClass: @0*/, /*MI*/1, /*Op*/2, /*RC*//*AArch64::FPR64RegClassID: @0*/,
> > ^
> > 2 errors generated.
> > [ 25% 1396/5364]
> >
> > The last line is included as a fiducial indicator. Two errors instead of
> > four, nothing about AMDGPU.
>
> You have a prior run that also showed only 2 errors:
>
> http://www.zefox.org/~bob/poudriere/data/logs/bulk/main-default/2021-06-21_12h55m51s/logs/errors/llvm10-10.0.1_5.log
>
> has:
>
> lib/Target/AMDGPU/AMDGPUGenGlobalISel.inc:15822:50: error: expected expression
> /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/0, /*RC*//*AMDGPU::VGPR_32RegClassID: @2779096485*/,
> ^
> lib/Target/AMDGPU/AMDGPUGenGlobalISel.inc:15822:118: error: expected expression
> /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/0, /*RC*//*AMDGPU::VGPR_32RegClassID: @2779096485*/,
> ^
> 2 errors generated.
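
The @2779096485 in the excerpt just above is what a 32-bit unsigned value looks like when each of its four bytes holds the 0xa5 junk-fill byte. A small sketch to check the arithmetic, assuming Python 3; the helper name fill_pattern is made up for illustration:

```python
def fill_pattern(byte, width=4):
    """Return the unsigned integer whose `width` bytes all equal `byte`."""
    return int.from_bytes(bytes([byte]) * width, "little")


alloc_junk = fill_pattern(0xA5)  # byte jemalloc writes into fresh allocations
free_junk = fill_pattern(0x5A)   # byte jemalloc writes into freed memory

print(hex(alloc_junk), alloc_junk)
print(hex(free_junk), free_junk)
```

A value of 2779096485 in generated output is therefore consistent with a 32-bit field having been read from junk-filled memory that was never written.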
>
> And a prior one that shows 6 errors but for AArch64 instead of AMDGPU:
>
> http://www.zefox.org/~bob/poudriere/data/logs/bulk/main-default/2021-06-18_19h00m47s/logs/errors/llvm10-10.0.1_5.log
>
> has:
>
> lib/Target/AArch64/AArch64GenGlobalISel.inc:3760:50: error: expected expression
> /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/1, /*Op*/1, /*RC*//*AArch64::FPR64RegClassID: @2779096485*/,
> ^
> lib/Target/AArch64/AArch64GenGlobalISel.inc:3760:117: error: expected expression
> /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/1, /*Op*/1, /*RC*//*AArch64::FPR64RegClassID: @2779096485*/,
> ^
> lib/Target/AArch64/AArch64GenGlobalISel.inc:5735:50: error: expected expression
> /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/1, /*RC*//*AArch64::GPR64RegClassID: @2779096485*/,
> ^
> lib/Target/AArch64/AArch64GenGlobalISel.inc:5735:117: error: expected expression
> /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/1, /*RC*//*AArch64::GPR64RegClassID: @2779096485*/,
> ^
> lib/Target/AArch64/AArch64GenGlobalISel.inc:22981:50: error: expected expression
> /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/1, /*RC*//*AArch64::GPR64spRegClassID: @2779096485*/,
> ^
> lib/Target/AArch64/AArch64GenGlobalISel.inc:22981:119: error: expected expression
> /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/1, /*RC*//*AArch64::GPR64spRegClassID: @2779096485*/,
> ^
> 6 errors generated.
> ninja: build stopped: subcommand failed.
> *** Error code 1
>
> It appears that the bug does not have reproducible details
> but all of the examples that do not have junk:false show
> @2779096485. (And the only junk:false tried so far has @0
> instead.)
>
> Something is providing and/or using initialized memory.
>
> There is the possibility that swapping out and back in is
> sometimes not providing pages with the intended content.
> I state that as an example that we really can not claim
> to know that llvm-tblgen itself is doing something wrong.
> I'm not claiming to know what is actually happening. But
> such would fit with contexts that have more RAM that
> end up avoiding much of the paging/swapping also not
> seeing the problem.
>
> But as in some past examples, you may have exposed a
> problem with FreeBSD.
>
> >> Interesting. I'm unable to see a:
> >>
> >> /usr/local/poudriere/poudriere-system/etc/malloc.conf
> >>
> >> via what you have published. But I've no clue if such
> >> an odd symbolic link would be expected to show up.
>
> Still true, but . . .
>
> Well, now: http://www.zefox.org/~bob/poudriere/
> shows a: junk:false
>
> Note that this is at the same level as poudriere-system/
> is shown. You might want to look and see if the file
> system shows such a file at that level as well.
>
> This did not show up until after the build attempt had
> finished from what I can tell.
>
> > The link seems visible to find and ls:
> > root@www:/usr/local/poudriere # find . -name malloc.conf
> > ./poudriere-system/etc/malloc.conf
> > root@www:/usr/local/poudriere # more ./poudriere-system/etc/malloc.conf
> > ./poudriere-system/etc/malloc.conf: No such file or directory
> > root@www:/usr/local/poudriere # ls -l ./poudriere-system/etc/malloc.conf
> > lrwxr-xr-x 1 root wheel 10 Jun 23 14:27 ./poudriere-system/etc/malloc.conf -> junk:false
> > root@www:/usr/local/poudriere #
> >
> > The link seems invisible to cat and more, reporting "No such file...."
>
> The link is looking for a file called junk:false in the same
> directory. It is not expected to find such a file.
>
> > I'm not sure what might be profitably tried next..... Suggestions welcome!
>
> First off, if the point is to get the RPi3B+ going
> more than it is to get evidence about the problem,
> I'd suggest booting an RPi4B with the same media
> (adjusting config.txt as necessary) and trying the
> build from that boot. If it builds, the media can
> be moved back to the RPi3B+ for other activity.
> The failed vs. built status does give some
> information about the problem. Built would suggest
> that paging/swapping was involved in the problem.
> Failed might suggest otherwise. (I do not know
> if there would be much paging/swapping, depending on
> how much RAM the RPi4B had.)
>
> One experiment would be to use the same boot media on
> an RPi4B but that had been told in config.txt to limit
> itself to 1 GiByte of RAM --and to also try with all
> the RAM being allowed. If the first fails but the
> second works, that is probably nice evidence. If both
> fail, that also is probably nice evidence. The other
> two combinations are less clear what any implications
> would be.
>
> (I'm not claiming that you have such a RPi4B that can
> be made available for the duration of such experiments.)
>
> Another direction is messy: testing under stable/13 and/or
> releng/13.0 vintages to see if it is somehow specific
> to main [so: 14], having an analogous context to what is
> known to fail under main (as much as reasonable). The
> RPi4B two-RAM-sizes comparison/contrast type of test could
> also be used.
>
> There is also just repeating with junk:false a couple of
> times to see if there is evidence of variability like
> there is for without junk:false. Simplest of the
> suggested tests, but likely the least informative.
>
> None of this would be likely to get close to a short,
> small test that shows the problem. I've no clue how
> to target that at this point.
>

How about booting an older kernel to see if that makes a difference?
ls -dl /boot/kernel* reports

drwxr-xr-x 2 root wheel 13824 Jun 18 18:15 /boot/kernel
drwxr-xr-x 2 root wheel 13312 Jan 9 15:57 /boot/kernel.main-c255664-g4d64c7243d26
drwxr-xr-x 2 root wheel 13312 Aug 29 2020 /boot/kernel.mmccam
drwxr-xr-x 2 root wheel 13824 Jun 9 18:52 /boot/kernel.old
drwxr-xr-x 2 root wheel 13312 Aug 27 2020 /boot/kernel.r364346
drwxr-xr-x 2 root wheel 13312 Aug 29 2020 /boot/kernel.r364895
drwxr-xr-x 2 root wheel 13312 Sep 7 2020 /boot/kernel.r365355

Most of these are probably too old to work at all, but the Jun 9 and
Jan 9 kernels might possibly work. I'd expect kernel.old to work as well.
ISTR the previous success building chromium was early 2021 or before.

Thanks for reading, any suggestions appreciated!

bob prohaska