Re: qemu-user-static aarch64 lockup/race? (was Re: Python failure in poudriere on arm64 (via qemu-user-static cross compiling))

From: Guido Falsi <mad_at_madpilot.net>
Date: Sun, 28 Jan 2024 21:43:27 UTC
On 28/01/24 22:34, Guido Falsi wrote:
> On 28/01/24 22:23, Warner Losh wrote:
>>
>>
>> On Sun, Jan 28, 2024, 12:38 PM Guido Falsi <mad@madpilot.net> wrote:
>>
>>     On 28/01/24 15:15, Guido Falsi wrote:
>>      > Hi all, again,
>>      >
>>      > I have some more findings about this. I'm top posting because the
>>      > old message is not really that relevant anymore.
>>      >
>>      > I'm now running a machine with head (commit
>>      > b32d49cfbaa0437d08e65e7cd7c82c5951b1a852, Jan 25th) with poudriere
>>      > installed in it. The machine is amd64, with an arm64 jail,
>>      > 14.0-RELEASE, installed from official distribution binaries (https
>>      > download method), with cross tools.
>>      >
>>      > To make sure everything is aligned I rebuilt everything: updated
>>      > head, rebuilt cross tools in the jail, recompiled all ports for the
>>      > host architecture and force reinstalled them, especially
>>      > qemu-user-static, and cleaned up all packages for the arm64 jail.
>>      >
>>      > If I missed something important please point it out.
>>      >
>>      > I have made some more tests and I'm getting python failures in
>>      > poudriere like the one described below from time to time (I don't
>>      > have hard stats, but it feels like a 50% chance). If I get past that
>>      > it is usually able to build all of the (not many) packages, but
>>      > locks up at:
>>      >
>>      > Creating repository in /tmp/packages:   0%
>>      >
>>
>>     BTW, forgot to mention last time this worked without issue was around
>>     20th December.
>>
>>
>> I think this is a bsd-user issue. There is a race somewhere in that 
>> code that causes the hangs. I'd love a reproducible test case that is 
>> somewhat smaller than python... there are bigger races with the newer 
>> stuff and I've not had the time to chase it there either. 😞
> 
> First of all thanks for your feedback. It encourages me having someone 
> else with better knowledge about this confirm that a race condition is 
> actually a possible cause!
> 
> Strange this has not been happening up to mid December.
> 
> My main and fully reproducible use case is actually mostly with pkg.
> 
> At the end of the run poudriere runs `pkg repo` to create the meta files 
> and sign the repo. It forks itself (ncpus + 2, I guess; even when forcing 
> it to 1 worker I see three processes), and then locks up, with all the 
> processes no longer using any CPU (ps output is in my message).
> 
> I guess this can be reproduced with any poudriere repo with more than 
> ncpus packages in it. It can also be reproduced using
> `poudriere pkgclean -u <etc>`.
> 
> If that does not work I'm not sure how else to reproduce it, but I can 
> try writing some code mocking what pkg seems to be doing. I'm not an 
> expert at such things, though.
> 

In case it helps narrow things down further, it looks like the lockup is 
happening somewhere around here:

https://github.com/freebsd/pkg/blob/56fa3f87d9d9644348b89680dfd8af47a860ee82/libpkg/pkg_repo_create.c#L778

and/or in the pkg_create_repo_worker() function here:

https://github.com/freebsd/pkg/blob/56fa3f87d9d9644348b89680dfd8af47a860ee82/libpkg/pkg_repo_create.c#L341


(I'm trying to spare you the time needed to find the actual code being 
executed. I guess you would have identified this in a few minutes 
yourself, but I'm trying to make myself useful.)
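
A stripped-down mock of that pattern might look something like the sketch 
below. It is only my assumption of what pkg is doing in outline (fork some 
workers, have each report back over its own socketpair, poll() them in the 
parent), not pkg's actual code. run_workers() and its parameters are made-up 
names for illustration.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical reproducer, NOT pkg's code: fork nworkers children, have each
 * send nmsgs one-byte messages over its own socketpair, and let the parent
 * poll() all sockets until every child has closed its end.
 * Returns the number of bytes received, or -1 on error.
 */
long
run_workers(int nworkers, int nmsgs)
{
	struct pollfd *pfd;
	long total = 0;
	int open_fds = 0;

	assert(nworkers > 0 && nmsgs >= 0);
	pfd = calloc(nworkers, sizeof(*pfd));
	if (pfd == NULL)
		return (-1);

	for (int i = 0; i < nworkers; i++) {
		int sv[2];
		pid_t pid;

		if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
			return (-1);
		pid = fork();
		if (pid == -1)
			return (-1);
		if (pid == 0) {
			/* Child: drop the parent's ends, then report "work". */
			close(sv[0]);
			for (int j = 0; j < i; j++)
				close(pfd[j].fd);
			for (int n = 0; n < nmsgs; n++)
				if (write(sv[1], "x", 1) != 1)
					_exit(1);
			_exit(0);
		}
		close(sv[1]);
		pfd[i].fd = sv[0];
		pfd[i].events = POLLIN;
		open_fds++;
	}

	/* Parent: drain every socket until each worker signals EOF. */
	while (open_fds > 0) {
		if (poll(pfd, nworkers, -1) == -1)
			return (-1);
		for (int i = 0; i < nworkers; i++) {
			char buf[64];
			ssize_t r;

			if (pfd[i].fd == -1 || pfd[i].revents == 0)
				continue;
			r = read(pfd[i].fd, buf, sizeof(buf));
			if (r > 0) {
				total += r;
				continue;
			}
			close(pfd[i].fd);
			pfd[i].fd = -1;	/* poll() ignores negative fds */
			open_fds--;
		}
	}

	/* Reap all children so none is left as a zombie. */
	while (wait(NULL) > 0)
		;
	free(pfd);
	return (total);
}
```

Calling run_workers() in a loop from a small main(), built for aarch64 and 
run under qemu-user-static, would be one way to check whether the hang shows 
up outside of pkg. I have not verified that it does.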

-- 
Guido Falsi <mad@madpilot.net>