Re: qemu-user-static aarch64 lockup/race? (was Re: Python failure in poudriere on arm64 (via qemu-user-static cross compiling))

From: Guido Falsi <mad_at_madpilot.net>
Date: Mon, 29 Jan 2024 08:26:58 UTC
On 29/01/24 02:10, Warner Losh wrote:
> 
> 
> On Sun, Jan 28, 2024 at 4:45 PM Nathan Reilly-list <lists@nreilly.com> wrote:
> 
> 
> 
>>     On 29 Jan 2024, at 8:43 am, Guido Falsi <mad@madpilot.net> wrote:
>>     On 28/01/24 22:34, Guido Falsi wrote:
>>>     On 28/01/24 22:23, Warner Losh wrote:
>>>>     On Sun, Jan 28, 2024, 12:38 PM Guido Falsi <mad@madpilot.net> wrote:
>>>>
>>>>         On 28/01/24 15:15, Guido Falsi wrote:
>>>>         [snip]
>>>>          > Creating repository in /tmp/packages:   0%
>>>>          >
>>>>
>>>>         BTW, I forgot to mention: the last time this worked without
>>>>         issue was around 20th December.
>>>>
>>>>
>>>>     I think this is a bsd-user issue. There is a race somewhere in
>>>>     that code that causes the hangs. I'd love a reproducible test
>>>>     case that is somewhat smaller than python... there are bigger
>>>>     races with the newer stuff and I've not had the time to chase it
>>>>     there either. 😞
>>>     First of all, thanks for your feedback. It is encouraging to have
>>>     someone with better knowledge of this confirm that a race
>>>     condition is actually a possible cause!
>>>     It is strange that this did not happen until mid-December.
>>>     My main, fully reproducible case is actually with pkg.
>>>     At the end of the run, poudriere runs `pkg repo` to create the
>>>     meta files and sign the repo. It forks itself (ncpus + 2 workers, I
>>>     guess; even when forcing it to 1 worker I see three processes) and
>>>     then locks up, with all the processes no longer using any CPU (ps
>>>     output is in my message).
>>>     I guess this can be reproduced with any poudriere repo with
>>>     more than ncpus packages in it. It can also be reproduced
>>>     using `poudriere pkgclean -u <etc>`.
>>>     If that does not work, I'm not sure how else to reproduce it,
>>>     but I can try writing some code mocking what pkg seems to
>>>     be doing. I'm not an expert at such things, though.
>>
>>     In case it helps narrow things down further, it looks like the
>>     lockup is happening somewhere around here:
>>
>>     https://github.com/freebsd/pkg/blob/56fa3f87d9d9644348b89680dfd8af47a860ee82/libpkg/pkg_repo_create.c#L778
>>
>>     and/or in the pkg_create_repo_worker() function here:
>>
>>     https://github.com/freebsd/pkg/blob/56fa3f87d9d9644348b89680dfd8af47a860ee82/libpkg/pkg_repo_create.c#L341
>>
>>
>>     (I'm trying to spare you the time needed to find the actual code
>>     being executed. I guess you would have identified this in a few
>>     minutes yourself, but I'm trying to make myself useful.)
> 
> 
>     There appears to be a GitHub issue for poudriere about this, but
>     it seems to be looking in another direction.
> 
>     https://github.com/freebsd/poudriere/issues/1009
> 

This one looks quite similar.

In my case the ports/pkg are aligned between host and jail, in fact I 
have built them from the exact same git checkout.

I noticed pkg head has been converted to use pthreads instead of fork;
maybe that could help. I will make time to perform some testing.
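
Something along these lines might serve as a starting point for a test
case smaller than pkg itself. It is only a rough mock of a generic
fork + pipe + poll pattern and is not taken from the pkg source;
run_workers() and the one-byte message format are made up for
illustration, and I have not confirmed that it triggers the hang under
qemu-user-static.

```c
/* Hypothetical mock: forked workers report to the parent over a pipe.
 * Not derived from pkg; all names here are invented for illustration. */
#include <assert.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap every child of this process; return 0 only if all exited cleanly. */
static int reap_all(void)
{
    int status, ok = 0;

    while (wait(&status) > 0) {
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            ok = -1;
    }
    return ok;
}

/* Fork nworkers children; each writes nmsgs one-byte messages to a shared
 * pipe. The parent poll()s the read end until EOF, reaps the children and
 * returns the number of bytes received, or -1 on any error. */
int run_workers(int nworkers, int nmsgs)
{
    int fds[2];
    int total = 0;

    assert(nworkers > 0 && nmsgs >= 0);
    if (pipe(fds) == -1)
        return -1;

    for (int i = 0; i < nworkers; i++) {
        pid_t pid = fork();

        if (pid == -1) {
            /* Give up: close the pipe and collect any workers started. */
            close(fds[0]);
            close(fds[1]);
            reap_all();
            return -1;
        }
        if (pid == 0) {
            /* Worker: only writes, then exits without running atexit. */
            close(fds[0]);
            for (int m = 0; m < nmsgs; m++) {
                char c = (char)('a' + i % 26);

                if (write(fds[1], &c, 1) != 1)
                    _exit(1);
            }
            _exit(0);
        }
    }

    /* The parent must drop its write end, or EOF never arrives. */
    close(fds[1]);

    struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
    for (;;) {
        char buf[256];
        ssize_t n;

        if (poll(&pfd, 1, -1) == -1) {
            total = -1;
            break;
        }
        n = read(fds[0], buf, sizeof(buf));
        if (n == 0)          /* all workers closed their write end */
            break;
        if (n < 0) {
            total = -1;
            break;
        }
        total += (int)n;
    }
    close(fds[0]);

    if (reap_all() != 0)
        total = -1;
    return total;
}
```

Calling run_workers() in a loop from a small main() inside the arm64
jail would be the obvious thing to try; on a healthy system each call
should return nworkers * nmsgs.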

> 
> There's a FreeBSD bug saying this is happening w/o qemu in the loop:
> https://bugs.freebsd.org/276690
> At least I think that's similar.

There are similarities, but they are looking at the compiler, which is
unrelated to pkg-repo getting stuck. That's what I'm concentrating
on at present.

Also, the sporadic issue with python is not due to the compiler; it is
the python binaries running during the build that cause the issues.
> 
> Warner

-- 
Guido Falsi <mad@madpilot.net>