Re: qemu-user-static aarch64 lockup/race? (was Re: Python failure in poudriere on arm64 (via qemu-user-static cross compiling))

From: Guido Falsi <mad_at_madpilot.net>
Date: Mon, 29 Jan 2024 15:47:57 UTC
On 29/01/24 09:26, Guido Falsi wrote:
> On 29/01/24 02:10, Warner Losh wrote:
>>
>>
>> On Sun, Jan 28, 2024 at 4:45 PM Nathan Reilly-list <lists@nreilly.com> wrote:
>>
>>
>>
>>>     On 29 Jan 2024, at 8:43 am, Guido Falsi <mad@madpilot.net> wrote:
>>>     On 28/01/24 22:34, Guido Falsi wrote:
>>>>     On 28/01/24 22:23, Warner Losh wrote:
>>>>>     On Sun, Jan 28, 2024, 12:38 PM Guido Falsi <mad@madpilot.net> wrote:
>>>>>
>>>>>         On 28/01/24 15:15, Guido Falsi wrote:
>>>>>         [snip]
>>>>>          > Creating repository in /tmp/packages:   0%
>>>>>          >
>>>>>
>>>>>         BTW, I forgot to mention: the last time this worked without
>>>>>         issue was around 20th December.
>>>>>
>>>>>
>>>>>     I think this is a bsd-user issue. There is a race somewhere in
>>>>>     that code that causes the hangs. I'd love a reproducible test
>>>>>     case that is somewhat smaller than python... there are bigger
>>>>>     races with the newer stuff and I've not had the time to chase it
>>>>>     there either. 😞
>>>>     First of all, thanks for your feedback. It is encouraging to have
>>>>     someone with better knowledge of this confirm that a race
>>>>     condition is actually a possible cause!
>>>>     It is strange that this did not happen until mid-December.
>>>>     My main and fully reproducible use case is actually mostly with
>>>>     pkg. At the end of the run poudriere runs `pkg repo` to create the
>>>>     meta files and sign the repo. It forks itself (ncpus + 2 I guess;
>>>>     even when forcing it to 1 worker I see three processes), and then
>>>>     locks up, with all the processes no longer using CPU (ps output is
>>>>     in my message).
>>>>     I guess this can be reproduced with any poudriere repo with
>>>>     more than ncpus packages in it. It can also be reproduced
>>>>     using `poudriere pkgclean -u <etc>`.
>>>>     If that does not work I'm not sure how else to reproduce it,
>>>>     but I can try writing some code mocking what pkg seems to
>>>>     be doing; I'm not an expert at such things, though.
>>>
>>>     In case it helps further narrow things down, it looks like the
>>>     lockup is happening somewhere around here:
>>>
>>>     
>>>     https://github.com/freebsd/pkg/blob/56fa3f87d9d9644348b89680dfd8af47a860ee82/libpkg/pkg_repo_create.c#L778
>>>
>>>     and/or in the pkg_create_repo_worker() function here:
>>>
>>>     
>>>     https://github.com/freebsd/pkg/blob/56fa3f87d9d9644348b89680dfd8af47a860ee82/libpkg/pkg_repo_create.c#L341
>>>
>>>
>>>     (I'm trying to spare you the time needed to find the actual code
>>>     being executed, I guess you would have identified this in a few
>>>     minutes yourself, but I'm trying to make myself useful)
>>
>>
>>     There appears to be a GitHub issue for poudriere about this, but it
>>     seems to be looking in another direction.
>>
>>     https://github.com/freebsd/poudriere/issues/1009
>>
> 
> This one looks quite similar.
> 
> In my case the ports/pkg are aligned between host and jail, in fact I 
> have built them from the exact same git checkout.
> 
> I noticed pkg head has been converted to use pthreads instead of fork;
> maybe that could help. I will make time to perform some testing.

Thanks for pointing me here. It looks like this was "it": with that issue
fixed, poudriere uses the native pkg-static, which sidesteps the problem.


Unfortunately, there ARE qemu races and lockups that prevent the arm64
pkg-static binary from being correctly emulated by qemu-user-static. Such
conditions also cause sporadic failures in some ports being built.

I filed a PR with a fix for that issue:

https://github.com/freebsd/poudriere/pull/1115
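
In the spirit of the smaller test case Warner asked for, here is a minimal sketch of the process shape described earlier in the thread for `pkg repo`: a parent forks a few workers, each does a bit of work and reports back over a pipe, and the parent waits for all of them. It is written in Python only for brevity. The name `run_forked_workers` and the pipe-based reporting are assumptions made for illustration, not a transcription of pkg's C code, and it has not been verified to trigger the hang under qemu-user-static.

```python
import os


def run_forked_workers(n_workers, items):
    """Fork n_workers children; each sums the lengths of its share of items.

    Hypothetical stand-in for a fork-based worker pool; not pkg's actual code.
    """
    children = []  # (pid, read end of that child's pipe)
    for w in range(n_workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: do the work, report one line to the parent, then exit
            # immediately without running the parent's cleanup handlers.
            status = 1
            try:
                os.close(read_fd)
                total = sum(len(s) for s in items[w::n_workers])
                os.write(write_fd, f"{total}\n".encode())
                status = 0
            finally:
                os._exit(status)
        # Parent: keep only the read end of this child's pipe.
        os.close(write_fd)
        children.append((pid, read_fd))

    results = []
    for pid, read_fd in children:
        # Blocks until the child closes its write end. A lockup like the one
        # described would show up as the parent stuck here or in waitpid().
        with os.fdopen(read_fd) as pipe:
            results.append(int(pipe.read()))
        _, status = os.waitpid(pid, 0)
        if not os.WIFEXITED(status) or os.WEXITSTATUS(status) != 0:
            raise RuntimeError(f"worker {pid} failed")
    return results
```

Calling `run_forked_workers(3, ["a", "bb", "ccc", "dddd"])` natively returns `[5, 2, 3]`. Running the same script in a loop inside the emulated jail might show whether plain fork, pipe, and wait are enough to hit the race, or whether something more specific to pkg is involved.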


-- 
Guido Falsi <mad@madpilot.net>