From nobody Mon Jan 29 15:53:54 2024
List-Id: Porting FreeBSD to ARM processors
List-Archive: https://lists.freebsd.org/archives/freebsd-arm
Sender: owner-freebsd-arm@freebsd.org
MIME-Version: 1.0
References: <6a33726b-eb6f-418e-9fbd-6d0b9b4bfaa8@madpilot.net>
 <0fc7f929-6e5b-4a33-97d2-8a9c0c07d524@madpilot.net>
 <79a5eb0f-d04e-4c1a-9d8a-185e1fb4e4a2@madpilot.net>
 <5ef2ab66-25ef-45f1-aa5a-4b614eab2f40@madpilot.net>
 <990427ae-0491-463e-92c7-c74700deb6fa@madpilot.net>
In-Reply-To: <990427ae-0491-463e-92c7-c74700deb6fa@madpilot.net>
From: Warner Losh
Date: Mon, 29 Jan 2024 08:53:54 -0700
Subject: Re: qemu-user-static aarch64 lockup/race? (was Re: Python failure in poudriere on arm64 (via qemu-user-static cross compiling))
To: Guido Falsi
Cc: Nathan Reilly-list, emulation@freebsd.org, "freebsd-arm@freebsd.org", freebsd-pkg@freebsd.org
Content-Type: multipart/alternative; boundary="000000000000cd7156061017a743"

--000000000000cd7156061017a743
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Mon, Jan 29, 2024, 8:48 AM Guido Falsi wrote:

> On 29/01/24 09:26, Guido Falsi wrote:
> > On 29/01/24 02:10, Warner Losh wrote:
> >>
> >> On Sun, Jan 28, 2024 at 4:45 PM Nathan Reilly-list wrote:
> >>
> >>>     On 29 Jan 2024, at 8:43 am, Guido Falsi wrote:
> >>>     On 28/01/24 22:34, Guido Falsi wrote:
> >>>>     On 28/01/24 22:23, Warner Losh wrote:
> >>>>>     On Sun, Jan 28, 2024, 12:38 PM Guido Falsi wrote:
> >>>>>
> >>>>>         On 28/01/24 15:15, Guido Falsi wrote:
> >>>>>         [snip]
> >>>>>          > Creating repository in /tmp/packages:   0%
> >>>>>          >
> >>>>>
> >>>>>         BTW, forgot to mention: the last time this worked
> >>>>>         without issue was around 20th December.
> >>>>>
> >>>>>
> >>>>>     I think this is a bsd-user issue. There is a race somewhere in
> >>>>>     that code that causes the hangs. I'd love a reproducible test
> >>>>>     case that is somewhat smaller than python... there are bigger
> >>>>>     races with the newer stuff and I've not had the time to chase it
> >>>>>     there either. 😞
> >>>>     First of all, thanks for your feedback. It is encouraging to have
> >>>>     someone with better knowledge of this confirm that a race
> >>>>     condition is actually a possible cause!
> >>>>     Strange that this had not been happening up to mid-December.
> >>>>     My main and fully reproducible use case is actually mostly with
> >>>>     pkg.
> >>>>     At the end of the run, poudriere runs `pkg repo` to create the
> >>>>     meta files and sign the repo. It forks itself (ncpus + 2 I guess;
> >>>>     even forcing it to 1 worker I see three processes), and then
> >>>>     locks up, with all the processes no longer using CPU (ps output
> >>>>     is in my message).
> >>>>     I guess this can be reproduced with any poudriere repo with
> >>>>     more than ncpus packages in it. It can also be reproduced
> >>>>     using `poudriere pkgclean -u <etc>`.
> >>>>     If that does not work I'm not sure how to reproduce it in other
> >>>>     ways, but I can try writing some code mocking what pkg seems to
> >>>>     be doing; I'm not an expert at such things, though.
> >>>
> >>>     In case it helps narrow things down further, it looks like the
> >>>     lockup is happening somewhere around here:
> >>>
> >>>     https://github.com/freebsd/pkg/blob/56fa3f87d9d9644348b89680dfd8af47a860ee82/libpkg/pkg_repo_create.c#L778
> >>>
> >>>     and/or in the pkg_create_repo_worker() function here:
> >>>
> >>>     https://github.com/freebsd/pkg/blob/56fa3f87d9d9644348b89680dfd8af47a860ee82/libpkg/pkg_repo_create.c#L341
> >>>
> >>>     (I'm trying to spare you the time needed to find the actual code
> >>>     being executed. I guess you would have identified this in a few
> >>>     minutes yourself, but I'm trying to make myself useful.)
> >>
> >>     There appears to be a GitHub issue for poudriere about this, but
> >>     it seems to be looking in another direction.
> >>
> >>     https://github.com/freebsd/poudriere/issues/1009
> >>
> >
> > This one looks quite similar.
> >
> > In my case the ports/pkg are aligned between host and jail; in fact I
> > have built them from the exact same git checkout.
> >
> > I noticed pkg head has been converted to using pthreads instead of fork;
> > maybe that could help. I will make time to perform some testing.
>
> Thanks for pointing me here. It looks like this was "it", in that
> fixing this issue makes poudriere use the native pkg-static, which
> sidesteps the problem.
>
> Unfortunately, there ARE qemu races and lockups that prevent the arm64
> pkg-static binary from being correctly emulated by qemu-user-static. Such
> conditions also cause sporadic failures in some ports being built.
>
> I filed a PR with a fix for that issue:
>
> https://github.com/freebsd/poudriere/pull/1115

OK. This dodges the problem, but it papers over things.

Any chance you could give me the state of pkg before + the package added as
a test case for qemu?

Warner

>
> --
> Guido Falsi
>

--000000000000cd7156061017a743
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable


--000000000000cd7156061017a743--