Re: armv7-on-aarch64 stuck at urdlck

From: Konstantin Belousov <kib_at_freebsd.org>
Date: Wed, 24 Jul 2024 15:47:16 UTC
On Wed, Jul 24, 2024 at 01:07:39PM +0000, John F Carr wrote:
> 
> 
> > On Jul 24, 2024, at 06:50, Konstantin Belousov <kib@freebsd.org> wrote:
> > 
> > On Wed, Jul 24, 2024 at 12:34:57PM +0200, mmel@freebsd.org wrote:
> >> 
> >> 
> >> On 24.07.2024 12:24, Konstantin Belousov wrote:
> >>> On Tue, Jul 23, 2024 at 08:11:13PM +0000, John F Carr wrote:
> >>>> On Jul 23, 2024, at 13:46, Michal Meloun <meloun.michal@gmail.com> wrote:
> >>>>> 
> >>>>> On 23.07.2024 11:36, Konstantin Belousov wrote:
> >>>>>> On Tue, Jul 23, 2024 at 09:53:41AM +0200, Michal Meloun wrote:
> >>>>>>> The good news is that I'm finally able to generate a working/locking
> >>>>>>> test case.  The culprit (at least for me) is if "-mcpu" is used when
> >>>>>>> compiling libthr (e.g. indirectly injected via CPUTYPE in /etc/make.conf).
> >>>>>>> If it is not used, libthr is broken (regardless of -O level or debug/normal
> >>>>>>> build), but -mcpu=cortex-a15 will always produce a working libthr.
> >>>>>> I think this is very significant progress.
> >>>>>> Do you plan to drill down more to see what is going on?
> >>>>> 
> >>>>> So the problem is now clear, and I fear it may apply to other architectures as well.
> >>>>> dlopen_object() (from rtld_elf),
> >>>>> https://cgit.freebsd.org/src/tree/libexec/rtld-elf/rtld.c#n3766,
> >>>>> holds the rtld_bind_lock write lock for almost the entire time a new library is loaded.
> >>>>> If the code uses a yet unresolved symbol to load the library, the rtl_bind() function attempts to get read lock of  rtld_bind_lock and a deadlock occurs.
> >>>>> 
> >>>>> In this case, it round_up() in _thr_stack_fix_protection,
> >>>>> https://cgit.freebsd.org/src/tree/lib/libthr/thread/thr_stack.c#n136.
> >>>>> Issued by __aeabi_uidiv (since not all armv7 processors support HW divide).
> >>>>> 
> >>>>> Unfortunately, I'm not sure how to fix it.  The compiler can emit __aeabi_<> in any place, and I'm not sure if it can resolve all the symbols used by rtld_eld and libthr beforehand.
> >>>>> 
> >>>>> 
> >>>>> Michal
> >>>>> 
> >>>> 
> >>>> In this case (but not for all _aeabi_ functions) we can avoid division
> >>>> as long as page size is a power of 2.
> >>>> 
> >>>> The function is
> >>>> 
> >>>>   static inline size_t
> >>>>   round_up(size_t size)
> >>>>   {
> >>>>    if (size % _thr_page_size != 0)
> >>>>    size = ((size / _thr_page_size) + 1) *
> >>>>        _thr_page_size;
> >>>>    return size;
> >>>>   }
> >>>> 
> >>>> The body can be condensed to
> >>>> 
> >>>>   return (size + _thr_page_size - 1) & ~(_thr_page_size - 1);
> >>>> 
> >>>> This is shorter in both lines of code and instruction bytes.
> >>> 
> >>> Lets not allow this to be lost.  Could anybody confirm that the patch
> >>> below fixes the issue?
> >>> 
> >>> commit d560f4f6690a48476565278fd07ca131bf4eeb3c
> >>> Author: Konstantin Belousov <kib@FreeBSD.org>
> >>> Date:   Wed Jul 24 13:17:55 2024 +0300
> >>> 
> >>>     rtld: avoid division in __thr_map_stacks_exec()
> >>>     The function is called by rtld with the rtld bind lock write-locked,
> >>>     when fixing the stack permission during dso load.  Not every ARMv7 CPU
> >>>     supports the div, which causes the recursive entry into rtld to resolve
> >>>     the  __aeabi_uidiv symbol, causing self-lock.
> >>>     Workaround the problem by using roundup2() instead of open-coding less
> >>>     efficient formula.
> >>>     Diagnosed by:   mmel
> >>>     Based on submission by: John F Carr <jfc@mit.edu>
> >>>     Sponsored by:   The FreeBSD Foundation
> >>>     MFC after:      1 week
> >>> 
> > Just realized that it is wrong.  Stack size is user-controlled and it does
> > not need to be power of two.
> 
> Your change is correct.  _thr_page_size is set to getpagesize(),
> which is a power of 2.   The call to roundup2 takes a user-provided
> size and rounds it up to a multiple of the system page size.
> 
> I tested the change and it works.  My change also works and
> should compile to identical code.  I forgot there was a standard
> function to do the rounding.
Right, my bad, thank you for correcting my thinko.

> 
> > For final resolving of deadlocks, after a full day of digging, I'm very much
> >> incline  of adding -znow to the linker flags for libthr.so (and maybe also
> >> for ld-elf.so). The runtime cost of resolving all symbols at startup is very
> >> low. Direct pre-solving in _thr_rtld_init() is problematic for the _aeabi_*
> >> symbols, since they don't have an official C prototypes, and some are not
> >> compatible with C calling conventions.
> > I do not like it. `-z now' changes (breaks) the ABI and makes some symbols
> > not preemtible.
> > 
> > In the worst case, we would need a call to the asm routine which causes the
> > resolution of the _eabi_* symbols on arm.
> > 
> 
> It would also be possible to link libthr with libgcc.a and use a linker map
> to hide the _eabi_ symbols.
In principle yes, but if the ARM ABI states that _eabi symbols must be used,
and exported from libc, then this is also some form of ABI breakage.