Re: armv7-on-aarch64 stuck at urdlck

From: Mark Millard <marklmi_at_yahoo.com>
Date: Mon, 22 Jul 2024 17:27:20 UTC
On Jul 22, 2024, at 09:41, meloun.michal@gmail.com wrote:

> On 22.07.2024 18:26, Mark Millard wrote:
>> On Jul 22, 2024, at 06:40, Michal Meloun <meloun.michal@gmail.com> wrote:
>>> On 22.07.2024 13:46, Mark Millard wrote:
>>>> On Jul 21, 2024, at 22:59, Michal Meloun <meloun.michal@gmail.com> wrote:
>>>>> I don't want to hijack the original thread, so I'm replying in a new one.
>>>>> 
>>>>> My Tegra tracks -current and has been running 24/7, building kernel/world and kde5 in a loop, for a few years now. But I have never encountered the aforementioned lockup on native armv7.
>>>>> 
>>>>> I have seen a usermode mutex lockup in an arm32 jail on aarch64, but only very rarely (once a month or so), and all my attempts to reproduce it in a more deterministic way have failed. Also, I don't think I've ever seen this with the debug version of libc.
>>>>> 
>>>>> Unfortunately, I also failed to reproduce the given lockup using dlopen_test.c, either on native armv7 or in an arm32 jail.
>>>>> 
>>>>> Michal Meloun
>>>> What is the output of:
>>>> # readelf -a /libexec/ld-elf.so.1 | grep -E "(^[^ 0-9]|.*_rtld_get_stack_prot)"
>>>> in your armv7 context(s)? Does it include the likes of:
>>>> QUOTE
>>>> Symbol table '.symtab' contains 911 entries:
>>>>  903: 000000000001b9ac    16 FUNC    GLOBAL DEFAULT   11 _rtld_get_stack_prot
>>>> END QUOTE
>>>> vs. not?
>>>> Note that the "debug version of libc" being involved likely means that
>>>> DEBUG_FLAGS was defined. That in turn likely means that strip is not
>>>> being used. In such a case, I expect that the .symtab entry for
>>>> _rtld_get_stack_prot (and more) exists for such a context.
>>> At this time, I have the standard (thus stripped, non-debug) version of the runtime linker library installed. Thus it has only a dynamic relocation record for _rtld_get_stack_prot:
>>> 
>>> root@tegra124:~/dlopen_test # readelf -a /libexec/ld-elf.so.1 | grep -E "(^[^ 0-9]|.*_rtld_get_stack_prot)"
>>> ELF Header:
>>> Elf file type is DYN (Shared object file)
>>> Entry point 0x1449c
>>> There are 10 program headers, starting at offset 52
>>> Program Headers:
>>> There are 23 section headers, starting at offset 0x1a448:
>>> Section Headers:
>>> Key to Flags:
>>> Dynamic section at offset 0x19fa4 contains 15 entries:
>>> Relocation section (.rel.dyn):
>>> r_offset r_info   r_type              st_value st_name
>>> Symbol table '.dynsym' contains 27 entries:
>>>     5: 000000000001ba0c    16 FUNC    GLOBAL DEFAULT   12 _rtld_get_stack_prot@@FBSDprivate_1.0 (11)
>>> Notes at offset 0x00000174 with length 0x00000018:
>>> Histogram for bucket list length (total of 6 buckets):
>>> Histogram for bucket list length (total of 27 buckets):
>>> Version symbol section (.gnu.version):
>>> Version definition section (.gnu.version_d):
>>> Attribute Section: aeabi
>>> 
>>> ------
>>> 
>>> root@tegra124:~/dlopen_test # ./dlopen_test
>>> root@tegra124:~/dlopen_test #
>> Just to be sure . . .
>> Did you at some point "pkg install cairo" (or analogous) so that
>> the following (or some vintage) were in place?
>> # ls -lodT /usr/local/lib/libcairo.so*
>> lrwxr-xr-x  1 root wheel -      21 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so -> libcairo.so.2.11704.0
>> lrwxr-xr-x  1 root wheel -      21 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so.2 -> libcairo.so.2.11704.0
>> -rwxr-xr-x  1 root wheel - 1118272 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so.2.11704.0
>> # file /usr/local/lib/libcairo.so.2.11704.0
>> /usr/local/lib/libcairo.so.2.11704.0: ELF 32-bit LSB shared object, ARM, EABI5 version 1 (FreeBSD), dynamically linked, for FreeBSD 15.0 (1500018), stripped
>> (Installing cairo would also install other things it needs.)
>> For the failing contexts, the a.out from dlopen_test.c will only
>> hang if the library (and what it requires) is actually there to
>> load.
> Yep, I have cairo installed (but compiled from sources, not installed by pkg). And I have verified that dlopen() returns success.
> In the meantime I tried all combinations (debug/stripped) of ld_elf and libthr. All combinations work without problems on the native system and in the arm32 jail.

Thanks for the information. My personal builds, which are the
ones that work in my testing, are built on aarch64 as armv7
instead of on amd64. The known failing ones are built on amd64.
But I've no more specific information suggesting a tie to the
type of build host for the world used.

> Btw, gdb has long had problems with stepping inside ld_elf. It's better to start the test program without gdb and then attach to the running program to get the "correct" stack trace.
> 

In part I was deliberately exploring what sequence leads to the
hangups vs. lack of hangups and the like: more context than a
backtrace of the stuck state can provide.

But doing "./a.out &" and then "gdb -p..." to attach to it:

_umtx_op () at _umtx_op.S:4

warning: 4 _umtx_op.S: No such file or directory
(gdb) bt
#0  _umtx_op () at _umtx_op.S:4
#1  0x2036845c in _umtx_op_err (obj=0x4, op=12, val=0, uaddr=0x0, uaddr2=0x0) at /home/pkgbuild/worktrees/main/lib/libsys/_umtx_op_err.c:36
#2  0x20115da8 in __thr_rwlock_rdlock (rwlock=0x4, rwlock@entry=0x20137c40, flags=3, tsp=<optimized out>) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_umtx.c:294
#3  0x2010ebf4 in _thr_rwlock_rdlock (rwlock=0x20137c40, flags=0, tsp=0x0) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_umtx.h:229
#4  _thr_rtld_rlock_acquire (lock=0x20137c40) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_rtld.c:121
#5  0x20060788 in rlock_acquire (lock=0x2008af10 <rtld_locks>, lockstate=lockstate@entry=0xffffd114) at /home/pkgbuild/worktrees/main/libexec/rtld-elf/rtld_lock.c:259
#6  0x20059098 in _rtld_bind (obj=0x2008f404, reloff=496) at /home/pkgbuild/worktrees/main/libexec/rtld-elf/rtld.c:1035
#7  0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
#8  0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
#9  0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
. . .

It does not seem significantly different from what I'd reported
for the hung-up state.

An issue here is that the pkgbase world is possibly built with
-O2 despite having debug information (though it is stripped).
This can make the reported details less reliable. So, for example,
the rwlock=0x4 vs. rwlock@entry=0x20137c40 for __thr_rwlock_rdlock
could well be suspect.


===
Mark Millard
marklmi at yahoo.com