Re: armv7-on-aarch64 stuck at urdlck

From: Michal Meloun <meloun.michal_at_gmail.com>
Date: Mon, 22 Jul 2024 19:36:00 UTC
On 22. 7. 2024 19:27, Mark Millard wrote:
> On Jul 22, 2024, at 09:41, meloun.michal@gmail.com wrote:
>
>> On 22.07.2024 18:26, Mark Millard wrote:
>>> On Jul 22, 2024, at 06:40, Michal Meloun <meloun.michal@gmail.com> wrote:
>>>> On 22.07.2024 13:46, Mark Millard wrote:
>>>>> On Jul 21, 2024, at 22:59, Michal Meloun <meloun.michal@gmail.com> wrote:
>>>>>> I don't want to hijack the original thread, so I'm replying in a new one.
>>>>>>
>>>>>> My Tegra tracks -CURRENT and has been running 24/7, building kernel/world and kde5 in a loop, for a few years now. But I have never encountered the aforementioned lockup on native armv7.
>>>>>>
>>>>>> I have seen a usermode mutex lockup in an arm32 jail on aarch64, but only very rarely (once a month or so), and all my attempts to reproduce it in a more deterministic way have failed. Also, I don't think I've ever seen this with the debug version of libc.
>>>>>>
>>>>>> Unfortunately, I also failed to reproduce the given lockup using dlopen_test.c, either on native armv7 or in an arm32 jail.
>>>>>>
>>>>>> Michal Meloun
>>>>> What is the output of:
>>>>> # readelf -a /libexec/ld-elf.so.1 | grep -E "(^[^ 0-9]|.*_rtld_get_stack_prot)"
>>>>> in your armv7 context(s)? Does it include the likes of:
>>>>> QUOTE
>>>>> Symbol table '.symtab' contains 911 entries:
>>>>>   903: 000000000001b9ac    16 FUNC    GLOBAL DEFAULT   11 _rtld_get_stack_prot
>>>>> END QUOTE
>>>>> vs. not?
>>>>> Note that the "debug version of libc" being involved likely means that
>>>>> DEBUG_FLAGS was defined. That in turn likely means that strip is not
>>>>> being used. In such a case, I expect that the .symtab entry for
>>>>> _rtld_get_stack_prot (and more) exists for such a context.
>>>> At this time, I have the standard (thus stripped, non-debug) version of the runtime linker installed. Thus it has only a dynamic relocation record for _rtld_get_stack_prot:
>>>>
>>>> root@tegra124:~/dlopen_test # readelf -a /libexec/ld-elf.so.1 | grep -E "(^[^ 0-9]|.*_rtld_get_stack_prot)"
>>>> ELF Header:
>>>> Elf file type is DYN (Shared object file)
>>>> Entry point 0x1449c
>>>> There are 10 program headers, starting at offset 52
>>>> Program Headers:
>>>> There are 23 section headers, starting at offset 0x1a448:
>>>> Section Headers:
>>>> Key to Flags:
>>>> Dynamic section at offset 0x19fa4 contains 15 entries:
>>>> Relocation section (.rel.dyn):
>>>> r_offset r_info   r_type              st_value st_name
>>>> Symbol table '.dynsym' contains 27 entries:
>>>>      5: 000000000001ba0c    16 FUNC    GLOBAL DEFAULT   12 _rtld_get_stack_prot@@FBSDprivate_1.0 (11)
>>>> Notes at offset 0x00000174 with length 0x00000018:
>>>> Histogram for bucket list length (total of 6 buckets):
>>>> Histogram for bucket list length (total of 27 buckets):
>>>> Version symbol section (.gnu.version):
>>>> Version definition section (.gnu.version_d):
>>>> Attribute Section: aeabi
>>>>
>>>> ------
>>>>
>>>> root@tegra124:~/dlopen_test # ./dlopen_test
>>>> root@tegra124:~/dlopen_test #
>>> Just to be sure . . .
>>> Did you at some point "pkg install cairo" (or analogous) so that
>>> the following (or some vintage) were in place?
>>> # ls -lodT /usr/local/lib/libcairo.so*
>>> lrwxr-xr-x  1 root wheel -      21 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so -> libcairo.so.2.11704.0
>>> lrwxr-xr-x  1 root wheel -      21 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so.2 -> libcairo.so.2.11704.0
>>> -rwxr-xr-x  1 root wheel - 1118272 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so.2.11704.0
>>> # file /usr/local/lib/libcairo.so.2.11704.0
>>> /usr/local/lib/libcairo.so.2.11704.0: ELF 32-bit LSB shared object, ARM, EABI5 version 1 (FreeBSD), dynamically linked, for FreeBSD 15.0 (1500018), stripped
>>> (Installing cairo would also install other things it needs.)
>>> For the failing contexts, the a.out from dlopen_test.c will only
>>> hang if the library (and what it requires) is actually there to
>>> load.
>> Yep, I have cairo installed (but compiled from sources, not installed by pkg). And I have verified that dlopen() returns success.
>> In the meantime I tried all combinations (debug/stripped) of ld_elf and libthr. All combinations work without problems on the native system and in an arm32 jail.
> Thanks for the information. My personal builds, which are the
> ones that work in my testing, are built on aarch64 as armv7
> instead of on amd64. The known failing ones are built on amd64.
> But I've no more specific information suggesting a tie to the
> type of build host for the world used.
>
>> Btw, gdb has long had problems with stepping inside ld_elf. It's better to run the test program without it and attach to the running program afterwards to get the "correct" stack trace.
>>
> In part I was deliberately exploring what sequence leads to the
> hangups vs. lack of hangups and the like: more context than a
> backtrace of the stuck state can provide.
>
> But doing "./a.out &" and then "gdb -p..." to attach to it:
>
> _umtx_op () at _umtx_op.S:4
>
> warning: 4 _umtx_op.S: No such file or directory
> (gdb) bt
> #0  _umtx_op () at _umtx_op.S:4
> #1  0x2036845c in _umtx_op_err (obj=0x4, op=12, val=0, uaddr=0x0, uaddr2=0x0) at /home/pkgbuild/worktrees/main/lib/libsys/_umtx_op_err.c:36
> #2  0x20115da8 in __thr_rwlock_rdlock (rwlock=0x4, rwlock@entry=0x20137c40, flags=3, tsp=<optimized out>) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_umtx.c:294
> #3  0x2010ebf4 in _thr_rwlock_rdlock (rwlock=0x20137c40, flags=0, tsp=0x0) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_umtx.h:229
> #4  _thr_rtld_rlock_acquire (lock=0x20137c40) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_rtld.c:121
> #5  0x20060788 in rlock_acquire (lock=0x2008af10 <rtld_locks>, lockstate=lockstate@entry=0xffffd114) at /home/pkgbuild/worktrees/main/libexec/rtld-elf/rtld_lock.c:259
> #6  0x20059098 in _rtld_bind (obj=0x2008f404, reloff=496) at /home/pkgbuild/worktrees/main/libexec/rtld-elf/rtld.c:1035
> #7  0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
> #8  0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
> #9  0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
> . . .
>
> It does not seem significantly different than I'd reported
> for the hungup state.
>
> An issue here is that the pkgbase world possibly is -O2 based
> despite having debug information (but is stripped). This can
> make details less reliable. So, for example, the rwlock=0x4
> vs. rwlock@entry=0x20137c40 for __thr_rwlock_rdlock could well
> be suspect.
>

IMHO, -O2 shouldn't be able to modify function arguments for public 
functions, so <guessing> this memory corruption fits perfectly with the 
observed behavior</guessing>.

But, out of curiosity, a quick look at _thr_rwlock_tryrdlock() in 
thr_umtx.h:208 makes me wonder: how is the "state" variable inside the 
loop guaranteed to be updated? IMHO nothing inside the loop tells the 
compiler that global memory may have been modified, so the compiler is 
free to hoist the assignment to "state" out of the loop.
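
To illustrate the concern (a minimal sketch only, not the libthr code): 
the names demo_rwlock, demo_tryrdlock and the DEMO_* constants below are 
hypothetical, and C11 atomics are assumed in place of whatever 
primitives thr_umtx.h actually uses. Here every read of the lock word is 
an explicit atomic operation, so the compiler cannot keep a stale copy 
of "state" across iterations:

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real libthr definitions. */
#define DEMO_WRITE_OWNER	0x80000000u
#define DEMO_MAX_READERS	0x1fffffffu
#define DEMO_READER_COUNT(s)	((s) & DEMO_MAX_READERS)

struct demo_rwlock {
	_Atomic uint32_t state;	/* write-owner flag + reader count */
};

/*
 * Try to take a read lock without blocking.
 * Returns 0 on success, EBUSY if a writer owns the lock,
 * EAGAIN if the reader count is already at its maximum.
 */
static int
demo_tryrdlock(struct demo_rwlock *lk)
{
	uint32_t state;

	/* Explicit atomic load: the compiler must really read memory. */
	state = atomic_load_explicit(&lk->state, memory_order_relaxed);
	while ((state & DEMO_WRITE_OWNER) == 0) {
		if (DEMO_READER_COUNT(state) == DEMO_MAX_READERS)
			return (EAGAIN);
		/*
		 * On failure the compare-exchange stores the value it
		 * found into "state", so the loop condition is
		 * re-evaluated against fresh data on every iteration.
		 */
		if (atomic_compare_exchange_weak_explicit(&lk->state,
		    &state, state + 1,
		    memory_order_acquire, memory_order_relaxed))
			return (0);
	}
	return (EBUSY);
}
```

Whether the primitive used in thr_umtx.h already gives an equivalent 
guarantee is exactly the open question; the sketch only shows one form 
in which the reload cannot be optimized away.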

Kib, please, do you have any comment on this?

Michal Meloun