Re: armv7-on-aarch64 stuck at urdlck
- Reply: Konstantin Belousov : "Re: armv7-on-aarch64 stuck at urdlck"
- In reply to: Mark Millard : "Re: armv7-on-aarch64 stuck at urdlck"
Date: Tue, 23 Jul 2024 07:53:41 UTC
On 23.07.2024 5:49, Mark Millard wrote:
> On Jul 22, 2024, at 12:36, Michal Meloun <meloun.michal@gmail.com> wrote:
>> On 22. 7. 2024 19:27, Mark Millard wrote:
>>> On Jul 22, 2024, at 09:41, meloun.michal@gmail.com wrote:
>>>
>>>
>>>> On 22.07.2024 18:26, Mark Millard wrote:
>>>>
>>>>> On Jul 22, 2024, at 06:40, Michal Meloun <meloun.michal@gmail.com> wrote:
>>>>>
>>>>>> On 22.07.2024 13:46, Mark Millard wrote:
>>>>>>
>>>>>>> On Jul 21, 2024, at 22:59, Michal Meloun <meloun.michal@gmail.com> wrote:
>>>>>>>
>>>>>>>> I don't want to hijack the original thread, so I'm replying in a new one.
>>>>>>>>
>>>>>>>> My tegra tracks current and has been running 24/7, building kernel/world and kde5 in a loop, for a few years now. But I have never encountered the aforementioned lockup in native armv7.
>>>>>>>>
>>>>>>>> I have seen a usermode mutex lockup in an arm32 jail on aarch64, but only very rarely (once a month or so), and all my attempts to reproduce it in a more deterministic way have failed. Also, I don't think I've ever seen this with the debug version of libc.
>>>>>>>>
>>>>>>>> Unfortunately I also failed to reproduce the given lockup using dlopen_test.c, either on native armv7 or in an arm32 jail.
>>>>>>>>
>>>>>>>> Michal Meloun
>>>>>>>>
>>>>>>> What is the output of:
>>>>>>> # readelf -a /libexec/ld-elf.so.1 | grep -E "(^[^ 0-9]|.*_rtld_get_stack_prot)"
>>>>>>> in your armv7 context(s)? Does it include the likes of:
>>>>>>> QUOTE
>>>>>>> Symbol table '.symtab' contains 911 entries:
>>>>>>> 903: 000000000001b9ac 16 FUNC GLOBAL DEFAULT 11 _rtld_get_stack_prot
>>>>>>> END QUOTE
>>>>>>> vs. not?
>>>>>>> Note that the "debug version of libc" being involved likely means that
>>>>>>> DEBUG_FLAGS was defined. That in turn likely means that strip is not
>>>>>>> being used. In such a case, I expect that the .symtab entry for
>>>>>>> _rtld_get_stack_prot (and more) exists for such a context.
>>>>>>>
>>>>>> At this time, I have the standard (thus stripped, non-debug) version of the runtime linker library installed. Thus it has only a dynamic relocation record for _rtld_get_stack_prot:
>>>>>>
>>>>>> root@tegra124:~/dlopen_test # readelf -a /libexec/ld-elf.so.1 | grep -E "(^[^ 0-9]|.*_rtld_get_stack_prot)"
>>>>>> ELF Header:
>>>>>> Elf file type is DYN (Shared object file)
>>>>>> Entry point 0x1449c
>>>>>> There are 10 program headers, starting at offset 52
>>>>>> Program Headers:
>>>>>> There are 23 section headers, starting at offset 0x1a448:
>>>>>> Section Headers:
>>>>>> Key to Flags:
>>>>>> Dynamic section at offset 0x19fa4 contains 15 entries:
>>>>>> Relocation section (.rel.dyn):
>>>>>> r_offset r_info r_type st_value st_name
>>>>>> Symbol table '.dynsym' contains 27 entries:
>>>>>> 5: 000000000001ba0c 16 FUNC GLOBAL DEFAULT 12 _rtld_get_stack_prot@@FBSDprivate_1.0 (11)
>>>>>> Notes at offset 0x00000174 with length 0x00000018:
>>>>>> Histogram for bucket list length (total of 6 buckets):
>>>>>> Histogram for bucket list length (total of 27 buckets):
>>>>>> Version symbol section (.gnu.version):
>>>>>> Version definition section (.gnu.version_d):
>>>>>> Attribute Section: aeabi
>>>>>>
>>>>>> ------
>>>>>>
>>>>>> root@tegra124:~/dlopen_test # ./dlopen_test
>>>>>> root@tegra124:~/dlopen_test #
>>>>>>
>>>>> Just to be sure . . .
>>>>> Did you at some point "pkg install cairo" (or analogous) so that
>>>>> the following (or some vintage) were in place?
>>>>> # ls -lodT /usr/local/lib/libcairo.so*
>>>>> lrwxr-xr-x 1 root wheel - 21 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so -> libcairo.so.2.11704.0
>>>>> lrwxr-xr-x 1 root wheel - 21 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so.2 -> libcairo.so.2.11704.0
>>>>> -rwxr-xr-x 1 root wheel - 1118272 Apr 29 19:45:15 2024 /usr/local/lib/libcairo.so.2.11704.0
>>>>> # file /usr/local/lib/libcairo.so.2.11704.0
>>>>> /usr/local/lib/libcairo.so.2.11704.0: ELF 32-bit LSB shared object, ARM, EABI5 version 1 (FreeBSD), dynamically linked, for FreeBSD 15.0 (1500018), stripped
>>>>> (Installing cairo would also install other things it needs.)
>>>>> For the failing contexts, the a.out from dlopen_test.c will only
>>>>> hang if the library (and what it requires) is actually there to
>>>>> load.
>>>>>
>>>> Yep, I have cairo installed (but compiled from sources, not installed by pkg). And I have verified that dlopen() returns success.
>>>> In the meantime I tried all combinations (debug/stripped) of ld_elf and libthr. All combinations work without problems on the native system and in the arm32 jail.
>>>>
>>> Thanks for the information. My personal builds, which are the
>>> ones that work in my testing, are built on aarch64 as armv7
>>> instead of on amd64. The known failing ones are built on amd64.
>>> But I've no more specific information suggesting a tie to the
>>> type of build host for the world used.
>>>
>>>
>>>> Btw, gdb has long had problems with stepping inside ld_elf. It's better to run the test program without it and connect to the test program to get the "correct" stack trace.
>>>>
>>>>
>>> In part I was deliberately exploring what sequence leads to the
>>> hangups vs. lack of hangups and the like: more context than a
>>> backtrace of the stuck state can provide.
>>>
>>> But doing "./a.out &" and then "gdb -p..."
>>> to attach to it:
>>>
>>> _umtx_op () at _umtx_op.S:4
>>>
>>> warning: 4 _umtx_op.S: No such file or directory
>>> (gdb) bt
>>> #0 _umtx_op () at _umtx_op.S:4
>>> #1 0x2036845c in _umtx_op_err (obj=0x4, op=12, val=0, uaddr=0x0, uaddr2=0x0) at /home/pkgbuild/worktrees/main/lib/libsys/_umtx_op_err.c:36
>>> #2 0x20115da8 in __thr_rwlock_rdlock (rwlock=0x4, rwlock@entry=0x20137c40, flags=3, tsp=<optimized out>) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_umtx.c:294
>>> #3 0x2010ebf4 in _thr_rwlock_rdlock (rwlock=0x20137c40, flags=0, tsp=0x0) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_umtx.h:229
>>> #4 _thr_rtld_rlock_acquire (lock=0x20137c40) at /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_rtld.c:121
>>> #5 0x20060788 in rlock_acquire (lock=0x2008af10 <rtld_locks>, lockstate=lockstate@entry=0xffffd114) at /home/pkgbuild/worktrees/main/libexec/rtld-elf/rtld_lock.c:259
>>> #6 0x20059098 in _rtld_bind (obj=0x2008f404, reloff=496) at /home/pkgbuild/worktrees/main/libexec/rtld-elf/rtld.c:1035
>>> #7 0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
>>> #8 0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
>>> #9 0x2005483c in _rtld_bind_start () at /home/pkgbuild/worktrees/main/libexec/rtld-elf/arm/rtld_start.S:89
>>> . . .
>>>
>>> It does not seem significantly different than I'd reported
>>> for the hungup state.
>>>
>>> An issue here is that the pkgbase world possibly is -O2 based
>>> despite having debug information (but is stripped). This can
>>> make details less reliable. So, for example, the rwlock=0x4
>>> vs. rwlock@entry=0x20137c40 for __thr_rwlock_rdlock could well
>>> be suspect.
>>>
>>>
>>
>> IMHO, -O2 shouldn't be able to modify function arguments for public functions, so <guessing> this memory corruption fits perfectly with the observed behavior</guessing>.
>
> It is not a memory corruption.
> r0 is "argument 1/scratch register/result" and
> the code in question in my example is (__thr_rwlock_rdlock via disass /s use):
>
> 280 {
> 0x20115d50 <+0>: push {r11, lr}
> 0x20115d54 <+4>: mov r11, sp
> 0x20115d58 <+8>: sub sp, sp, #32
> 0x20115d5c <+12>: mov r12, r1
> . . .
> 291 tm_p = &timeout;
> 292 tm_size = sizeof(timeout);
> 293 }
> 294 return (_umtx_op_err(rwlock, UMTX_OP_RW_RDLOCK, flags,
> 0x20115d98 <+72>: str r1, [sp]
> 0x20115d9c <+76>: mov r1, #12
> 0x20115da0 <+80>: mov r2, r12
> 0x20115da4 <+84>: bl 0x201167a0
> => 0x20115da8 <+88>: mov sp, r11
> 0x20115dac <+92>: pop {r11, pc}
>
> After the "bl 0x201167a0" the value of r0 is the return
> value from 0x201167a0, not the first argument value
> for 0x20115d50 . A better reporting would indicate that
> rwlock was <optimized out> at that point: locally
> the value has not been preserved at that point because
> there is no more use of the value.
>
> But such is the kind of thing I expect to run into for
> the likes of -O2 use with debug information.
>
> Anyway, _umtx_op_err returned the 0x4 value that is shown
> for rwlock .
>
Yes, right, of course. Sorry for the noise.

The good news is that I can finally produce both working and locking test cases. The deciding factor (at least for me) is whether "-mcpu" is used when compiling libthr (e.g. indirectly injected via CPUTYPE in /etc/make.conf). If it is not used, libthr is broken (regardless of -O level or debug/normal build), but -mcpu=cortex-a15 always produces a working libthr.
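
For anyone who wants to try this without the original dlopen_test.c at hand, here is a minimal sketch of that kind of test. It is an assumption about the general shape of such a test, not the file discussed in this thread: the helper name try_dlopen() is made up for illustration, and whether it triggers the hang on a given libthr build is not established here.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/*
 * Hypothetical helper, NOT the dlopen_test.c from this thread.
 * Loads the shared object at `path` with lazy binding, reports the
 * result, and unloads it again.
 * Returns 0 if dlopen() succeeded, -1 otherwise.
 */
int
try_dlopen(const char *path)
{
	void *handle;
	const char *err;

	assert(path != NULL);

	handle = dlopen(path, RTLD_LAZY);
	if (handle == NULL) {
		/* dlerror() may return NULL if no error text is pending. */
		err = dlerror();
		fprintf(stderr, "dlopen(%s) failed: %s\n", path,
		    err != NULL ? err : "unknown error");
		return (-1);
	}
	printf("dlopen(%s) succeeded\n", path);
	dlclose(handle);
	return (0);
}
```

Calling try_dlopen("/usr/local/lib/libcairo.so.2") from a small main(), built as armv7 and run once against a libthr compiled with -mcpu and once against one compiled without it, would be one way to compare the two cases. As noted earlier in the thread, the library and everything it requires must actually be installed, otherwise dlopen() simply fails and nothing can hang.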