Re: Hang ast / pipelk / piperd

From: Mark Johnston <markj_at_freebsd.org>
Date: Mon, 30 May 2022 14:15:43 UTC

On Mon, May 30, 2022 at 12:19:15AM +0200, Paul Floyd wrote:
> 
> On 5/27/22 22:13, Paul Floyd wrote:
> >
> > Hi
> >
> > I'm debugging two issues with Valgrind on FreeBSD 13.1 and 14, one on 
> > amd64 and one on i386.
> >
> ...
> > Both hangs seem quite sensitive to timing - in both cases adding or 
> > changing nanosleep times seems to make them no longer hang.
> > Adding debug statements to Valgrind can also change the behaviour 
> > (and is also unsafe when not holding the scheduler lock). Does this 
> > look like a kernel bug?
> 
> [...]
>
> Under gdb I see (and this is quite variable)
> 
> (gdb) info thread
>    Id   Target Id                 Frame
> * 1    LWP 100073 of process 861 vgModuleLocal_do_syscall_for_client_WRK 
> () at m_syswrap/syscall-amd64-freebsd.S:135
>    2    LWP 100215 of process 861 
> vgModuleLocal_do_syscall_for_client_WRK () at 
> m_syswrap/syscall-amd64-freebsd.S:135
>    3    LWP 100216 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    4    LWP 100217 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    5    LWP 100218 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    6    LWP 100219 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    7    LWP 100220 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    8    LWP 100221 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    9    LWP 100222 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    10   LWP 100223 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    11   LWP 100224 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    12   LWP 100225 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    13   LWP 100226 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    14   LWP 100227 of process 861 0x00000000380bffac in do_syscall_WRK ()
>    15   LWP 100228 of process 861 0x00000000380bffac in do_syscall_WRK ()
> 
> do_syscall_WRK is the syscall interface for the Valgrind host, and that 
> will be the threads waiting for the lock.
> 
> Threads 1 and 2 are in do_syscall_for_client, the interface for guest
> syscalls. Thread 1 is doing a _umtx_op syscall, for the pthread_join. 
> Thread 2 is doing a nanosleep. These are blocking syscalls so they
> release the lock before making the syscall to allow other threads to
> execute.
> 
> I think that in the snapshot above, the lock is released and one
> of threads 3 to 15 should be obtaining the lock and running.
> 
> That's where the kernel context switch / AST seems to be going wrong.
> 
> I don't see anything particularly wrong on the Valgrind side.
> 
> Any ideas what I can do to see why the context switch is hanging?

"procstat -kk <valgrind PID>" might help to reveal what's going on,
since it sounds like the hang/livelock is happening somewhere in the
kernel.
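
Since the hang is timing sensitive, a single snapshot may not be enough.
Something like the following rough sketch could be used to take several
kernel stack dumps in a row and compare them. The function name
snapshot_kstacks, the PROCSTAT override variable and the default count
and delay are made up for illustration; only "procstat -kk <PID>" itself
comes from the suggestion above.

```shell
#!/bin/sh
# Sketch: take repeated kernel-stack snapshots of a (possibly hung) process.
# PROCSTAT is a hypothetical override so the loop can be dry-run elsewhere;
# on FreeBSD it defaults to the real procstat(1).
PROCSTAT="${PROCSTAT:-procstat}"

# snapshot_kstacks PID [COUNT] [DELAY_SECONDS]
snapshot_kstacks() {
    pid="$1"
    count="${2:-3}"    # number of snapshots (illustrative default)
    delay="${3:-1}"    # seconds between snapshots (illustrative default)
    i=1
    while [ "$i" -le "$count" ]; do
        echo "=== snapshot $i ==="
        # -kk prints the kernel stack of every thread, with function offsets
        $PROCSTAT -kk "$pid"
        i=$((i + 1))
        # only sleep between snapshots, not after the last one
        if [ "$i" -le "$count" ]; then
            sleep "$delay"
        fi
    done
    return 0
}
```

For the process in the gdb session above that would be, for example,
"snapshot_kstacks 861 5 2 > kstacks.txt". If the threads show the same
kernel stacks in every snapshot, that points towards them being blocked
rather than spinning.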