Re: Hang ast / pipelk / piperd

From: Mark Johnston <markj_at_freebsd.org>
Date: Wed, 01 Jun 2022 14:16:46 UTC

On Mon, May 30, 2022 at 09:35:05PM +0200, Paul Floyd wrote:
> 
> 
> On 5/30/22 14:15, Mark Johnston wrote:
> 
> > "procstat -kk <valgrind PID>" might help to reveal what's going on,
> > since it sounds like the hang/livelock is happening somewhere in the
> > kernel.
> 
> Not knowing much about the kernel, my guess is that this is related to
> 
> commit 4808bab7fa6c3ec49b49476b8326d7a0250a03fa
> Author: Alexander Motin <mav@FreeBSD.org>
> Date:   Tue Sep 21 18:14:22 2021 -0400
> 
>      sched_ule(4): Improve long-term load balancer.
> 
> and this bit of ast code
> 
> doreti_ast:
> 	/*
> 	 * Check for ASTs atomically with returning.  Disabling CPU
> 	 * interrupts provides sufficient locking even in the SMP case,
> 	 * since we will be informed of any new ASTs by an IPI.
> 	 */
> 	cli
> 	movq	PCPU(CURTHREAD),%rax
> 	testl	$TDF_ASTPENDING | TDF_NEEDRESCHED,TD_FLAGS(%rax)
> 	je	doreti_exit
> 	sti
> 	movq	%rsp,%rdi	/* pass a pointer to the trapframe */
> 	call	ast
> 	jmp	doreti_ast
> 
> 
> The above commit seems to be migrating loaded threads to another CPU.

How did you infer that?  The long-term load balancer should be running
fairly infrequently.

As a side note, I think we are missing ktrcsw() calls in some places,
e.g., in turnstile_wait().

> My test system is a VirtualBox amd64 FreeBSD 13.1 with one CPU running 
> on a 13.0 host.
> 
> I just tried restarting the VM with 2 CPUs and the testcase seems to be 
> a lot better - it's been running in a loop for 10 minutes whereas 
> previously it would hang at least 1 in 5 times.

Hmm.  Could you please show the ktrace output with -H -T passed to
kdump(1), together with fresh "procstat -kk" output?
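
Something along these lines would collect both.  This is only a sketch:
the function name and the output file names are made up, and it assumes
the test was started under ktrace(1), so that a trace file (ktrace.out
by default) already exists.

```sh
# Hypothetical helper; the function name and output file names are
# illustrative only.
collect_hang_info() {
    pid=$1                      # PID of the stuck valgrind process
    trfile=${2:-ktrace.out}     # trace file written earlier by ktrace(1)

    # Kernel stacks, with function offsets, for every thread in the process.
    procstat -kk "$pid" > procstat-kk.txt

    # Decode the trace: -H adds the thread ID to each record,
    # -T prints absolute timestamps.
    kdump -H -T -f "$trfile" > kdump-HT.txt
}
```

Run it as "collect_hang_info <valgrind PID>" while the process is still
hung, so that the procstat output reflects the stuck state.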

The fact that the problem apparently only occurs with 1 CPU does indeed
suggest a scheduler bug.