Re: Hang ast / pipelk / piperd

From: Jan Mikkelsen <janm_at_transactionware.com>
Date: Thu, 02 Jun 2022 10:49:45 UTC

> On 1 Jun 2022, at 23:27, Paul Floyd <paulf2718@gmail.com> wrote:
> 
> 
> 
> On 6/1/22 14:16, Mark Johnston wrote:
>> On Mon, May 30, 2022 at 09:35:05PM +0200, Paul Floyd wrote:
>>> 
>>> 
>>> On 5/30/22 14:15, Mark Johnston wrote:
>>> 
>>>> "procstat -kk <valgrind PID>" might help to reveal what's going on,
>>>> since it sounds like the hang/livelock is happening somewhere in the
>>>> kernel.
>>> 
>>> Not knowing much about the kernel, my guess is that this is related to
>>> 
>>> commit 4808bab7fa6c3ec49b49476b8326d7a0250a03fa
>>> Author: Alexander Motin <mav@FreeBSD.org>
>>> Date:   Tue Sep 21 18:14:22 2021 -0400
>>> 
>>>      sched_ule(4): Improve long-term load balancer.
>>> 
>>> and this bit of ast code
>>> 
>>> doreti_ast:
>>> 	/*
>>> 	 * Check for ASTs atomically with returning.  Disabling CPU
>>> 	 * interrupts provides sufficient locking even in the SMP case,
>>> 	 * since we will be informed of any new ASTs by an IPI.
>>> 	 */
>>> 	cli
>>> 	movq	PCPU(CURTHREAD),%rax
>>> 	testl	$TDF_ASTPENDING | TDF_NEEDRESCHED,TD_FLAGS(%rax)
>>> 	je	doreti_exit
>>> 	sti
>>> 	movq	%rsp,%rdi	/* pass a pointer to the trapframe */
>>> 	call	ast
>>> 	jmp	doreti_ast
>>> 
>>> 
>>> The above commit seems to be migrating loaded threads to another CPU.
>> How did you infer that?  The long-term load balancer should be running
>> fairly infrequently.
> 
> 
> Well, one thread is hung in
> mi_switch+0xc2 ast+0x1e6 doreti_ast+0x1f
> 
> I found the above code referring to doreti_ast with grep and thought that if the test always fails then it will loop forever. So I searched the git history for commits mentioning TDF_ASTPENDING and TDF_NEEDRESCHED, found a couple touching these flags and the scheduler, and among them the commit above and this follow-up:
> 
> commit 11f14b33629e552a451fdbfe653ebb0addd27700
> Author: Alexander Motin <mav@FreeBSD.org>
> Date:   Sun Sep 26 12:03:05 2021 -0400
> 
>    sched_ule(4): Fix hang with steal_thresh < 2.
> 
>    e745d729be60 caused infinite loop with interrupts disabled in load
>    stealing code if steal_thresh set below 2.  Such configuration should
>    not generally be used, but appeared some people are using it to
>    workaround some problems.
> 
> and I guessed they might be related.
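> 
> To spell out my reading of that assembly, here is a standalone C model of
> the doreti_ast control flow. It is only a sketch: the flag values, the
> stubbed ast() and the interrupt helpers are all made up, not kernel source.
> 
> /*
>  * Pseudocode model of the doreti_ast loop: re-check the AST flags with
>  * interrupts disabled and keep calling ast() until they clear.  If the
>  * flags never clear, the loop spins forever, which is what the hung
>  * thread's stack (mi_switch+0xc2 ast+0x1e6 doreti_ast+0x1f) looks like.
>  */
> #include <stdio.h>
> 
> #define TDF_ASTPENDING   0x1            /* placeholder values */
> #define TDF_NEEDRESCHED  0x2
> 
> struct thread { volatile int td_flags; };
> 
> static void disable_intr(void) { }      /* stands in for cli */
> static void enable_intr(void) { }       /* stands in for sti */
> 
> /* The real ast() handles the pending work and may call mi_switch();
>  * this model just clears the flags. */
> static void ast(struct thread *td)
> {
>         td->td_flags &= ~(TDF_ASTPENDING | TDF_NEEDRESCHED);
> }
> 
> static void doreti_ast(struct thread *td)
> {
>         for (;;) {
>                 disable_intr();                         /* cli */
>                 if ((td->td_flags &
>                     (TDF_ASTPENDING | TDF_NEEDRESCHED)) == 0)
>                         return;                         /* je doreti_exit */
>                 enable_intr();                          /* sti */
>                 ast(td);                                /* call ast */
>         }                                               /* jmp doreti_ast */
> }
> 
> int main(void)
> {
>         struct thread td = { .td_flags = TDF_NEEDRESCHED };
> 
>         doreti_ast(&td);
>         printf("flags cleared, would return via doreti_exit\n");
>         return 0;
> }
> 
> (As an aside, the steal_thresh tunable the follow-up commit mentions should
> be readable as kern.sched.steal_thresh via sysctl, assuming SCHED_ULE and
> that I have the OID name right.)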
> 
>> As a side note, I think we are missing ktrcsw() calls in some places,
>> e.g., in turnstile_wait().
>>> My test system is a VirtualBox amd64 FreeBSD 13.1 with one CPU running
>>> on a 13.0 host.
>>> 
>>> I just tried restarting the VM with 2 CPUs and the testcase seems to be
>>> a lot better - it's been running in a loop for 10 minutes whereas
>>> previously it would hang at least 1 in 5 times.
>> Hmm.  Could you, please, show the ktrace output with -H -T passed to
>> kdump(1), together with fresh "procstat -kk" output?
>> The fact that the problem apparently only occurs with 1 CPU suggests a
>> scheduler bug, indeed.
> 
> 
> Here is the fresh procstat output
> 
> paulf@freebsd:~ $ procstat -kk 12339
>  PID    TID COMM                TDNAME              KSTACK 
> 12339 100089 none-amd64-freebsd  -                   mi_switch+0xc2 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2 umtxq_sleep+0x143 do_wait+0x3e5 __umtx_op_wait+0x53 sys__umtx_op+0x7e amd64_syscall+0x10c fast_syscall_common+0xf8
> 
> 12339 100582 none-amd64-freebsd  -                   mi_switch+0xc2 sleepq_catch_signals+0x2e6 sleepq_timedwait_sig+0x12 _sleep+0x1d1 kern_clock_nanosleep+0x1c1 sys_nanosleep+0x3b amd64_syscall+0x10c fast_syscall_common+0xf8
> 
> 12339 100583 none-amd64-freebsd  -                   mi_switch+0xc2 ast+0x1e6 doreti_ast+0x1f
> 
> 12339 100584 none-amd64-freebsd  -                   mi_switch+0xc2 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2 pipe_read+0x3d6 dofileread+0x81 sys_read+0xbc amd64_syscall+0x10c fast_syscall_common+0xf8
> 
> 12339 100585 none-amd64-freebsd  -                   mi_switch+0xc2 intr_event_handle+0x167 intr_execute_handlers+0x4b Xapic_isr1+0xdc sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2 pipe_read+0x3d6 dofileread+0x81 sys_read+0xbc amd64_syscall+0x10c fast_syscall_common+0xf8
> [ … ]

All these mi_switch+0xc2 hangs reminded me of something I saw once on 13.1-RC2 back in April. The machine was running five concurrent “make -j32 installworld” processes.

The machine hung and disk activity stopped. Here are the results of ^T on various running commands:

^T on a “tail -F” command:

load: 1.93  cmd: tail 27541 [zfs teardown inactive] 393.65r 0.06u 0.10s 0% 2548k
mi_switch+0xc2 _sleep+0x1fc rms_rlock_fallback+0x90 zfs_freebsd_reclaim+0x26 VOP_RECLAIM_APV+0x1f vgonel+0x342 vnlru_free_impl+0x2f7 vn_alloc_hard+0xc8 getnewvnode_reserve+0x93 zfs_zget+0x22 zfs_dirent_lookup+0x16b zfs_dirlook+0x7a zfs_lookup+0x3d0 zfs_cache_lookup+0xa9 VOP_LOOKUP+0x30 cache_fplookup_noentry+0x1a3 cache_fplookup+0x366 namei+0x12a 

^T on a zsh doing a cd to a UFS directory:

load: 0.48  cmd: zsh 86937 [zfs teardown inactive] 84663.01r 0.06u 0.01s 0% 6412k
mi_switch+0xc2 _sleep+0x1fc rms_rlock_fallback+0x90 zfs_freebsd_reclaim+0x26 VOP_RECLAIM_APV+0x1f vgonel+0x342 vnlru_free_impl+0x2f7 vn_alloc_hard+0xc8 getnewvnode_reserve+0x93 zfs_zget+0x22 zfs_dirent_lookup+0x16b zfs_dirlook+0x7a zfs_lookup+0x3d0 zfs_cache_lookup+0xa9 lookup+0x45c namei+0x259 kern_statat+0xf3 sys_fstatat+0x2f 


^T on an attempt to start gstat:

load: 0.17  cmd: gstat 63307 [ufs] 298.29r 0.00u 0.00s 0% 228k
mi_switch+0xc2 sleeplk+0xf6 lockmgr_slock_hard+0x3e7 ffs_lock+0x6c _vn_lock+0x48 vget_finish+0x21 cache_lookup+0x26c vfs_cache_lookup+0x7b lookup+0x45c namei+0x259 vn_open_cred+0x533 kern_openat+0x283 amd64_syscall+0x10c fast_syscall_common+0xf8 

A short press of the system power button did nothing.

The installworld target directories were on a ZFS pool backed by a single mirror of two SATA SSDs.

I’m not sure it’s related, since the rest of the stack traces are different, but seeing mi_switch+0xc2 everywhere jogged my memory.

Regards,

Jan M.