Re: swap_pager: cannot allocate bio

From: Mark Johnston <markj_at_freebsd.org>
Date: Sat, 20 Nov 2021 18:23:06 UTC
On Fri, Nov 19, 2021 at 10:35:52PM -0500, Chris Ross wrote:
> (Sorry that the subject on this thread may not be relevant any more, but I don’t want to disconnect the thread.)
> 
> > On Nov 15, 2021, at 13:17, Chris Ross <cross+freebsd@distal.com> wrote:
> >> On Nov 15, 2021, at 10:08, Andriy Gapon <avg@freebsd.org> wrote:
> > 
> >> Yes, I propose to remove the wait for ARC evictions from arc_lowmem().
> >> 
> >> Another thing that may help a bit is having a greater "slack" between a threshold where the page daemon starts paging out and a threshold where memory allocations start to wait (via vm_wait_domain).
> >> 
> >> Also, I think that for a long time we had a problem (but not sure if it's still present) where allocations succeeded without waiting until the free memory went below a certain threshold M, but once a thread started waiting in vm_wait it would not be woken up until the free memory went above another threshold N.  And the problem was that N >> M.  In other words, a lot of memory had to be freed (and not grabbed by other threads) before the waiting thread would be woken up.
> > 
> > Thank you both for your inputs.  Let me know if you’d like me to try anything, and I’ll kick (reboot) the system and can build a new kernel when you’d like.  I did get another procstat -kka out of it this morning, and the system has since become less responsive, but I assume that new procstat won’t show anything last night’s didn’t.
> 
> I’m still having this issue.  I rebooted the machine, fsck’d the disks, and got it running again.  Again, it ran for ~50 hours before getting stuck.  I got another procstat -kka off of it; let me know if you’d like a copy of it.  But it looks like the active processes are all in arc_wait_for_eviction.  The pagedaemon is in arc_wait_for_eviction under arc_lowmem, but the python processes that were doing the real work don’t have arc_lowmem in their stacks, just arc_wait_for_eviction.
> 
> Please let me know if there’s anything I can do to assist in finding a remedy for this.  Thank you.
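
As an aside, a minimal sketch of the wait/wakeup hysteresis described
above (the names and thresholds here are hypothetical; the real logic
lives in sys/vm):

/*
 * Allocations only start to wait once free memory drops below a low
 * threshold M, but a thread sleeping in vm_wait is not woken until free
 * memory climbs back above a higher threshold N.  With N >> M, a lot of
 * memory must be freed (and not grabbed by other threads) before the
 * waiter runs again.
 */
static int
alloc_must_wait(unsigned long free_pages, unsigned long m_thresh)
{
	return (free_pages < m_thresh);
}

static int
waiter_may_wake(unsigned long free_pages, unsigned long n_thresh)
{
	return (free_pages >= n_thresh);
}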

Here is a patch which tries to address the proximate cause of the
problem.  It would be helpful to know if it addresses the deadlocks
you're seeing.  I tested it lightly by putting a NUMA system under
memory pressure using postgres.
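
In rough terms, the cycle being targeted (a simplified sketch, not the
exact call chains): the page daemon winds up in arc_lowmem() waiting for
ARC eviction, while arc_evict_state()'s marker allocations are made with
KM_SLEEP and can themselves sleep waiting for free pages, so neither
side makes progress.  The patch adds M_USE_RESERVE to KM_PUSHPAGE and
passes KM_PUSHPAGE on the eviction path, so those allocations can dip
into the reserve instead of sleeping.  A hypothetical caller on that
path would look like:

/*
 * Hypothetical illustration, not part of the patch; assumes the SPL
 * kmem.h definitions changed below.  KM_SLEEP alone maps to M_WAITOK
 * and can block waiting for the page daemon, which may itself be
 * blocked in arc_lowmem(); KM_PUSHPAGE now adds M_USE_RESERVE so the
 * allocation can be satisfied from the reserve.
 */
static void *
eviction_path_alloc(size_t size)
{
	return (kmem_zalloc(size, KM_SLEEP | KM_PUSHPAGE));
}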

diff --git a/sys/contrib/openzfs/include/os/freebsd/spl/sys/kmem.h b/sys/contrib/openzfs/include/os/freebsd/spl/sys/kmem.h
index dc3b4f5d7877..4792a0b29ecf 100644
--- a/sys/contrib/openzfs/include/os/freebsd/spl/sys/kmem.h
+++ b/sys/contrib/openzfs/include/os/freebsd/spl/sys/kmem.h
@@ -45,7 +45,7 @@ MALLOC_DECLARE(M_SOLARIS);
 #define	POINTER_INVALIDATE(pp)	(*(pp) = (void *)((uintptr_t)(*(pp)) | 0x1))
 
 #define	KM_SLEEP		M_WAITOK
-#define	KM_PUSHPAGE		M_WAITOK
+#define	KM_PUSHPAGE		(M_WAITOK | M_USE_RESERVE) /* XXXMJ */
 #define	KM_NOSLEEP		M_NOWAIT
 #define	KM_NORMALPRI		0
 #define	KMC_NODEBUG		UMA_ZONE_NODUMP
diff --git a/sys/contrib/openzfs/module/zfs/arc.c b/sys/contrib/openzfs/module/zfs/arc.c
index 79e2d4381830..50cd45d76c52 100644
--- a/sys/contrib/openzfs/module/zfs/arc.c
+++ b/sys/contrib/openzfs/module/zfs/arc.c
@@ -4188,11 +4188,13 @@ arc_evict_state(arc_state_t *state, uint64_t spa, uint64_t bytes,
 	 * pick up where we left off for each individual sublist, rather
 	 * than starting from the tail each time.
 	 */
-	markers = kmem_zalloc(sizeof (*markers) * num_sublists, KM_SLEEP);
+	markers = kmem_zalloc(sizeof (*markers) * num_sublists,
+	    KM_SLEEP | KM_PUSHPAGE);
 	for (int i = 0; i < num_sublists; i++) {
 		multilist_sublist_t *mls;
 
-		markers[i] = kmem_cache_alloc(hdr_full_cache, KM_SLEEP);
+		markers[i] = kmem_cache_alloc(hdr_full_cache,
+		    KM_SLEEP | KM_PUSHPAGE);
 
 		/*
 		 * A b_spa of 0 is used to indicate that this header is
diff --git a/sys/vm/uma_core.c b/sys/vm/uma_core.c
index 7b83d81a423d..3fc7859387e0 100644
--- a/sys/vm/uma_core.c
+++ b/sys/vm/uma_core.c
@@ -3932,7 +3932,8 @@ keg_fetch_slab(uma_keg_t keg, uma_zone_t zone, int rdomain, const int flags)
 		vm_domainset_iter_policy_ref_init(&di, &keg->uk_dr, &domain,
 		    &aflags);
 	} else {
-		aflags = flags;
+		aflags = (flags & M_USE_RESERVE) != 0 ?
+		    (flags & ~M_WAITOK) | M_NOWAIT : flags;
 		domain = rdomain;
 	}