svn commit: r330296 - in user/markj/vm-playground/sys: amd64/include kern vm

Mark Johnston markj at FreeBSD.org
Fri Mar 2 21:50:03 UTC 2018


Author: markj
Date: Fri Mar  2 21:50:02 2018
New Revision: 330296
URL: https://svnweb.freebsd.org/changeset/base/330296

Log:
  Add queues for batching page queue operations and page frees.
  
  As an alternative to the approach taken in r328860, relax the locking
  protocol for the queue field of struct vm_page and introduce per-CPU
  batch queues for each page queue and for each free queue. This approach
  reduces lock contention for enqueue, dequeue, and requeue operations
  by separating logical and physical queue state. In general, logical
  queue state is protected by the page lock, while physical queue state
  is protected by a page queue lock. Queue state is encoded in the
  queue and aflags fields. When performing a queue operation on a page,
  the logical operation is performed first, with the page lock held,
  and the physical operation is deferred using a batch queue. Physical
  operations may be deferred indefinitely (in particular, until after
  the page has been freed), but at any given time only a small, bounded
  number of pages have logical and physical queue states that disagree.
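  
  The batch queues themselves are small, fixed-size per-CPU arrays; their
  definition lives in vm_pagequeue.h, whose diff is truncated below.  A
  plausible shape, consistent with the vm_batchqueue_init(),
  vm_batchqueue_insert(), and VM_BATCHQ_FOREACH() uses in vm_page.c and
  with the bq_cnt field referenced in vm_page_free_toq(), is roughly:
  
	struct vm_batchqueue {
		vm_page_t	bq_pa[VM_BATCHQUEUE_SIZE];	/* deferred pages */
		int		bq_cnt;				/* valid entries */
	};
  
	static inline void
	vm_batchqueue_init(struct vm_batchqueue *bq)
	{
  
		bq->bq_cnt = 0;
	}
  
	/* Returns true if the page fit in the batch, false if it was full. */
	static inline bool
	vm_batchqueue_insert(struct vm_batchqueue *bq, vm_page_t m)
	{
  
		if (bq->bq_cnt >= nitems(bq->bq_pa))
			return (false);
		bq->bq_pa[bq->bq_cnt++] = m;
		return (true);
	}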
  
  The queue state of pages is also now decoupled from the allocation
  state: pages may be freed without having been physically dequeued
  (though they must be logically dequeued). The page allocators ensure
  that a page has been physically dequeued before it is reused. One
  consequence of this is that page queue locks must now be leaf locks.
  As a result, active queue scanning is modified to work the same way
  that the inactive and laundry queue scans do, so the active queue lock
  is not held when calling into the pmap layer during a scan.
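  
  The allocator-side half of this contract is visible in the vm_page.c
  hunks below; condensed (not verbatim), each allocation path now ends
  with:
  
	vm_page_dequeue(m);	/* complete any pending physical dequeue */
	vm_page_alloc_check(m);	/* asserts queue == PQ_NONE and that no
				   PGA_QUEUE_STATE_MASK flags remain set */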
  
  The queue field now encodes the logical queue state of the page, and
  the new PGA_ENQUEUED flag indicates whether the page is physically
  enqueued. To update the queue field of a page, the queue lock for its
  old value must be held: the page queue lock if the value is not PQ_NONE,
  and the page lock otherwise. When performing such an update, either
  the new value or the old value must be PQ_NONE. To enqueue a page, the
  queue field
  is updated to the index of the queue; later, the page is physically
  enqueued while the page queue lock is held, and PGA_ENQUEUED is set.
  The PGA_ENQUEUED flag may only be set or cleared with the corresponding
  page queue lock held.
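  
  Condensed from vm_page_enqueue_lazy() and vm_pqbatch_process() in the
  vm_page.c diff below, the two halves of an enqueue look roughly like:
  
	/* Logical enqueue, page lock held; the old queue value is PQ_NONE. */
	m->queue = queue;
	(void)vm_batchqueue_insert(bq, m);	/* defer the physical insert */
  
	/* Physical enqueue, page queue lock held, possibly much later. */
	if ((m->aflags & PGA_ENQUEUED) == 0) {
		TAILQ_INSERT_TAIL(&pq->pq_pl, m, plinks.q);
		vm_pagequeue_cnt_inc(pq);
		vm_page_aflag_set(m, PGA_ENQUEUED);
	}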
  
  Logical dequeues and requeues are requested using the PGA_DEQUEUE and
  PGA_REQUEUE flags, respectively. Both must be set with the page lock
  held, and can only be cleared once the corresponding physical operation
  has been performed, with the page queue lock held. As mentioned above,
  pages must be at least logically dequeued before being freed.
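  
  Processing a deferred dequeue under the page queue lock then amounts to
  the following, condensed from vm_pqbatch_process() and
  vm_page_dequeue_locked() below:
  
	if ((m->aflags & PGA_DEQUEUE) != 0) {
		if ((m->aflags & PGA_ENQUEUED) != 0) {
			TAILQ_REMOVE(&pq->pq_pl, m, plinks.q);
			vm_pagequeue_cnt_dec(pq);
		}
		/*
		 * Update the queue field before clearing the aflags so that
		 * the page never appears to be enqueued with no dequeue
		 * pending; the page daemon may inspect it with only the
		 * page lock held.
		 */
		m->queue = PQ_NONE;
		atomic_thread_fence_rel();
		vm_page_aflag_clear(m, PGA_QUEUE_STATE_MASK);
	}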
  
  The inactive queue scanning algorithm is changed to exploit the relaxed
  locking protocol. Rather than acquire the inactive queue lock once
  per page during a scan, we collect and physically dequeue a batch of
  pages, which is then processed using only the page and object locks.
  Because logical queue state is encoded in the page's atomic flags, the
  page daemon can synchronize with a thread that is simultaneously
  freeing pages from the object.
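  
  The vm_pageout.c changes implementing this are truncated from the diff
  below.  Schematically, with hypothetical locals (batch, next, dequeued)
  rather than the names actually used, the collection step looks
  something like:
  
	vm_pagequeue_lock(pq);
	TAILQ_FOREACH_SAFE(m, &pq->pq_pl, plinks.q, next) {
		if (!vm_batchqueue_insert(&batch, m))
			break;			/* batch is full */
		TAILQ_REMOVE(&pq->pq_pl, m, plinks.q);
		vm_page_aflag_clear(m, PGA_ENQUEUED);
		dequeued++;
	}
	vm_pagequeue_cnt_add(pq, -dequeued);
	vm_pagequeue_unlock(pq);
  
	/* Process the batch holding only the page and object locks. */
	VM_BATCHQ_FOREACH(&batch, m) {
		/* ... per-page inactive scan work, as before ... */
	}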
  
  While this approach brings with it considerable complexity, it also
  allows some simplification of existing code. For instance, the page
  queue lock dance in vm_object_terminate() goes away since the dequeues
  are automatically batched. We also no longer need to use
  vm_pageout_fallback_object_lock() in the inactive queue scan. The
  laundry queue scan still uses it, but it is no longer required there
  either.

Modified:
  user/markj/vm-playground/sys/amd64/include/vmparam.h
  user/markj/vm-playground/sys/kern/subr_witness.c
  user/markj/vm-playground/sys/vm/vm_object.c
  user/markj/vm-playground/sys/vm/vm_page.c
  user/markj/vm-playground/sys/vm/vm_page.h
  user/markj/vm-playground/sys/vm/vm_pageout.c
  user/markj/vm-playground/sys/vm/vm_pagequeue.h
  user/markj/vm-playground/sys/vm/vm_phys.c

Modified: user/markj/vm-playground/sys/amd64/include/vmparam.h
==============================================================================
--- user/markj/vm-playground/sys/amd64/include/vmparam.h	Fri Mar  2 21:26:48 2018	(r330295)
+++ user/markj/vm-playground/sys/amd64/include/vmparam.h	Fri Mar  2 21:50:02 2018	(r330296)
@@ -227,4 +227,10 @@
 
 #define	ZERO_REGION_SIZE	(2 * 1024 * 1024)	/* 2MB */
 
+/*
+ * Use a fairly large batch size since we expect amd64 systems to have
+ * lots of memory.
+ */
+#define	VM_BATCHQUEUE_SIZE	31
+
 #endif /* _MACHINE_VMPARAM_H_ */

Modified: user/markj/vm-playground/sys/kern/subr_witness.c
==============================================================================
--- user/markj/vm-playground/sys/kern/subr_witness.c	Fri Mar  2 21:26:48 2018	(r330295)
+++ user/markj/vm-playground/sys/kern/subr_witness.c	Fri Mar  2 21:50:02 2018	(r330296)
@@ -601,7 +601,6 @@ static struct witness_order_list_entry order_lists[] =
 	 * CDEV
 	 */
 	{ "vm map (system)", &lock_class_mtx_sleep },
-	{ "vm pagequeue", &lock_class_mtx_sleep },
 	{ "vnode interlock", &lock_class_mtx_sleep },
 	{ "cdev", &lock_class_mtx_sleep },
 	{ NULL, NULL },
@@ -611,11 +610,11 @@ static struct witness_order_list_entry order_lists[] =
 	{ "vm map (user)", &lock_class_sx },
 	{ "vm object", &lock_class_rw },
 	{ "vm page", &lock_class_mtx_sleep },
-	{ "vm pagequeue", &lock_class_mtx_sleep },
 	{ "pmap pv global", &lock_class_rw },
 	{ "pmap", &lock_class_mtx_sleep },
 	{ "pmap pv list", &lock_class_rw },
 	{ "vm page free queue", &lock_class_mtx_sleep },
+	{ "vm pagequeue", &lock_class_mtx_sleep },
 	{ NULL, NULL },
 	/*
 	 * kqueue/VFS interaction

Modified: user/markj/vm-playground/sys/vm/vm_object.c
==============================================================================
--- user/markj/vm-playground/sys/vm/vm_object.c	Fri Mar  2 21:26:48 2018	(r330295)
+++ user/markj/vm-playground/sys/vm/vm_object.c	Fri Mar  2 21:50:02 2018	(r330296)
@@ -721,14 +721,11 @@ static void
 vm_object_terminate_pages(vm_object_t object)
 {
 	vm_page_t p, p_next;
-	struct mtx *mtx, *mtx1;
-	struct vm_pagequeue *pq, *pq1;
-	int dequeued;
+	struct mtx *mtx;
 
 	VM_OBJECT_ASSERT_WLOCKED(object);
 
 	mtx = NULL;
-	pq = NULL;
 
 	/*
 	 * Free any remaining pageable pages.  This also removes them from the
@@ -738,60 +735,23 @@ vm_object_terminate_pages(vm_object_t object)
 	 */
 	TAILQ_FOREACH_SAFE(p, &object->memq, listq, p_next) {
 		vm_page_assert_unbusied(p);
-		if ((object->flags & OBJ_UNMANAGED) == 0) {
+		if ((object->flags & OBJ_UNMANAGED) == 0)
 			/*
 			 * vm_page_free_prep() only needs the page
 			 * lock for managed pages.
 			 */
-			mtx1 = vm_page_lockptr(p);
-			if (mtx1 != mtx) {
-				if (mtx != NULL)
-					mtx_unlock(mtx);
-				if (pq != NULL) {
-					vm_pagequeue_cnt_add(pq, dequeued);
-					vm_pagequeue_unlock(pq);
-					pq = NULL;
-				}
-				mtx = mtx1;
-				mtx_lock(mtx);
-			}
-		}
+			vm_page_change_lock(p, &mtx);
 		p->object = NULL;
 		if (p->wire_count != 0)
-			goto unlist;
+			continue;
 		VM_CNT_INC(v_pfree);
 		p->flags &= ~PG_ZERO;
-		if (p->queue != PQ_NONE) {
-			KASSERT(p->queue < PQ_COUNT, ("vm_object_terminate: "
-			    "page %p is not queued", p));
-			pq1 = vm_page_pagequeue(p);
-			if (pq != pq1) {
-				if (pq != NULL) {
-					vm_pagequeue_cnt_add(pq, dequeued);
-					vm_pagequeue_unlock(pq);
-				}
-				pq = pq1;
-				vm_pagequeue_lock(pq);
-				dequeued = 0;
-			}
-			p->queue = PQ_NONE;
-			TAILQ_REMOVE(&pq->pq_pl, p, plinks.q);
-			dequeued--;
-		}
-		if (vm_page_free_prep(p, true))
-			continue;
-unlist:
-		TAILQ_REMOVE(&object->memq, p, listq);
+
+		vm_page_free(p);
 	}
-	if (pq != NULL) {
-		vm_pagequeue_cnt_add(pq, dequeued);
-		vm_pagequeue_unlock(pq);
-	}
 	if (mtx != NULL)
 		mtx_unlock(mtx);
 
-	vm_page_free_phys_pglist(&object->memq);
-
 	/*
 	 * If the object contained any pages, then reset it to an empty state.
 	 * None of the object's fields, including "resident_page_count", were
@@ -1974,7 +1934,6 @@ vm_object_page_remove(vm_object_t object, vm_pindex_t 
 {
 	vm_page_t p, next;
 	struct mtx *mtx;
-	struct pglist pgl;
 
 	VM_OBJECT_ASSERT_WLOCKED(object);
 	KASSERT((object->flags & OBJ_UNMANAGED) == 0 ||
@@ -1983,7 +1942,6 @@ vm_object_page_remove(vm_object_t object, vm_pindex_t 
 	if (object->resident_page_count == 0)
 		return;
 	vm_object_pip_add(object, 1);
-	TAILQ_INIT(&pgl);
 again:
 	p = vm_page_find_least(object, start);
 	mtx = NULL;
@@ -2038,12 +1996,10 @@ again:
 		if ((options & OBJPR_NOTMAPPED) == 0 && object->ref_count != 0)
 			pmap_remove_all(p);
 		p->flags &= ~PG_ZERO;
-		if (vm_page_free_prep(p, false))
-			TAILQ_INSERT_TAIL(&pgl, p, listq);
+		vm_page_free(p);
 	}
 	if (mtx != NULL)
 		mtx_unlock(mtx);
-	vm_page_free_phys_pglist(&pgl);
 	vm_object_pip_wakeup(object);
 }
 

Modified: user/markj/vm-playground/sys/vm/vm_page.c
==============================================================================
--- user/markj/vm-playground/sys/vm/vm_page.c	Fri Mar  2 21:26:48 2018	(r330295)
+++ user/markj/vm-playground/sys/vm/vm_page.c	Fri Mar  2 21:50:02 2018	(r330296)
@@ -131,13 +131,11 @@ extern int	uma_startup_count(int);
 extern void	uma_startup(void *, int);
 extern int	vmem_startup_count(void);
 
-/*
- *	Associated with page of user-allocatable memory is a
- *	page structure.
- */
-
 struct vm_domain vm_dom[MAXMEMDOM];
 
+static DPCPU_DEFINE(struct vm_batchqueue, pqbatch[MAXMEMDOM][PQ_COUNT]);
+static DPCPU_DEFINE(struct vm_batchqueue, freeqbatch[MAXMEMDOM]);
+
 struct mtx_padalign __exclusive_cache_line pa_lock[PA_LOCK_COUNT];
 
 /* The following fields are protected by the domainset lock. */
@@ -176,7 +174,7 @@ static uma_zone_t fakepg_zone;
 
 static void vm_page_alloc_check(vm_page_t m);
 static void vm_page_clear_dirty_mask(vm_page_t m, vm_page_bits_t pagebits);
-static void vm_page_enqueue(uint8_t queue, vm_page_t m);
+static void vm_page_enqueue_lazy(vm_page_t m, uint8_t queue);
 static void vm_page_free_phys(struct vm_domain *vmd, vm_page_t m);
 static void vm_page_init(void *dummy);
 static int vm_page_insert_after(vm_page_t m, vm_object_t object,
@@ -1814,6 +1812,7 @@ again:
 	KASSERT(m != NULL, ("missing page"));
 
 found:
+	vm_page_dequeue(m);
 	vm_page_alloc_check(m);
 
 	/*
@@ -1987,8 +1986,10 @@ again:
 	}
 
 done:
-	for (i = 0; i < nalloc; i++)
+	for (i = 0; i < nalloc; i++) {
+		vm_page_dequeue(ma[i]);
 		vm_page_alloc_check(ma[i]);
+	}
 
 	/*
 	 * Initialize the pages.  Only the PG_ZERO flag is inherited.
@@ -2195,8 +2196,10 @@ again:
 #if VM_NRESERVLEVEL > 0
 found:
 #endif
-	for (m = m_ret; m < &m_ret[npages]; m++)
+	for (m = m_ret; m < &m_ret[npages]; m++) {
+		vm_page_dequeue(m);
 		vm_page_alloc_check(m);
+	}
 
 	/*
 	 * Initialize the pages.  Only the PG_ZERO flag is inherited.
@@ -2273,6 +2276,8 @@ vm_page_alloc_check(vm_page_t m)
 	KASSERT(m->object == NULL, ("page %p has object", m));
 	KASSERT(m->queue == PQ_NONE,
 	    ("page %p has unexpected queue %d", m, m->queue));
+	KASSERT((m->aflags & PGA_QUEUE_STATE_MASK) == 0,
+	    ("page %p has unexpected queue state", m));
 	KASSERT(!vm_page_held(m), ("page %p is held", m));
 	KASSERT(!vm_page_busied(m), ("page %p is busy", m));
 	KASSERT(m->dirty == 0, ("page %p is dirty", m));
@@ -2342,6 +2347,7 @@ again:
 			goto again;
 		return (NULL);
 	}
+	vm_page_dequeue(m);
 	vm_page_alloc_check(m);
 
 	/*
@@ -2534,6 +2540,7 @@ retry:
 				    vm_reserv_size(level)) - pa);
 #endif
 			} else if (object->memattr == VM_MEMATTR_DEFAULT &&
+			    /* XXX need to check PGA_DEQUEUE */
 			    m->queue != PQ_NONE && !vm_page_busied(m)) {
 				/*
 				 * The page is allocated but eligible for
@@ -2686,6 +2693,7 @@ retry:
 			else if (object->memattr != VM_MEMATTR_DEFAULT)
 				error = EINVAL;
 			else if (m->queue != PQ_NONE && !vm_page_busied(m)) {
+				/* XXX need to check PGA_DEQUEUE */
 				KASSERT(pmap_page_get_memattr(m) ==
 				    VM_MEMATTR_DEFAULT,
 				    ("page %p has an unexpected memattr", m));
@@ -3213,113 +3221,288 @@ vm_page_pagequeue(vm_page_t m)
 	return (&vm_pagequeue_domain(m)->vmd_pagequeues[m->queue]);
 }
 
+static struct mtx *
+vm_page_pagequeue_lockptr(vm_page_t m)
+{
+
+	if (m->queue == PQ_NONE)
+		return (NULL);
+	return (&vm_page_pagequeue(m)->pq_mutex);
+}
+
+static void
+vm_pqbatch_process(struct vm_pagequeue *pq, struct vm_batchqueue *bq,
+    uint8_t queue)
+{
+	vm_page_t m;
+	int delta;
+	uint8_t aflags;
+
+	vm_pagequeue_assert_locked(pq);
+
+	delta = 0;
+	VM_BATCHQ_FOREACH(bq, m) {
+		if (__predict_false(m->queue != queue))
+			continue;
+
+		aflags = m->aflags;
+		if ((aflags & PGA_DEQUEUE) != 0) {
+			if (__predict_true((aflags & PGA_ENQUEUED) != 0)) {
+				TAILQ_REMOVE(&pq->pq_pl, m, plinks.q);
+				delta--;
+			}
+
+			/*
+			 * Synchronize with the page daemon, which may be
+			 * simultaneously scanning this page with only the page
+			 * lock held.  We must be careful to avoid leaving the
+			 * page in a state where it appears to belong to a page
+			 * queue.
+			 */
+			m->queue = PQ_NONE;
+			atomic_thread_fence_rel();
+			vm_page_aflag_clear(m, PGA_QUEUE_STATE_MASK);
+		} else if ((aflags & PGA_ENQUEUED) == 0) {
+			TAILQ_INSERT_TAIL(&pq->pq_pl, m, plinks.q);
+			delta++;
+			vm_page_aflag_set(m, PGA_ENQUEUED);
+			if (__predict_false((aflags & PGA_REQUEUE) != 0))
+				vm_page_aflag_clear(m, PGA_REQUEUE);
+		} else if ((aflags & PGA_REQUEUE) != 0) {
+			TAILQ_REMOVE(&pq->pq_pl, m, plinks.q);
+			TAILQ_INSERT_TAIL(&pq->pq_pl, m, plinks.q);
+			vm_page_aflag_clear(m, PGA_REQUEUE);
+		}
+	}
+	vm_batchqueue_init(bq);
+	vm_pagequeue_cnt_add(pq, delta);
+}
+
 /*
- *	vm_page_dequeue:
+ *	vm_page_dequeue_lazy:
  *
- *	Remove the given page from its current page queue.
+ *	Request removal of the given page from its current page
+ *	queue.  Physical removal from the queue may be deferred
+ *	arbitrarily, and may be cancelled by later queue operations
+ *	on that page.
  *
  *	The page must be locked.
  */
-void
-vm_page_dequeue(vm_page_t m)
+static void
+vm_page_dequeue_lazy(vm_page_t m)
 {
+	struct vm_batchqueue *bq;
 	struct vm_pagequeue *pq;
+	int domain, queue;
 
 	vm_page_assert_locked(m);
-	KASSERT(m->queue < PQ_COUNT, ("vm_page_dequeue: page %p is not queued",
-	    m));
-	pq = vm_page_pagequeue(m);
-	vm_pagequeue_lock(pq);
-	m->queue = PQ_NONE;
-	TAILQ_REMOVE(&pq->pq_pl, m, plinks.q);
-	vm_pagequeue_cnt_dec(pq);
+
+	queue = m->queue;
+	if (queue == PQ_NONE)
+		return;
+	domain = vm_phys_domain(m);
+	pq = &VM_DOMAIN(domain)->vmd_pagequeues[queue];
+
+	vm_page_aflag_set(m, PGA_DEQUEUE);
+
+	critical_enter();
+	bq = DPCPU_PTR(pqbatch[domain][queue]);
+	if (vm_batchqueue_insert(bq, m)) {
+		critical_exit();
+		return;
+	}
+	if (!vm_pagequeue_trylock(pq)) {
+		critical_exit();
+		vm_pagequeue_lock(pq);
+		critical_enter();
+		bq = DPCPU_PTR(pqbatch[domain][queue]);
+	}
+	vm_pqbatch_process(pq, bq, queue);
+
+	/*
+	 * The page may have been dequeued by another thread before we
+	 * acquired the page queue lock.  However, since we hold the
+	 * page lock, the page's queue field cannot change a second
+	 * time and we can safely clear PGA_DEQUEUE.
+	 */
+	KASSERT(m->queue == queue || m->queue == PQ_NONE,
+	    ("%s: page %p migrated between queues", __func__, m));
+	if (m->queue == queue) {
+		(void)vm_batchqueue_insert(bq, m);
+		vm_pqbatch_process(pq, bq, queue);
+	} else
+		vm_page_aflag_clear(m, PGA_DEQUEUE);
 	vm_pagequeue_unlock(pq);
+	critical_exit();
 }
 
 /*
  *	vm_page_dequeue_locked:
  *
- *	Remove the given page from its current page queue.
+ *	Remove the page from its page queue, which must be locked.
+ *	If the page lock is not held, there is no guarantee that the
+ *	page will not be enqueued by another thread before this function
+ *	returns.  In this case, it is up to the caller to ensure that
+ *	no other threads hold a reference to the page.
  *
- *	The page and page queue must be locked.
+ *	The page queue lock must be held.  If the page is not already
+ *	logically dequeued, the page lock must be held as well.
  */
 void
 vm_page_dequeue_locked(vm_page_t m)
 {
 	struct vm_pagequeue *pq;
 
-	vm_page_lock_assert(m, MA_OWNED);
-	pq = vm_page_pagequeue(m);
-	vm_pagequeue_assert_locked(pq);
+	KASSERT(m->queue != PQ_NONE,
+	    ("%s: page %p queue field is PQ_NONE", __func__, m));
+	vm_pagequeue_assert_locked(vm_page_pagequeue(m));
+	KASSERT((m->aflags & PGA_DEQUEUE) != 0 ||
+	    mtx_owned(vm_page_lockptr(m)),
+	    ("%s: queued unlocked page %p", __func__, m));
+
+	if ((m->aflags & PGA_ENQUEUED) != 0) {
+		pq = vm_page_pagequeue(m);
+		TAILQ_REMOVE(&pq->pq_pl, m, plinks.q);
+		vm_pagequeue_cnt_dec(pq);
+	}
+
+	/*
+	 * Synchronize with the page daemon, which may be simultaneously
+	 * scanning this page with only the page lock held.  We must be careful
+	 * to avoid leaving the page in a state where it appears to belong to a
+	 * page queue.
+	 */
 	m->queue = PQ_NONE;
-	TAILQ_REMOVE(&pq->pq_pl, m, plinks.q);
-	vm_pagequeue_cnt_dec(pq);
+	atomic_thread_fence_rel();
+	vm_page_aflag_clear(m, PGA_QUEUE_STATE_MASK);
 }
 
 /*
- *	vm_page_enqueue:
+ *	vm_page_dequeue:
  *
- *	Add the given page to the specified page queue.
+ *	Remove the page from whichever page queue it's in, if any.
+ *	If the page lock is not held, there is no guarantee that the
+ *	page will not be enqueued by another thread before this function
+ *	returns.  In this case, it is up to the caller to ensure that
+ *	no other threads hold a reference to the page.
+ */
+void
+vm_page_dequeue(vm_page_t m)
+{
+	struct mtx *lock, *lock1;
+
+	lock = vm_page_pagequeue_lockptr(m);
+	for (;;) {
+		if (lock == NULL)
+			return;
+		mtx_lock(lock);
+		if ((lock1 = vm_page_pagequeue_lockptr(m)) == lock)
+			break;
+		mtx_unlock(lock);
+		lock = lock1;
+	}
+	KASSERT(lock == vm_page_pagequeue_lockptr(m),
+	    ("%s: page %p migrated directly between queues", __func__, m));
+	vm_page_dequeue_locked(m);
+	mtx_unlock(lock);
+}
+
+/*
+ *	vm_page_enqueue_lazy:
  *
+ *	Schedule the given page for insertion into the specified page queue.
+ *	Physical insertion of the page may be deferred indefinitely.
+ *
  *	The page must be locked.
  */
 static void
-vm_page_enqueue(uint8_t queue, vm_page_t m)
+vm_page_enqueue_lazy(vm_page_t m, uint8_t queue)
 {
+	struct vm_batchqueue *bq;
 	struct vm_pagequeue *pq;
+	int domain;
 
-	vm_page_lock_assert(m, MA_OWNED);
-	KASSERT(queue < PQ_COUNT,
-	    ("vm_page_enqueue: invalid queue %u request for page %p",
-	    queue, m));
+	vm_page_assert_locked(m);
+	KASSERT(m->queue == PQ_NONE && (m->aflags & PGA_QUEUE_STATE_MASK) == 0,
+	    ("%s: page %p is already enqueued", __func__, m));
+
+	domain = vm_phys_domain(m);
 	pq = &vm_pagequeue_domain(m)->vmd_pagequeues[queue];
-	vm_pagequeue_lock(pq);
+
+	/*
+	 * The queue field might be changed back to PQ_NONE by a concurrent
+	 * call to vm_page_dequeue().  In that case the batch queue entry will
+	 * be a no-op.
+	 */
 	m->queue = queue;
-	TAILQ_INSERT_TAIL(&pq->pq_pl, m, plinks.q);
-	vm_pagequeue_cnt_inc(pq);
+
+	critical_enter();
+	bq = DPCPU_PTR(pqbatch[domain][queue]);
+	if (__predict_true(vm_batchqueue_insert(bq, m))) {
+		critical_exit();
+		return;
+	}
+	if (!vm_pagequeue_trylock(pq)) {
+		critical_exit();
+		vm_pagequeue_lock(pq);
+		critical_enter();
+		bq = DPCPU_PTR(pqbatch[domain][queue]);
+	}
+	vm_pqbatch_process(pq, bq, queue);
+	(void)vm_batchqueue_insert(bq, m);
+	vm_pqbatch_process(pq, bq, queue);
 	vm_pagequeue_unlock(pq);
+	critical_exit();
 }
 
 /*
  *	vm_page_requeue:
  *
- *	Move the given page to the tail of its current page queue.
+ *	Schedule a requeue of the given page.
  *
  *	The page must be locked.
  */
 void
 vm_page_requeue(vm_page_t m)
 {
+	struct vm_batchqueue *bq;
 	struct vm_pagequeue *pq;
+	int domain, queue;
 
 	vm_page_lock_assert(m, MA_OWNED);
 	KASSERT(m->queue != PQ_NONE,
-	    ("vm_page_requeue: page %p is not queued", m));
+	    ("%s: page %p is not enqueued", __func__, m));
+
+	domain = vm_phys_domain(m);
+	queue = m->queue;
 	pq = vm_page_pagequeue(m);
-	vm_pagequeue_lock(pq);
-	TAILQ_REMOVE(&pq->pq_pl, m, plinks.q);
-	TAILQ_INSERT_TAIL(&pq->pq_pl, m, plinks.q);
-	vm_pagequeue_unlock(pq);
-}
 
-/*
- *	vm_page_requeue_locked:
- *
- *	Move the given page to the tail of its current page queue.
- *
- *	The page queue must be locked.
- */
-void
-vm_page_requeue_locked(vm_page_t m)
-{
-	struct vm_pagequeue *pq;
+	if (queue == PQ_NONE)
+		return;
 
-	KASSERT(m->queue != PQ_NONE,
-	    ("vm_page_requeue_locked: page %p is not queued", m));
-	pq = vm_page_pagequeue(m);
-	vm_pagequeue_assert_locked(pq);
-	TAILQ_REMOVE(&pq->pq_pl, m, plinks.q);
-	TAILQ_INSERT_TAIL(&pq->pq_pl, m, plinks.q);
+	vm_page_aflag_set(m, PGA_REQUEUE);
+	critical_enter();
+	bq = DPCPU_PTR(pqbatch[domain][queue]);
+	if (__predict_true(vm_batchqueue_insert(bq, m))) {
+		critical_exit();
+		return;
+	}
+	if (!vm_pagequeue_trylock(pq)) {
+		critical_exit();
+		vm_pagequeue_lock(pq);
+		critical_enter();
+		bq = DPCPU_PTR(pqbatch[domain][queue]);
+	}
+	vm_pqbatch_process(pq, bq, queue);
+	KASSERT(m->queue == queue || m->queue == PQ_NONE,
+	    ("%s: page %p migrated between queues", __func__, m));
+	if (m->queue == queue) {
+		(void)vm_batchqueue_insert(bq, m);
+		vm_pqbatch_process(pq, bq, queue);
+	} else
+		vm_page_aflag_clear(m, PGA_REQUEUE);
+	vm_pagequeue_unlock(pq);
+	critical_exit();
 }
 
 /*
@@ -3337,18 +3520,18 @@ vm_page_activate(vm_page_t m)
 	int queue;
 
 	vm_page_lock_assert(m, MA_OWNED);
-	if ((queue = m->queue) != PQ_ACTIVE) {
-		if (m->wire_count == 0 && (m->oflags & VPO_UNMANAGED) == 0) {
-			if (m->act_count < ACT_INIT)
-				m->act_count = ACT_INIT;
-			if (queue != PQ_NONE)
-				vm_page_dequeue(m);
-			vm_page_enqueue(PQ_ACTIVE, m);
-		}
-	} else {
-		if (m->act_count < ACT_INIT)
+
+	if ((queue = m->queue) == PQ_ACTIVE || m->wire_count > 0 ||
+	    (m->oflags & VPO_UNMANAGED) != 0) {
+		if (queue == PQ_ACTIVE && m->act_count < ACT_INIT)
 			m->act_count = ACT_INIT;
+		return;
 	}
+
+	vm_page_remque(m);
+	if (m->act_count < ACT_INIT)
+		m->act_count = ACT_INIT;
+	vm_page_enqueue_lazy(m, PQ_ACTIVE);
 }
 
 /*
@@ -3359,11 +3542,10 @@ vm_page_activate(vm_page_t m)
  *	the page to the free list only if this function returns true.
  *
  *	The object must be locked.  The page must be locked if it is
- *	managed.  For a queued managed page, the pagequeue_locked
- *	argument specifies whether the page queue is already locked.
+ *	managed.
  */
 bool
-vm_page_free_prep(vm_page_t m, bool pagequeue_locked)
+vm_page_free_prep(vm_page_t m)
 {
 
 #if defined(DIAGNOSTIC) && defined(PHYS_TO_DMAP)
@@ -3402,12 +3584,14 @@ vm_page_free_prep(vm_page_t m, bool pagequeue_locked)
 		return (false);
 	}
 
-	if (m->queue != PQ_NONE) {
-		if (pagequeue_locked)
-			vm_page_dequeue_locked(m);
-		else
-			vm_page_dequeue(m);
-	}
+	/*
+	 * Pages need not be dequeued before they are returned to the physical
+	 * memory allocator, but they must at least be marked for a deferred
+	 * dequeue.
+	 */
+	if ((m->oflags & VPO_UNMANAGED) == 0)
+		vm_page_dequeue_lazy(m);
+
 	m->valid = 0;
 	vm_page_undirty(m);
 
@@ -3443,6 +3627,12 @@ static void
 vm_page_free_phys(struct vm_domain *vmd, vm_page_t m)
 {
 
+#if 0
+	/* XXX racy */
+	KASSERT((m->aflags & PGA_DEQUEUE) != 0 || m->queue == PQ_NONE,
+	    ("%s: page %p has lingering queue state", __func__, m));
+#endif
+
 	vm_domain_free_assert_locked(vmd);
 
 #if VM_NRESERVLEVEL > 0
@@ -3451,36 +3641,6 @@ vm_page_free_phys(struct vm_domain *vmd, vm_page_t m)
 		vm_phys_free_pages(m, 0);
 }
 
-void
-vm_page_free_phys_pglist(struct pglist *tq)
-{
-	struct vm_domain *vmd;
-	vm_page_t m;
-	int cnt;
-
-	if (TAILQ_EMPTY(tq))
-		return;
-	vmd = NULL;
-	cnt = 0;
-	TAILQ_FOREACH(m, tq, listq) {
-		if (vmd != vm_pagequeue_domain(m)) {
-			if (vmd != NULL) {
-				vm_domain_free_unlock(vmd);
-				vm_domain_freecnt_inc(vmd, cnt);
-				cnt = 0;
-			}
-			vmd = vm_pagequeue_domain(m);
-			vm_domain_free_lock(vmd);
-		}
-		vm_page_free_phys(vmd, m);
-		cnt++;
-	}
-	if (vmd != NULL) {
-		vm_domain_free_unlock(vmd);
-		vm_domain_freecnt_inc(vmd, cnt);
-	}
-}
-
 /*
  *	vm_page_free_toq:
  *
@@ -3493,15 +3653,32 @@ vm_page_free_phys_pglist(struct pglist *tq)
 void
 vm_page_free_toq(vm_page_t m)
 {
+	struct vm_batchqueue *cpubq, bq;
 	struct vm_domain *vmd;
+	int domain;
 
-	if (!vm_page_free_prep(m, false))
+	if (!vm_page_free_prep(m))
 		return;
-	vmd = vm_pagequeue_domain(m);
+
+	domain = vm_phys_domain(m);
+	vmd = VM_DOMAIN(domain);
+
+	critical_enter();
+	cpubq = DPCPU_PTR(freeqbatch[domain]);
+	if (vm_batchqueue_insert(cpubq, m)) {
+		critical_exit();
+		return;
+	}
+	memcpy(&bq, cpubq, sizeof(bq));
+	vm_batchqueue_init(cpubq);
+	critical_exit();
+
 	vm_domain_free_lock(vmd);
 	vm_page_free_phys(vmd, m);
+	VM_BATCHQ_FOREACH(&bq, m)
+		vm_page_free_phys(vmd, m);
 	vm_domain_free_unlock(vmd);
-	vm_domain_freecnt_inc(vmd, 1);
+	vm_domain_freecnt_inc(vmd, bq.bq_cnt + 1);
 }
 
 /*
@@ -3558,22 +3735,25 @@ vm_page_unwire(vm_page_t m, uint8_t queue)
 	KASSERT(queue < PQ_COUNT || queue == PQ_NONE,
 	    ("vm_page_unwire: invalid queue %u request for page %p",
 	    queue, m));
+	if ((m->oflags & VPO_UNMANAGED) == 0)
+		vm_page_assert_locked(m);
 
 	unwired = vm_page_unwire_noq(m);
-	if (unwired && (m->oflags & VPO_UNMANAGED) == 0 && m->object != NULL) {
-		if (m->queue == queue) {
+	if (!unwired || (m->oflags & VPO_UNMANAGED) != 0 || m->object == NULL)
+		return (unwired);
+
+	if (m->queue == queue) {
+		if (queue == PQ_ACTIVE)
+			vm_page_reference(m);
+		else if (queue != PQ_NONE)
+			vm_page_requeue(m);
+	} else {
+		vm_page_dequeue(m);
+		if (queue != PQ_NONE) {
+			vm_page_enqueue_lazy(m, queue);
 			if (queue == PQ_ACTIVE)
-				vm_page_reference(m);
-			else if (queue != PQ_NONE)
-				vm_page_requeue(m);
-		} else {
-			vm_page_remque(m);
-			if (queue != PQ_NONE) {
-				vm_page_enqueue(queue, m);
-				if (queue == PQ_ACTIVE)
-					/* Initialize act_count. */
-					vm_page_activate(m);
-			}
+				/* Initialize act_count. */
+				vm_page_activate(m);
 		}
 	}
 	return (unwired);
@@ -3620,7 +3800,7 @@ vm_page_unwire_noq(vm_page_t m)
  * The page must be locked.
  */
 static inline void
-_vm_page_deactivate(vm_page_t m, boolean_t noreuse)
+_vm_page_deactivate(vm_page_t m, bool noreuse)
 {
 	struct vm_pagequeue *pq;
 	int queue;
@@ -3629,31 +3809,34 @@ _vm_page_deactivate(vm_page_t m, boolean_t noreuse)
 
 	/*
 	 * Ignore if the page is already inactive, unless it is unlikely to be
-	 * reactivated.
+	 * reactivated.  Note that the test of m->queue is racy since the
+	 * inactive queue lock is not held.
 	 */
 	if ((queue = m->queue) == PQ_INACTIVE && !noreuse)
 		return;
-	if (m->wire_count == 0 && (m->oflags & VPO_UNMANAGED) == 0) {
-		pq = &vm_pagequeue_domain(m)->vmd_pagequeues[PQ_INACTIVE];
-		/* Avoid multiple acquisitions of the inactive queue lock. */
-		if (queue == PQ_INACTIVE) {
-			vm_pagequeue_lock(pq);
-			vm_page_dequeue_locked(m);
-		} else {
-			if (queue != PQ_NONE)
-				vm_page_dequeue(m);
-			vm_pagequeue_lock(pq);
-		}
+	if (m->wire_count > 0 || (m->oflags & VPO_UNMANAGED) != 0)
+		return;
+
+	/*
+	 * XXX we can do this with only one lock acquisition if m is already
+	 * in PQ_INACTIVE
+	 */
+	vm_page_remque(m);
+
+	pq = &vm_pagequeue_domain(m)->vmd_pagequeues[PQ_INACTIVE];
+	if (noreuse) {
+		/* This is a slow path. */
+		vm_pagequeue_lock(pq);
 		m->queue = PQ_INACTIVE;
-		if (noreuse)
-			TAILQ_INSERT_BEFORE(
-			    &vm_pagequeue_domain(m)->vmd_inacthead, m,
-			    plinks.q);
-		else
-			TAILQ_INSERT_TAIL(&pq->pq_pl, m, plinks.q);
+		TAILQ_INSERT_BEFORE(&vm_pagequeue_domain(m)->vmd_inacthead, m,
+		    plinks.q);
 		vm_pagequeue_cnt_inc(pq);
+		vm_page_aflag_set(m, PGA_ENQUEUED);
+		if ((m->aflags & PGA_REQUEUE) != 0)
+			vm_page_aflag_clear(m, PGA_REQUEUE);
 		vm_pagequeue_unlock(pq);
-	}
+	} else
+		vm_page_enqueue_lazy(m, PQ_INACTIVE);
 }
 
 /*
@@ -3665,7 +3848,7 @@ void
 vm_page_deactivate(vm_page_t m)
 {
 
-	_vm_page_deactivate(m, FALSE);
+	_vm_page_deactivate(m, false);
 }
 
 /*
@@ -3678,7 +3861,7 @@ void
 vm_page_deactivate_noreuse(vm_page_t m)
 {
 
-	_vm_page_deactivate(m, TRUE);
+	_vm_page_deactivate(m, true);
 }
 
 /*
@@ -3692,15 +3875,13 @@ vm_page_launder(vm_page_t m)
 	int queue;
 
 	vm_page_assert_locked(m);
-	if ((queue = m->queue) != PQ_LAUNDRY) {
-		if (m->wire_count == 0 && (m->oflags & VPO_UNMANAGED) == 0) {
-			if (queue != PQ_NONE)
-				vm_page_dequeue(m);
-			vm_page_enqueue(PQ_LAUNDRY, m);
-		} else
-			KASSERT(queue == PQ_NONE,
-			    ("wired page %p is queued", m));
-	}
+
+	if ((queue = m->queue) == PQ_LAUNDRY || m->wire_count > 0 ||
+	    (m->oflags & VPO_UNMANAGED) != 0)
+		return;
+
+	vm_page_remque(m);
+	vm_page_enqueue_lazy(m, PQ_LAUNDRY);
 }
 
 /*
@@ -3715,9 +3896,9 @@ vm_page_unswappable(vm_page_t m)
 	vm_page_assert_locked(m);
 	KASSERT(m->wire_count == 0 && (m->oflags & VPO_UNMANAGED) == 0,
 	    ("page %p already unswappable", m));
-	if (m->queue != PQ_NONE)
-		vm_page_dequeue(m);
-	vm_page_enqueue(PQ_UNSWAPPABLE, m);
+
+	vm_page_remque(m);
+	vm_page_enqueue_lazy(m, PQ_UNSWAPPABLE);
 }
 
 /*

Modified: user/markj/vm-playground/sys/vm/vm_page.h
==============================================================================
--- user/markj/vm-playground/sys/vm/vm_page.h	Fri Mar  2 21:26:48 2018	(r330295)
+++ user/markj/vm-playground/sys/vm/vm_page.h	Fri Mar  2 21:50:02 2018	(r330296)
@@ -94,7 +94,9 @@
  *	In general, operations on this structure's mutable fields are
  *	synchronized using either one of or a combination of the lock on the
  *	object that the page belongs to (O), the pool lock for the page (P),
- *	or the lock for either the free or paging queue (Q).  If a field is
+ *	the per-domain lock for the free queues (F), or the page's queue
+ *	lock (Q).  The queue lock for a page depends on the value of its
+ *	queue field and is described in detail below.  If a field is
  *	annotated below with two of these locks, then holding either lock is
  *	sufficient for read access, but both locks are required for write
  *	access.  An annotation of (C) indicates that the field is immutable.
@@ -143,6 +145,28 @@
  *	causing the thread to block.  vm_page_sleep_if_busy() can be used to
  *	sleep until the page's busy state changes, after which the caller
  *	must re-lookup the page and re-evaluate its state.
+ *
+ *	The queue field is the index of the page queue containing the
+ *	page, or PQ_NONE if the page is not enqueued.  The queue lock of a
+ *	page is the page queue lock corresponding to the page queue index,
+ *	or the page lock (P) for the page.  To modify the queue field, the
+ *	queue lock for the old value of the field must be held.  It is
+ *	invalid for a page's queue field to transition between two distinct
+ *	page queue indices.  That is, when updating the queue field, either
+ *	the new value or the old value must be PQ_NONE.
+ *
+ *	To avoid contention on page queue locks, page queue operations
+ *	(enqueue, dequeue, requeue) are batched using per-CPU queues.
+ *	A deferred operation is requested by inserting an entry into a
+ *	batch queue; the entry is simply a pointer to the page, and the
+ *	request type is encoded in the page's aflags field using the values
+ *	in PGA_QUEUE_STATE_MASK.  The type-stability of struct vm_pages is
+ *	crucial to this scheme since the processing of entries in a given
+ *	batch queue may be deferred indefinitely.  In particular, a page
+ *	may be freed before its pending batch queue entries have been
+ *	processed.  The page lock (P) must be held to schedule a batched
+ *	queue operation, and the page queue lock must be held in order to
+ *	process batch queue entries for the page queue.
  */
 
 #if PAGE_SIZE == 4096
@@ -174,7 +198,7 @@ struct vm_page {
 	TAILQ_ENTRY(vm_page) listq;	/* pages in same object (O) */
 	vm_object_t object;		/* which object am I in (O,P) */
 	vm_pindex_t pindex;		/* offset into object (O,P) */
-	vm_paddr_t phys_addr;		/* physical address of page */
+	vm_paddr_t phys_addr;		/* physical address of page (C) */
 	struct md_page md;		/* machine dependent stuff */
 	u_int wire_count;		/* wired down maps refs (P) */
 	volatile u_int busy_lock;	/* busy owners lock */
@@ -182,11 +206,11 @@ struct vm_page {
 	uint16_t flags;			/* page PG_* flags (P) */
 	uint8_t aflags;			/* access is atomic */
 	uint8_t oflags;			/* page VPO_* flags (O) */
-	uint8_t	queue;			/* page queue index (P,Q) */
+	uint8_t	queue;			/* page queue index (Q) */
 	int8_t psind;			/* pagesizes[] index (O) */
 	int8_t segind;			/* vm_phys segment index (C) */
-	uint8_t	order;			/* index of the buddy queue */
-	uint8_t pool;			/* vm_phys freepool index (Q) */
+	uint8_t	order;			/* index of the buddy queue (F) */
+	uint8_t pool;			/* vm_phys freepool index (F) */
 	u_char	act_count;		/* page usage count (P) */
 	/* NOTE that these must support one bit per DEV_BSIZE in a page */
 	/* so, on normal X86 kernels, they must be at least 8 bits wide */
@@ -314,11 +338,33 @@ extern struct mtx_padalign pa_lock[];
  *
  * PGA_EXECUTABLE may be set by pmap routines, and indicates that a page has
  * at least one executable mapping.  It is not consumed by the MI VM layer.
+ *
+ * PGA_ENQUEUED is set and cleared when a page is inserted into or removed
+ * from a page queue, respectively.  It determines whether the plinks.q field
+ * of the page is valid.  To set or clear this flag, the queue lock for the
+ * page must be held: the page queue lock corresponding to the page's "queue"
+ * field if its value is not PQ_NONE, and the page lock otherwise.
+ *
+ * PGA_DEQUEUE is set when the page is scheduled to be dequeued from a page
+ * queue, and cleared when the dequeue request is processed.  A page may
+ * have PGA_DEQUEUE set and PGA_ENQUEUED cleared, for instance if a dequeue
+ * is requested after the page is scheduled to be enqueued but before it is
+ * actually inserted into the page queue.  The page lock must be held to set
+ * this flag, and the queue lock for the page must be held to clear it.
+ *
+ * PGA_REQUEUE is set when the page is scheduled to be requeued in its page
+ * queue.  The page lock must be held to set this flag, and the queue lock
+ * for the page must be held to clear it.
  */
 #define	PGA_WRITEABLE	0x01		/* page may be mapped writeable */
 #define	PGA_REFERENCED	0x02		/* page has been referenced */
 #define	PGA_EXECUTABLE	0x04		/* page may be mapped executable */
+#define	PGA_ENQUEUED	0x08		/* page is enqueued in a page queue */
+#define	PGA_DEQUEUE	0x10		/* page is due to be dequeued */
+#define	PGA_REQUEUE	0x20		/* page is due to be requeued */
 
+#define	PGA_QUEUE_STATE_MASK	(PGA_ENQUEUED | PGA_DEQUEUE | PGA_REQUEUE)
+
 /*
  * Page flags.  If changed at any other time than page allocation or
  * freeing, the modification must be protected by the vm_page lock.
@@ -490,7 +536,7 @@ void vm_page_dequeue(vm_page_t m);
 void vm_page_dequeue_locked(vm_page_t m);
 vm_page_t vm_page_find_least(vm_object_t, vm_pindex_t);
 void vm_page_free_phys_pglist(struct pglist *tq);
-bool vm_page_free_prep(vm_page_t m, bool pagequeue_locked);
+bool vm_page_free_prep(vm_page_t m);
 vm_page_t vm_page_getfake(vm_paddr_t paddr, vm_memattr_t memattr);
 void vm_page_initfake(vm_page_t m, vm_paddr_t paddr, vm_memattr_t memattr);
 int vm_page_insert (vm_page_t, vm_object_t, vm_pindex_t);

Modified: user/markj/vm-playground/sys/vm/vm_pageout.c
==============================================================================
--- user/markj/vm-playground/sys/vm/vm_pageout.c	Fri Mar  2 21:26:48 2018	(r330295)

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***

