Strange ARC/Swap/CPU on yesterday's -CURRENT
Don Lewis
truckman at FreeBSD.org
Fri Apr 6 17:33:38 UTC 2018
On 4 Apr, Mark Johnston wrote:
> On Tue, Apr 03, 2018 at 09:42:48PM -0700, Don Lewis wrote:
>> On 3 Apr, Don Lewis wrote:
>> > I reconfigured my Ryzen box to be more similar to my default package
>> > builder by disabling SMT and half of the RAM, limiting it to 8 cores
>> > and 32 GB, and then started bisecting to try to track down the problem.
>> > For each test, I first filled ARC by tarring /usr/ports/distfiles to
>> > /dev/null. The commit range that I was searching was r329844 to
>> > r331716. I narrowed the range to r329844 to r329904. With r329904
>> > and newer, ARC is totally unresponsive to memory pressure and the
>> > machine pages heavily. I see ARC sizes of 28-29 GB and 30 GB of
>> > wired RAM, so there is not much left over for getting useful work
>> > done. Active and free memory each hover under 1 GB. Looking at the
>> > commit logs over this range, the most likely culprit is:
>> >
>> > r329882 | jeff | 2018-02-23 14:51:51 -0800 (Fri, 23 Feb 2018) | 13 lines
>> >
>> > Add a generic Proportional Integral Derivative (PID) controller algorithm and
>> > use it to regulate page daemon output.
>> >
>> > This provides much smoother and more responsive page daemon output, anticipating
>> > demand and avoiding pageout stalls by increasing the number of pages to match
>> > the workload. This is a reimplementation of work done by myself and mlaier at
>> > Isilon.
>> >
>> >
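For reference, one discrete step of a PID controller of the sort the
commit describes looks roughly like the sketch below. The names are
illustrative, not the pidctrl API the commit actually adds; divisor-style
gains keep the arithmetic integer-only, which suits kernel code.

	/* Hypothetical sketch of one PID controller step. */
	struct pid_state {
		int setpoint;		/* target, e.g. the free page target */
		int gain_p, gain_i, gain_d;	/* inverse gains (divisors) */
		int integral;		/* accumulated error */
		int prev_error;		/* error from the previous interval */
	};

	static int
	pid_step(struct pid_state *ps, int measured)
	{
		int error, derivative, output;

		error = ps->setpoint - measured;	/* positive == shortage */
		ps->integral += error;
		derivative = error - ps->prev_error;
		ps->prev_error = error;
		output = error / ps->gain_p + ps->integral / ps->gain_i +
		    derivative / ps->gain_d;
		return (output > 0 ? output : 0);	/* pages to free */
	}

Run once per page daemon interval against the measured free page count, a
step like this yields the number of pages to target in that interval;
anticipating demand rather than reacting only after a shortage is what
smooths the page daemon output the commit message refers to.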
>> > It is quite possible that the recent fixes to the PID controller will
>> > fix the problem. Not that r329844 was trouble free ... I left tar
>> > running over lunchtime to fill ARC and the OOM killer nuked top, tar,
>> > ntpd, both of my ssh sessions into the machine, and multiple instances
>> > of getty while I was away. I was able to log in again and successfully
>> > run poudriere, and ARC did respond to the memory pressure and cranked
>> > itself down to about 5 GB by the end of the run. I did not see the
>> > same tar problem when I repeated the test with r329904.
>>
>> I just tried r331966 and see no improvement. No OOM process kills
>> during the tar run to fill ARC, but with ARC filled, the machine is
>> thrashing itself at the start of the poudriere run while trying to build
>> ports-mgmt/pkg (39 minutes so far). ARC appears to be unresponsive to
>> memory demand. I've seen no decrease in ARC size or wired memory since
>> starting poudriere.
>
> Re-reading the ARC reclaim code, I see a couple of issues which might be
> at the root of the behaviour you're seeing.
>
> 1. zfs_arc_free_target is too low now. It is initialized to the page
> daemon wakeup threshold, which is slightly above v_free_min. With the
> PID controller, the page daemon uses a setpoint of v_free_target.
> Moreover, it now wakes up regularly rather than having wakeups be
> synchronized by a mutex, so it will respond quickly if the free page
> count dips below v_free_target. The free page count will dip below
> zfs_arc_free_target only in the face of sudden and extreme memory
> pressure now, so the FMR_LOTSFREE case probably isn't getting
> exercised. Try initializing zfs_arc_free_target to v_free_target.
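A minimal sketch of that suggestion, assuming the SYSINIT hook arc.c
already uses to seed this tunable, and assuming a global free target is
still reachable as vm_cnt.v_free_target (the VM counters have been in flux
in this window of -CURRENT; with multiple NUMA domains this may instead
need to sum the per-domain vmd_free_target values):

	/* Sketch: seed the ARC free-page threshold from the page
	 * daemon's setpoint rather than its wakeup threshold.
	 * vm_cnt.v_free_target is an assumption, per the caveat above. */
	static void
	arc_free_target_init(void *unused __unused)
	{
		zfs_arc_free_target = vm_cnt.v_free_target;
	}
	SYSINIT(arc_free_target_init, SI_SUB_KTHREAD_PAGE, SI_ORDER_ANY,
	    arc_free_target_init, NULL);

The same value can also be tried on a running system, without a rebuild,
by setting the vfs.zfs.arc_free_target sysctl to the value reported by
vm.stats.vm.v_free_target.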
>
> 2. In the inactive queue scan, we used to compute the shortage after
> running uma_reclaim() and the lowmem handlers (which includes a
> synchronous call to arc_lowmem()). Now it's computed before, so we're
> not taking into account the pages that get freed by the ARC and UMA.
> The following rather hacky patch may help. I note that the lowmem
> logic is now somewhat broken when multiple NUMA domains are
> configured, however, since it fires only when domain 0 has a free
> page shortage.
>
> Index: sys/vm/vm_pageout.c
> ===================================================================
> --- sys/vm/vm_pageout.c	(revision 331933)
> +++ sys/vm/vm_pageout.c	(working copy)
> @@ -1114,25 +1114,6 @@
>  	boolean_t queue_locked;
>
>  	/*
> -	 * If we need to reclaim memory ask kernel caches to return
> -	 * some. We rate limit to avoid thrashing.
> -	 */
> -	if (vmd == VM_DOMAIN(0) && pass > 0 &&
> -	    (time_uptime - lowmem_uptime) >= lowmem_period) {
> -		/*
> -		 * Decrease registered cache sizes.
> -		 */
> -		SDT_PROBE0(vm, , , vm__lowmem_scan);
> -		EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES);
> -		/*
> -		 * We do this explicitly after the caches have been
> -		 * drained above.
> -		 */
> -		uma_reclaim();
> -		lowmem_uptime = time_uptime;
> -	}
> -
> -	/*
>  	 * The addl_page_shortage is the number of temporarily
>  	 * stuck pages in the inactive queue. In other words, the
>  	 * number of pages from the inactive count that should be
> @@ -1824,6 +1805,26 @@
>  	atomic_store_int(&vmd->vmd_pageout_wanted, 1);
>
>  	/*
> +	 * If we need to reclaim memory ask kernel caches to return
> +	 * some. We rate limit to avoid thrashing.
> +	 */
> +	if (vmd == VM_DOMAIN(0) &&
> +	    vmd->vmd_free_count < vmd->vmd_free_target &&
> +	    (time_uptime - lowmem_uptime) >= lowmem_period) {
> +		/*
> +		 * Decrease registered cache sizes.
> +		 */
> +		SDT_PROBE0(vm, , , vm__lowmem_scan);
> +		EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES);
> +		/*
> +		 * We do this explicitly after the caches have been
> +		 * drained above.
> +		 */
> +		uma_reclaim();
> +		lowmem_uptime = time_uptime;
> +	}
> +
> +	/*
>  	 * Use the controller to calculate how many pages to free in
>  	 * this interval.
>  	 */
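For context on what the relocated EVENTHANDLER_INVOKE() reaches: consumers
hook the vm_lowmem event when they initialize, which is how the ARC ends up
getting the synchronous arc_lowmem() call mentioned above. A sketch of the
shape of such a registration, with details abbreviated and the helper name
hypothetical:

	static eventhandler_tag arc_event_lowmem = NULL;

	/* Called synchronously from the page daemon's lowmem scan. */
	static void
	arc_lowmem(void *arg __unused, int howto __unused)
	{
		/* Evict from the ARC; the real handler also waits for
		 * the ARC reclaim thread to make progress. */
	}

	static void
	arc_lowmem_register(void)	/* hypothetical init-path helper */
	{
		arc_event_lowmem = EVENTHANDLER_REGISTER(vm_lowmem,
		    arc_lowmem, NULL, EVENTHANDLER_PRI_FIRST);
	}

The point of moving the invocation is that pages freed here, and by
uma_reclaim(), are now counted before the controller calculates how many
pages to free in the interval.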