vmdaemon CPU usage and poor performance in 10.0-RELEASE

Alan Cox alc at rice.edu
Thu Aug 21 18:30:40 UTC 2014


On 08/21/2014 02:57, Konstantin Belousov wrote:
> On Thu, Aug 21, 2014 at 02:08:38AM -0500, Alan Cox wrote:
>> On 08/20/2014 11:22, Polyack, Steve wrote:
>>>> -----Original Message-----
>>>> From: Alan Cox [mailto:alc at rice.edu]
>>>> Sent: Wednesday, August 20, 2014 12:09 PM
>>>> To: Polyack, Steve; freebsd-stable at freebsd.org
>>>> Subject: Re: vmdaemon CPU usage and poor performance in 10.0-RELEASE
>>>>
>>>> On 08/20/2014 10:56, Polyack, Steve wrote:
>>>>>> -----Original Message-----
>>>>>> From: Alan Cox [mailto:alc at rice.edu]
>>>>>> Sent: Wednesday, August 20, 2014 11:55 AM
>>>>>> To: Polyack, Steve; freebsd-stable at freebsd.org
>>>>>> Subject: Re: vmdaemon CPU usage and poor performance in 10.0-RELEASE
>>>>>>
>>>>>> On 08/20/2014 09:55, Polyack, Steve wrote:
>>>>>>>> -----Original Message-----
>>>>>>>> From: Polyack, Steve
>>>>>>>> Sent: Wednesday, August 20, 2014 9:14 AM
>>>>>>>> To: Polyack, Steve; Alan Cox; freebsd-stable at freebsd.org
>>>>>>>> Subject: RE: vmdaemon CPU usage and poor performance in 10.0-RELEASE
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: owner-freebsd-stable at freebsd.org [mailto:owner-freebsd-
>>>>>>>>> stable at freebsd.org] On Behalf Of Polyack, Steve
>>>>>>>>> Sent: Tuesday, August 19, 2014 12:37 PM
>>>>>>>>> To: Alan Cox; freebsd-stable at freebsd.org
>>>>>>>>> Subject: RE: vmdaemon CPU usage and poor performance in 10.0-RELEASE
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: owner-freebsd-stable at freebsd.org [mailto:owner-freebsd-
>>>>>>>>>> stable at freebsd.org] On Behalf Of Alan Cox
>>>>>>>>>> Sent: Monday, August 18, 2014 6:07 PM
>>>>>>>>>> To: freebsd-stable at freebsd.org
>>>>>>>>>> Subject: Re: vmdaemon CPU usage and poor performance in 10.0-RELEASE
>>>>>>>>>> On 08/18/2014 16:29, Polyack, Steve wrote:
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: owner-freebsd-stable at freebsd.org [mailto:owner-freebsd-
>>>>>>>>>>>> stable at freebsd.org] On Behalf Of Alan Cox
>>>>>>>>>>>> Sent: Monday, August 18, 2014 3:05 PM
>>>>>>>>>>>> To: freebsd-stable at freebsd.org
>>>>>>>>>>>> Subject: Re: vmdaemon CPU usage and poor performance in 10.0-RELEASE
>>>>>>>>>>>> On 08/18/2014 13:42, Polyack, Steve wrote:
>>>>>>>>>>>>> Excuse my poorly formatted reply at the moment, but this seems
>>>>>>>>>>>>> to have fixed our problems.  I'm going to update the bug report
>>>>>>>>>>>>> with a note.
>>>>>>>>>>>>> Thanks Alan!
>>>>>>>>>>>> You're welcome.  And, thanks for letting me know of the outcome.
>>>>>>>>>>>>
>>>>>>>>>>> Actually, I may have spoken too soon, as it looks like we're
>>>>>>>>>>> seeing vmdaemon tying up the system again:
>>>>>>>>>>> root                6  100.0  0.0        0       16  -  DL   Wed04PM      4:37.95 [vmdaemon]
>>>>>>>>>>> Is there anything I can check to help narrow down what may be the
>>>>>>>>>>> problem?  KTrace/truss on the "process" doesn't give any
>>>>>>>>>>> information, I suppose because it's actually a kernel thread.
>>>>>>>>>>
>>>>>>>>>> Can you provide the full output of top?  Is there anything unusual
>>>>>>>>>> about the hardware or software configuration?
>>>>>>>>> This may have just been a fluke (maybe NFS caching the old
>>>>>>>>> vm_pageout.c during the first source build).  We've rebuilt and are
>>>>>>>>> monitoring it now.  The hardware consists of a few Dell PowerEdge
>>>>>>>>> R720xd servers with 256GB of RAM and an array of SSDs (no ZFS).
>>>>>>>>> 64GB is dedicated to postgres shared_buffers right now.  FreeBSD 10,
>>>>>>>>> PostgreSQL 9.3, Slony-I v2.2.2, and redis-2.8.11 are all in use
>>>>>>>>> here.  I can't say that anything is unusual about the configuration.
>>>>>>>>>
>>>>>>>> We are still seeing the issue.  It seems to manifest once the
>>>>>>>> "Free" memory gets under 10GB (of 256GB on the system), even though
>>>>>>>> ~200GB of this is classified as Inactive.  For us, this was about 7
>>>>>>>> hours of database activity (initial replication w/ slony).  Right now
>>>>>>>> vmdaemon is consuming 100% CPU and shows 671:34 CPU time when it
>>>>>>>> showed 0:00 up until the problem manifested.  The full top output
>>>>>>>> (that fits on my screen) is below:
>>>>>>>>
>>>>>>>> last pid: 62309;  load averages:  4.05,  4.24,  4.10    up 0+22:34:31  09:08:43
>>>>>>>> 159 processes: 8 running, 145 sleeping, 1 waiting, 5 lock
>>>>>>>> CPU: 14.5% user,  0.0% nice,  4.9% system,  0.0% interrupt, 80.5% idle
>>>>>>>> Mem: 26G Active, 216G Inact, 4122M Wired, 1178M Cache, 1632M Buf, 2136M Free
>>>>>>>> Swap: 32G Total, 32G Free
>>>>>>>>
>>>>>>>>   PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
>>>>>>>>    11 root            32 155 ki31     0K   512K CPU31  31 669.6H 2934.23% idle
>>>>>>>>     6 root             1 -16    -     0K    16K CPU19  19 678:57 100.00% vmdaemon
>>>>>>>>  1963 pgsql            1  45    0 67538M   208M CPU0    0 121:46  17.38% postgres
>>>>>>>>  2037 pgsql            1  77    0 67536M  2200K *vm ob 14   6:24  15.97% postgres
>>>>>>>>  1864 pgsql            1  31    0 67536M  1290M semwai  4 174:41  15.19% postgres
>>>>>>>>  1996 pgsql            1  38    0 67538M   202M semwai 16 120:27  15.09% postgres
>>>>>>>>  1959 pgsql            1  39    0 67538M   204M CPU27  27 117:30  15.09% postgres
>>>>>>>>  1849 pgsql            1  32    0 67536M  1272M semwai 23 126:22  13.96% postgres
>>>>>>>>  1997 pgsql            1  31    0 67538M   206M CPU30  30 122:26  11.77% postgres
>>>>>>>>  2002 pgsql            1  34    0 67538M   182M sbwait 11  55:20  11.28% postgres
>>>>>>>>  1961 pgsql            1  32    0 67538M   206M CPU12  12 121:47  10.99% postgres
>>>>>>>>  1964 pgsql            1  30    0 67538M   206M semwai 28 122:08   9.86% postgres
>>>>>>>>  1962 pgsql            1  29    0 67538M  1286M sbwait  2  45:49   7.18% postgres
>>>>>>>>  1752 root             1  22    0 78356K  8688K CPU2    2 175:46   6.88% snmpd
>>>>>>>>  1965 pgsql            1  25    0 67538M   207M semwai  9 120:55   6.59% postgres
>>>>>>>>  1960 pgsql            1  23    0 67538M   177M semwai  6  52:42   4.88% postgres
>>>>>>>>  1863 pgsql            1  25    0 67542M   388M semwai 25   9:12   2.20% postgres
>>>>>>>>  1859 pgsql            1  22    0 67538M  1453M *vm ob 20   6:13   2.10% postgres
>>>>>>>>  1860 pgsql            1  22    0 67538M  1454M sbwait  8   6:08   1.95% postgres
>>>>>>>>  1848 pgsql            1  21    0 67586M 66676M *vm ob 30 517:07   1.66% postgres
>>>>>>>>  1856 pgsql            1  22    0 67538M   290M *vm ob 15   5:39   1.66% postgres
>>>>>>>>  1846 pgsql            1  21    0 67538M   163M sbwait 15   5:46   1.46% postgres
>>>>>>>>  1853 pgsql            1  21    0 67538M   110M sbwait 30   8:54   1.17% postgres
>>>>>>>>  1989 pgsql            1  23    0 67536M  5180K sbwait 18   1:41   0.98% postgres
>>>>>>>>     5 root             1 -16    -     0K    16K psleep  6   9:33   0.78% pagedaemon
>>>>>>>>  1854 pgsql            1  20    0 67538M   338M sbwait 22   5:38   0.78% postgres
>>>>>>>>  1861 pgsql            1  20    0 67538M   286M sbwait 15   6:13   0.68% postgres
>>>>>>>>  1857 pgsql            1  20    0 67538M  1454M semwai 10   6:19   0.49% postgres
>>>>>>>>  1999 pgsql            1  36    0 67538M   156M *vm ob 28 120:56   0.39% postgres
>>>>>>>>  1851 pgsql            1  20    0 67538M   136M sbwait 22   5:48   0.39% postgres
>>>>>>>>  1975 pgsql            1  20    0 67536M  5688K sbwait 25   1:40   0.29% postgres
>>>>>>>>  1858 pgsql            1  20    0 67538M   417M sbwait  3   5:55   0.20% postgres
>>>>>>>>  2031 pgsql            1  20    0 67536M  5664K sbwait  5   3:26   0.10% postgres
>>>>>>>>  1834 root            12  20    0 71892K 12848K select 20  34:05   0.00% slon
>>>>>>>>    12 root            78 -76    -     0K  1248K WAIT    0  25:47   0.00% intr
>>>>>>>>  2041 pgsql            1  20    0 67536M  5932K sbwait 14  12:50   0.00% postgres
>>>>>>>>  2039 pgsql            1  20    0 67536M  5960K sbwait 17   9:59   0.00% postgres
>>>>>>>>  2038 pgsql            1  20    0 67536M  5956K sbwait  6   8:21   0.00% postgres
>>>>>>>>  2040 pgsql            1  20    0 67536M  5996K sbwait  7   8:20   0.00% postgres
>>>>>>>>  2032 pgsql            1  20    0 67536M  5800K sbwait 22   7:03   0.00% postgres
>>>>>>>>  2036 pgsql            1  20    0 67536M  5748K sbwait 23   6:38   0.00% postgres
>>>>>>>>  1812 pgsql            1  20    0 67538M 59185M select  1   5:46   0.00% postgres
>>>>>>>>  2005 pgsql            1  20    0 67536M  5788K sbwait 23   5:14   0.00% postgres
>>>>>>>>  2035 pgsql            1  20    0 67536M  4892K sbwait 18   4:52   0.00% <postgres>
>>>>>>>>  1852 pgsql            1  21    0 67536M  1230M semwai  7   4:47   0.00% postgres
>>>>>>>>    13 root             3  -8    -     0K    48K -      28   4:46   0.00% geom
>>>>>>>>
>>>>>>>>
>>>>>>> Another thing I've noticed is that this sysctl vm.stats counter is
>>>>>>> increasing fairly rapidly:
>>>>>>> # sysctl vm.stats.vm.v_pdpages && sleep 1 && sysctl vm.stats.vm.v_pdpages
>>>>>>> vm.stats.vm.v_pdpages: 3455264541
>>>>>>> vm.stats.vm.v_pdpages: 3662158383
>>>>>> I'm not sure what that tells us, because both the page daemon and the
>>>>>> vm ("swap") daemon increment this counter.
>>>>>>
>>>>>>> Also, to demonstrate what kind of problems this seems to cause:
>>>>>>> # time sleep 1
>>>>>>>
>>>>>>> real	0m18.288s
>>>>>>> user	0m0.001s
>>>>>>> sys	0m0.004s
>>>>>> If you change the sysctl vm.swap_enabled to 0, how does your system
>>>>>> behave?
>>>>>>
>>>>> Setting vm.swap_enabled to 0 made the problem clear up almost
>>>>> instantly.  vmdaemon is back to 0.00% CPU usage and the system is
>>>>> responsive once again.
>>>> I doubt that you need whole process swapping.  The page daemon is
>>>> probably sufficient.  See how things go for a few days and let me know.
>>>>
>>>> There is still a bug here that needs diagnosing and fixing.  So, I will
>>>> likely send you a debugging patch in the near future, and ask you to
>>>> reenable swapping under that patch.
>>>>
>>> If it helps at all - setting vm.swap_enabled=0 seems to fix the problem even without the aforementioned patch to vm_pageout.c.
>>>
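
(For anyone else following along and hitting the same thing: the
vm.swap_enabled=0 workaround can be made persistent across reboots with
an ordinary /etc/sysctl.conf entry; nothing here is specific to this
bug, it is just the stock sysctl.conf syntax:

	# Disable whole-process swapout; the pagedaemon still runs.
	vm.swap_enabled=0

Setting it at runtime with "sysctl vm.swap_enabled=0" takes effect
immediately, as Steve saw above.)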
>>
>> I have a couple hypotheses for what is causing your problem.  The
>> attached patch addresses one of them.  Please apply this patch and then
>> reset vm.swap_enabled back to 1.
>>
>>
>> Index: vm/vm_pageout.c
>> ===================================================================
>> --- vm/vm_pageout.c	(revision 270258)
>> +++ vm/vm_pageout.c	(working copy)
>> @@ -1309,6 +1309,20 @@ relock_queues:
>>  	vm_pagequeue_unlock(pq);
>>  
>>  	/*
>> +	 * If we didn't get enough free pages, and we have skipped a vnode
>> +	 * in a writeable object, wakeup the sync daemon.  And kick swapout
>> +	 * if we did not get enough free pages.
>> +	 */
>> +	if (page_shortage > 0) {
>> +		if (vnodes_skipped && vm_page_count_min())
>> +			(void) speedup_syncer();
>> +#if !defined(NO_SWAPPING)
>> +		if (vm_swap_enabled)
>> +			vm_req_vmdaemon(VM_SWAP_NORMAL);
>> +#endif
>> +	}
>> +
>> +	/*
>>  	 * Compute the number of pages we want to try to move from the
>>  	 * active queue to the inactive queue.
>>  	 */
>> @@ -1418,20 +1432,6 @@ relock_queues:
>>  		}
>>  	}
>>  #endif
>> -		
>> -	/*
>> -	 * If we didn't get enough free pages, and we have skipped a vnode
>> -	 * in a writeable object, wakeup the sync daemon.  And kick swapout
>> -	 * if we did not get enough free pages.
>> -	 */
>> -	if (vm_paging_target() > 0) {
>> -		if (vnodes_skipped && vm_page_count_min())
>> -			(void) speedup_syncer();
>> -#if !defined(NO_SWAPPING)
>> -		if (vm_swap_enabled && vm_page_count_target())
>> -			vm_req_vmdaemon(VM_SWAP_NORMAL);
>> -#endif
>> -	}
>>  
>>  	/*
>>  	 * If we are critically low on one of RAM or swap and low on
> I suspect that what happens is contention on the vm object lock
> between the page fault handler and the pageout scan.  If faults are
> frequent enough, the pagedaemon would be unable to try-lock the object
> backing the postgresql shared area, which is large enough to constitute
> a significant part of the queues.


Just to be clear, he's spinning in the vmdaemon, not the pagedaemon. 
Nonetheless, I've patched the pagedaemon above because I argue that
we're issuing bogus wakeups to the vmdaemon.  (See the debugging patch
at the bottom of this message.)

Also, I'm not sure that I understand the relevance of the try lock
here.  In the pagedaemon, if the try lock fails, we always fall back to
a blocking lock acquire.  In other words, we're not skipping the
object's pages because the try lock fails.  We only skip the page if it
has moved from its original place in the inactive queue while we're
doing the unlock, relock dance.
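
For reference, here is a minimal sketch of the pattern I mean,
simplified from vm_pageout_scan() and not the literal code (the queue
marker bookkeeping and error paths are omitted, and the helper
page_still_in_place() is a made-up stand-in for the actual recheck):

	/*
	 * Inactive queue scan, examining page "m" on page queue "pq":
	 * if the object try-lock fails, drop the queue lock, take the
	 * object lock the blocking way, re-take the queue lock, and
	 * only then decide whether the page is still where we left it.
	 */
	object = m->object;
	if (!VM_OBJECT_TRYWLOCK(object)) {
		vm_pagequeue_unlock(pq);	/* give up the queue lock... */
		VM_OBJECT_WLOCK(object);	/* ...and block on the object */
		vm_pagequeue_lock(pq);
		if (!page_still_in_place(m)) {	/* hypothetical recheck */
			/* The page moved while we slept; skip it. */
			VM_OBJECT_WUNLOCK(object);
			continue;
		}
	}
	/* The object is locked; go on to examine/launder/free the page. */

So a failed try-lock costs us time, not pages.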

However, if what you mean is that contention for the object locks could
slow the progress of the page daemon through the inactive queue, such
that it meets its target, but other CPUs have had long enough to
allocate pages so that vm_paging_target() is once again greater than 0,
then I agree.  The sheer size of the inactive queue (216GB), together
with the page target, could be another factor here.

Nonetheless, it's also worth considering that he has 26 GB of memory in
the active queue.  By the time we complete the active queue scan, again
the other CPUs have had plenty of time to allocate pages.  In other
words, that is yet another reason why vm_paging_target() could be
greater than zero.

Basically, back in the days when the kernel was single-threaded, testing
"vm_paging_target() > 0" was a reasonable way of determining if the
inactive queue scan met its target.  However, now that we can have
people concurrently allocating pages, it's racy.  Instead, for
determining whether to call vm_req_vmdaemon(VM_SWAP_NORMAL), we should
directly check whether the inactive queue scan met its target.
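
To spell out the race, here is roughly what vm_paging_target() computes
(paraphrased from memory, so don't treat it as the literal source; the
counter struct is spelled "cnt" or "vm_cnt" depending on the branch):

	/*
	 * Roughly: how many pages short of the free target we are right
	 * now, based on the *current* global counters, not on anything
	 * the inactive queue scan just did.
	 */
	static __inline int
	vm_paging_target(void)
	{

		return (vm_cnt.v_free_target -
		    (vm_cnt.v_free_count + vm_cnt.v_cache_count));
	}

Even if the inactive queue scan drove its page_shortage to zero, every
vm_page_alloc() on the other CPUs during the (long) active queue scan
pulls v_free_count back down, the expression above goes positive again,
and we wake up the vmdaemon for no good reason.  Keying the wakeup off
the scan's own leftover page_shortage, as in the patch above, avoids
that.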

(I believe that another part of the explanation for why the call to
vm_req_vmdaemon(VM_SWAP_NORMAL) comes after the active queue scan and
not immediately after the inactive queue scan is that we would cache
pages from the active queue scan under certain circumstances.  That no
longer happens.)

I suspect that the changes that we made to the paging targets for 10.0
have also played a part in bringing this issue out into the daylight.

To conclude, the race here is not postgres-specific.  I ran a make -j7
buildworld on a machine with physmem configured to 1GB with the
following patch applied:

root@108-254-203-201:/freebsd/base/head/sys # svn di -x -p
Index: vm/vm_pageout.c
===================================================================
--- vm/vm_pageout.c     (revision 270024)
+++ vm/vm_pageout.c     (working copy)
@@ -904,7 +904,7 @@ vm_pageout_scan(struct vm_domain *vmd, int pass)
        int act_delta, addl_page_shortage, deficit, maxscan, page_shortage;
        int vnodes_skipped = 0;
        int maxlaunder;
-       int lockmode;
+       int lockmode, inactive_shortage;
        boolean_t queues_locked;
 
        /*
@@ -1299,6 +1299,8 @@ relock_queues:
        }
        vm_pagequeue_unlock(pq);
 
+       inactive_shortage = page_shortage;
+
        /*
         * Compute the number of pages we want to try to move from the
         * active queue to the inactive queue.
@@ -1415,6 +1417,7 @@ relock_queues:
         * if we did not get enough free pages.
         */
        if (vm_paging_target() > 0) {
+printf("%d %d\n", inactive_shortage, vm_paging_target());
                if (vnodes_skipped && vm_page_count_min())
                        (void) speedup_syncer();
 #if !defined(NO_SWAPPING)

and I saw:

...
0 473
0 384
0 364
0 1522
0 1222
0 95
0 434
0 610
0 490
0 1033
0 499
0 1222
0 676
0 546
0 272
0 345
0 531
0 1064
0 1487
0 1619
0 1558
0 2494
0 601
0 343
0 381
0 1007
0 4393
0 4394
0 177
0 38
0 99
0 192
0 253
0 242
0 76
0 180
0 215
0 167
0 87
0 36
0 160
0 267
0 2198
0 231
0 2080
0 2309
0 2142
1021 3520
2525 4393
3735 3811
2786 4138
3323 4331
3203 4320
731 4258
0 224
0 126
53 140
3462 3752
0 885
0 1900
...

In other words, a lot of vmdaemon wakeups even though the inactive queue
scan cached/freed the desired number of pages.
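
As an aside, if anyone wants to watch the counter Steve mentioned
earlier without staring at top, below is a small userland sketch that
samples vm.stats.vm.v_pdpages via sysctlbyname(3) once a second and
prints the delta.  (Assumptions: the counter is a u_int on the branches
I've looked at, so the code accepts either a 4- or 8-byte result; and
remember that both the pagedaemon and the vmdaemon increment it, so a
high rate by itself doesn't say which one is spinning.)

/* pdpages.c: print the per-second growth of vm.stats.vm.v_pdpages. */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static uint64_t
read_pdpages(void)
{
	char buf[8];
	size_t len;
	uint32_t v32;
	uint64_t val;

	len = sizeof(buf);
	if (sysctlbyname("vm.stats.vm.v_pdpages", buf, &len, NULL, 0) == -1)
		err(1, "sysctlbyname");
	if (len == sizeof(v32)) {
		memcpy(&v32, buf, sizeof(v32));
		val = v32;
	} else {
		memcpy(&val, buf, sizeof(val));
	}
	return (val);
}

int
main(void)
{
	uint64_t cur, prev;

	prev = read_pdpages();
	for (;;) {
		sleep(1);
		cur = read_pdpages();
		printf("v_pdpages/sec: %ju\n", (uintmax_t)(cur - prev));
		prev = cur;
	}
}

Build with "cc -o pdpages pdpages.c" and run it while the problem is
happening; on the box above I'd expect the rate to track the vmdaemon's
CPU burn.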
 


> The fault fast path does not help there, since the object lock is
> still taken on fault, even read-only, which conflicts with the
> pagedaemon's write-lock mode.
>
> This is one of the reasons why I added the MAP_SHARED_PHYS flag.


