ZFS ARC under memory pressure
Karl Denninger
karl at denninger.net
Sat Aug 20 16:08:54 UTC 2016
On 8/20/2016 10:22, Konstantin Belousov wrote:
> On Fri, Aug 19, 2016 at 03:38:55PM -0500, Karl Denninger wrote:
>> Paging *always* requires one I/O (to write the page(s) to the swap) and
>> MAY involve two (to later page it back in.) It is never a "win" to
>> spend a *guaranteed* I/O when you can instead act in a way that *might*
>> cause you to (later) need to execute one.
> Why would pagedaemon need to write out clean page ?
If you are talking about the case of an executable whose text is partially
evicted, you are correct; however, you are still choosing in that instance
to evict a page for which there will likely be future demand, and thus a
required I/O should that executable come back up for execution, rather
than a page for which you have no idea how likely future demand is (a data
page in the ARC.)
Since the VM has no means of "coloring" the ARC as to how "useful" (e.g.
how often used) a particular data item is (the ARC is opaque to the VM
beyond its consumption of system memory), it has no information on which
to base that decision. However, the fact that an executing process is in
some sort of wait state still likely trumps an ARC data page in terms of
likelihood of future access.
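For what it's worth, that opacity is easy to see from userland: ZFS
reports the ARC's own view of itself via sysctl, while the VM sees the
same memory only as wired pages. Roughly, as a sketch using the stock
kstat.zfs.misc.arcstats and vm.stats.vm sysctl nodes (page counts need to
be multiplied by the page size):

    sysctl kstat.zfs.misc.arcstats.size   # bytes currently held by the ARC
    sysctl kstat.zfs.misc.arcstats.c      # the ARC's current target size
    sysctl vm.stats.vm.v_wire_count       # wired pages (the ARC lives here)
    sysctl vm.stats.vm.v_free_count       # free pages
    sysctl vm.stats.vm.v_page_size        # multiply the page counts by this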
root at NewFS:/usr/src/sys/amd64/conf # pstat -s
Device 1K-blocks Used Avail Capacity
/dev/mirror/sw.eli 67108860 291356 66817504 0%
While this is not a large amount of page space used, I can assure you
that at no time since boot was all 32GB of memory in the machine
consumed by other-than-ARC data. As such, for the VM system to have
decided to evict pages to swap rather than have the ARC pared back is
demonstrably wrong, since the result was the execution of guaranteed I/Os
on the *speculative* bet that a page in the ARC would be needed first.
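(For the record, the counters I look at to catch this happening are the
cumulative swap pageout count alongside the ARC size; something along
these lines, as a sketch:

    sysctl vm.stats.vm.v_swappgsout       # pages written to swap since boot
    sysctl kstat.zfs.misc.arcstats.size   # ARC size at the same moment
    swapinfo -k                           # same information pstat -s shows above

If the first number is climbing while the ARC stays pinned near its
target, the pager is paying guaranteed I/Os to protect speculative cache.)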
On unpatched 10.x there were fairly trivial "added" workloads one might
run on a routine basis on this machine (e.g. "make -j8 buildworld") that,
if you had a largish text file open in "vi", would lead to user-perceived
stalls exceeding 10 seconds in length, during which that process's working
set had been evicted so as to keep ARC cache data! And while it might at
first blush appear that the Postgres database consumers on the same
machine would at least be happy with that trade, when *their* RSS got
paged out and *they* took the resulting 10+ second stalls as well, that
certainly was not the case!
11.x does exhibit far less pathology in this regard than unpatched 10.x
did, and I've yet to see the "stall the system to the point that it
appears to have crashed" behavior that I formerly could provoke with a
trivial test.
However, the fact remains that the same machine, with the same load,
running 10.x with my patches ran for months at a time with zero page
space consumed, a fully-utilized ARC and very little slack space
(defined as RAM in "Cache" plus allocated-but-unused UMA) -- in other
words, with no visible pathology at all.
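(If anyone wants to measure the "allocated-but-unused UMA" component
themselves, something like the following approximates it from vmstat -z.
A sketch only: the comma-separated field layout is what 10.x/11.x print,
and it counts only items sitting on the zone free lists, so check it
against your release before trusting the number.)

    vmstat -z | awk -F, '
        NF > 3 { n = split($1, a, " "); slack += a[n] * $4 }
        END    { printf "UMA allocated-but-unused: %.1f MB\n", slack / 1048576 }'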
The behavior of unpatched 11.x, while very materially better than that of
unpatched 10.x, IMHO does not meet this standard. In particular, quite
large quantities of UMA space are allocated-but-unused on a regular basis,
and while *at present* the ARC looks pretty healthy, this is a weekend
when system load is quite low. During the week not only does the UMA
situation look far worse, so do the ARC size and efficiency, which
frequently wind up running at "half-mast" compared to where they ought to be.
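"Efficiency" here is just the hit rate; a quick way to eyeball it is from
the arcstats counters (hits and misses are cumulative since boot, so
sample twice and take the difference if you want the current rate rather
than the lifetime average):

    sysctl -n kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses | awk '
        NR == 1 { hits = $1 }
        NR == 2 { printf "ARC hit rate: %.1f%%\n", 100 * hits / (hits + $1) }'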
I believe FreeBSD 11.x can do better and intend to roll forward the 10.x
work in an attempt to implement that.
--
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/