Reoccurring ZFS performance problems [RESOLVED]
Karl Denninger
karl at denninger.net
Fri Mar 14 11:21:58 UTC 2014
On 3/12/2014 1:01 PM, Karl Denninger wrote:
>
> On 3/10/2014 2:38 PM, Adrian Gschwend wrote:
>> On 10.03.14 18:40, Adrian Gschwend wrote:
>>
>>> It looks like my MySQL process finally finished and now the system is
>>> back to being completely fine:
>> OK, it doesn't look like it's only MySQL; I stopped the process a
>> while ago and while things got calmer, I still have the issue.
> ZFS can be convinced to engage in what I can only surmise is
> pathological behavior, and I've seen no fix for it when it happens --
> but there are things you can do to mitigate it.
>
> What IMHO _*should*_ happen is that the ARC cache should shrink as
> necessary to prevent paging, subject to vfs.zfs.arc_min. There is one
> pathological case to guard against: segments that were paged out hours
> (or more!) ago and never get paged back in because that particular
> piece of code never executes again. The owning process is still alive,
> so the system cannot reclaim that space and it still shows as
> "committed" in pstat -s, but unless it is paged back in it has no
> impact on system performance. The policing would therefore have to
> apply a "reasonableness" filter to those pages (e.g. if a page has
> been out on the page file for longer than "X", ignore that particular
> allocation unit for this purpose.)
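>
> Purely as an illustration of the sort of filter I mean -- the function
> name, the age-threshold knob and the whole mechanism are hypothetical,
> nothing like this exists in the tree -- it would boil down to roughly:
>
> /*
>  * Should a paged-out allocation still count as memory pressure for
>  * ARC-shrinking purposes?  Anything that has sat in the page file
>  * longer than the (hypothetical) age threshold is treated as cold
>  * and ignored.
>  */
> static int
> swapped_alloc_counts(time_t paged_out_at, time_t now, time_t age_thresh)
> {
>
>         return ((now - paged_out_at) < age_thresh);
> }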
>
> This would cause the ARC cache to flush itself down automatically as
> executable and data segment RAM commitments increase.
>
> The documentation says that this is how it should work, but in
> practice it does not appear to behave that way for many workloads. I
> have seen "wired" RAM pinned at 20GB on one of my servers here with a
> fairly large DBMS running -- with pieces of its working set and even a
> user's shell (!) getting paged off, yet the ARC cache is not pared
> down to release memory. Indeed you can let the system run for hours
> under these conditions and the ARC wired memory will not decrease.
> Cutting back the DBMS's internal buffering does not help.
>
> What I've done here is restrict the ARC cache size in an attempt to
> prevent this particular bit of bogosity from biting me, and it appears
> to (sort of) work. Unfortunately you cannot tune this on a running
> system; it can only be set at boot time via /boot/loader.conf. If it
> were a run-time knob, a user daemon could conceivably slash away at
> the arc_max sysctl and force the deallocation of wired memory whenever
> it detected paging -- or near-paging, such as free memory falling
> below some user-configured threshold.
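>
> For anyone wanting to do the same, the cap is a byte count set in
> /boot/loader.conf and takes effect at the next boot; the 16GB value
> below is just an example, pick whatever fits your machine:
>
> vfs.zfs.arc_max="17179869184"   # cap the ARC at 16GB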
>
> This is something that, should I get myself a nice hunk of free time,
> I may dive into and attempt to fix. It would likely take me quite a
> while to get up to speed on this as I've not gotten into the zfs code
> at all -- and mistakes in there could easily corrupt files.... (in
> other words definitely NOT something to play with on a production
> system!)
>
> I have to assume there's a pretty good reason why you can't change
> arc_max while the system is running; it _*can*_ be changed on a
> running system in some other implementations (e.g. Solaris.) It is
> marked with CTLFLAG_RDTUN in the ARC management file, which prohibits
> run-time changes, and on a quick look the only place I see it
> referenced is in the arc_init code.
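>
> For reference, the run-time-vs-boot-time distinction is just the flag
> on the sysctl declaration. A minimal sketch (illustrative only, not
> the literal arc.c lines):
>
> /* boot-time tunable: readable at run time, settable only via loader.conf */
> SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN,
>     &zfs_arc_max, 0, "Maximum ARC size");
>
> /* what a run-time adjustable knob would look like instead */
> SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RW,
>     &zfs_arc_max, 0, "Maximum ARC size");
>
> Simply flipping the flag would not be enough, of course, since the
> value is only consumed in arc_init; presumably something would also
> have to push a changed value into the ARC's internal limit (arc_c_max)
> on the fly.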
>
> Note that the test in arc.c for "arc_reclaim_needed" appears to be
> pretty basic -- essentially the system will not aggressively try to
> reclaim memory unless used kmem > 3/4 of its size.
>
> (snippet from around line 2494 of arc.c in 10-STABLE; path
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
>
> #else   /* !sun */
>         if (kmem_used() > (kmem_size() * 3) / 4)
>                 return (1);
> #endif  /* sun */
>
> Up above that there's a test for "vm_paging_needed()" that would
> (theoretically) appear to trigger first in these situations, but it
> doesn't in many cases.
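>
> For context, the checks above that branch look roughly like this
> (paraphrased from memory -- consult arc_reclaim_needed() in the same
> file for the exact code and ordering):
>
>         if (needfree)                   /* the pagedaemon wants pages back */
>                 return (1);
>         if (vm_paging_needed())         /* free memory below the paging threshold */
>                 return (1);
>         /* ...and then the FreeBSD-specific fallback quoted above: */
>         if (kmem_used() > (kmem_size() * 3) / 4)
>                 return (1);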
>
> IMHO this is too basic a test and leads to pathological situations in
> that the system may wind up paging things out rather than paring back
> the ARC cache. In most cases, as soon as the working set of something
> that's actually getting CPU cycles gets paged out, system performance
> goes straight into the trash.
>
> On Sun machines (from reading the code) it will allegedly try to pare
> back any time the "lotsfree" (plus "needfree" + "extra") amount of
> free memory is invaded.
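>
> Going from memory, the Sun branch of that same function does something
> on the order of the following (treat the exact expression as an
> assumption and check the #ifdef sun block in arc.c):
>
>         if (freemem < lotsfree + needfree + extra)
>                 return (1);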
>
> As an example, here is what a server of mine that is exhibiting this
> behavior shows right now:
> 20202500 wire
> 1414052 act
> 2323280 inact
> 110340 cache
> 414484 free
> 1694896 buf
>
> Of that "wired" memory, 15.7GB is ARC cache (with a target of 15.81GB,
> so it's essentially right up against the limit.)
>
> That "free" number would be OK if it didn't result in the system
> having trashy performance -- but on occasion it does. Incidentally,
> the allocated swap is about 195k blocks (~200 megabytes), which isn't
> much all-in, but it's enough to force actual fetches of recently-used
> programs (e.g. your shell!) from paged-off space. The thing is that if
> the test in the code (75% of available kmem consumed) were looking
> only at "free", the system should be aggressively trying to release
> ARC cache. It clearly is not; the included code calls this:
>
> uint64_t
> kmem_used(void)
> {
>
>         return (vmem_size(kmem_arena, VMEM_ALLOC));
> }
>
> I need to dig around and see exactly what that call is measuring,
> because what's quite clear is that the system _*thinks*_ it has plenty
> of free memory when it very clearly is essentially out! In fact free
> memory at the moment (~400MB) is 1.7% of the total, _*not*_ 25%. From
> this I surmise that the "vmem_size" call is not returning the sum of
> all the above "in use" sizes (except perhaps "inact"); were it to do
> so, that would be essentially 100% of installed RAM and the ARC cache
> should be actively shrinking, but it clearly is not.
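>
> A quick way to watch the mismatch from userland is to compare what the
> kernel reports for its kmem arena and the ARC against actual free
> pages. A minimal sketch, assuming the sysctl names as they exist on
> 10.x (compile with: cc -o memwatch memwatch.c):
>
> #include <sys/types.h>
> #include <sys/sysctl.h>
> #include <err.h>
> #include <stdint.h>
> #include <stdio.h>
>
> /*
>  * Fetch an unsigned sysctl of up to 8 bytes; assumes little-endian
>  * (amd64/i386) so 4-byte counters widen correctly into the wider slot.
>  */
> static uint64_t
> fetch(const char *name)
> {
>         uint64_t v = 0;
>         size_t len = sizeof(v);
>
>         if (sysctlbyname(name, &v, &len, NULL, 0) == -1)
>                 err(1, "%s", name);
>         return (v);
> }
>
> int
> main(void)
> {
>         uint64_t pagesz = fetch("vm.stats.vm.v_page_size");
>         uint64_t freepg = fetch("vm.stats.vm.v_free_count");
>         uint64_t kmem = fetch("vm.kmem_size");
>         uint64_t arc = fetch("kstat.zfs.misc.arcstats.size");
>
>         printf("kmem_size: %6ju MB\n", (uintmax_t)(kmem >> 20));
>         printf("ARC size:  %6ju MB\n", (uintmax_t)(arc >> 20));
>         printf("free mem:  %6ju MB\n", (uintmax_t)((freepg * pagesz) >> 20));
>         return (0);
> }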
>
> I'll keep this one on my "to-do" list somewhere and, if I get the
> chance, see if I can come up with a better test. What might be
> interesting is to change the test to "pare if free space less
> (pagefile space in use plus some modest margin) < 0".
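>
> Expressed as code, the idea is nothing more complicated than this (a
> sketch of the proposal only -- the names are made up, and the patch
> referenced below may do it differently):
>
>         /*
>          * Pare the ARC whenever free memory, after subtracting the
>          * space already pushed out to the page file plus a modest
>          * safety margin, would go negative.
>          */
>         static int
>         arc_should_pare(uint64_t free_bytes, uint64_t swap_used_bytes,
>             uint64_t margin_bytes)
>         {
>
>                 return (free_bytes < swap_used_bytes + margin_bytes);
>         }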
>
> Fixing this tidbit of code could be pretty significant in terms of
> resolving the occasional but very annoying "freeze" problems that
> people sometimes run into, along with some mildly pathological but
> very significant behavior in how the ARC cache auto-scales and the
> impact that has on performance. I'm nowhere near up to speed on the
> parts of the kernel that track what has been committed (e.g. how much
> swap is out, etc.), so there's going to be a lot of code-reading
> involved before I can attempt something useful.
>
In the context of the above, here's a fix. Enjoy.
http://www.freebsd.org/cgi/query-pr.cgi?pr=187572
> Category: kern
> Responsible: freebsd-bugs
> Synopsis: ZFS ARC cache code does not properly handle low memory
> Arrival-Date: Fri Mar 14 11:20:00 UTC 2014
--
-- Karl
karl at denninger.net