Reoccurring ZFS performance problems [RESOLVED]

Karl Denninger karl at denninger.net
Fri Mar 14 11:21:58 UTC 2014


On 3/12/2014 1:01 PM, Karl Denninger wrote:
>
> On 3/10/2014 2:38 PM, Adrian Gschwend wrote:
>> On 10.03.14 18:40, Adrian Gschwend wrote:
>>
>>> It looks like my MySQL process finally finished and now the system is
>>> back to being completely fine:
>> ok, it doesn't look like it's only MySQL; I stopped the process a while
>> ago and while things got calmer, I still have the issue.
> ZFS can be convinced to engage in what I can only surmise is 
> pathological behavior, and I've seen no fix for it when it happens -- 
> but there are things you can do to mitigate it.
>
> What IMHO _*should*_ happen is that the ARC cache shrinks as necessary
> to prevent paging, subject to vfs.zfs.arc_min.  One complication: some
> segments get paged off hours (or more!) ago and never get paged back
> in, because that particular piece of code never executes again.  The
> process is still alive, so the system cannot reclaim those pages and
> they show as "committed" in pstat -s, but unless they are paged back in
> they have no impact on system performance.  To avoid pathological
> behavior from those, the policing would have to apply a
> "reasonableness" filter to such pages -- e.g. if an allocation unit has
> been out on the page file for longer than "X", ignore it for this
> purpose.
>
> This would cause the ARC cache to flush itself down automatically as 
> executable and data segment RAM commitments increase.
>
> The documentation says this is how it should work, but in practice it
> doesn't appear to behave this way for many workloads.  I have seen
> "wired" RAM pinned at 20GB on one of my servers here with a fairly
> large DBMS running -- with pieces of its working set and even a
> user's shell (!) getting paged off, yet the
> ARC cache is not pared down to release memory.  Indeed you can let the 
> system run for hours under these conditions and the ARC wired memory 
> will not decrease.  Cutting back the DBMS's internal buffering does 
> not help.
>
> What I've done here is restrict the ARC cache size in an attempt to
> prevent this particular bit of bogosity from biting me, and it appears
> to (sort of) work.  Unfortunately you cannot tune this while the
> system is running -- otherwise a user daemon could conceivably slash
> away at the arc_max sysctl and force the deallocation of wired memory
> whenever it detected paging, or near-paging such as free memory
> dropping below some user-configured threshold.  It can only be set at
> boot time in /boot/loader.conf.
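>
> For example, an 8GB cap would look like this (the value is
> illustrative only -- pick something that leaves headroom for your
> working set):
>
>         # Cap the ARC at 8GB; takes effect at the next boot.
>         vfs.zfs.arc_max="8589934592"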
>
> This is something that, should I get myself a nice hunk of free time, 
> I may dive into and attempt to fix.  It would likely take me quite a 
> while to get up to speed on this as I've not gotten into the zfs code 
> at all -- and mistakes in there could easily corrupt files....  (in 
> other words definitely NOT something to play with on a production 
> system!)
>
> I have to assume there's a pretty good reason why you can't change
> arc_max while the system is running; it _*can*_ be changed on a
> running system in some other implementations (e.g. Solaris).  It is
> marked with CTLFLAG_RDTUN in the arc management file, which prohibits
> run-time changes, and the only place I see it referenced with a quick
> look is in the arc_init code.
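>
> For illustration only (not a quote of the actual declaration in
> arc.c), a sysctl marked that way looks roughly like this:
>
>         /* CTLFLAG_RDTUN: readable at run time, settable only as a boot-time tunable. */
>         SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN,
>             &zfs_arc_max, 0, "Maximum ARC size");
>
> Simply changing the flag to a read-write one wouldn't be enough on its
> own; a proper fix would also need a handler that re-evaluates the ARC
> limits when the value changes.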
>
> Note that the test in arc.c for "arc_reclaim_needed" appears to be 
> pretty basic -- essentially the system will not aggressively try to 
> reclaim memory unless used kmem > 3/4 of its size.
>
> (snippet from around line 2494 of arc.c in 10-STABLE; path
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
>
> #else   /* !sun */
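>         /* Reclaim only when kmem arena use exceeds 3/4 of its size. */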
>         if (kmem_used() > (kmem_size() * 3) / 4)
>                 return (1);
> #endif  /* sun */
>
> Up above that there's a test for "vm_paging_needed()" that would 
> (theoretically) appear to trigger first in these situations, but it 
> doesn't in many cases.
>
> IMHO this is too basic a test and it leads to pathological situations
> in that the system may wind up paging things off instead of paring
> back the ARC cache.  As soon as the working set of something that's
> actually getting cycles gets paged out, in most cases system
> performance goes straight into the trash.
>
> On Sun machines (from reading the code) it will allegedly try to pare
> the ARC any time free memory drops below the "lotsfree" (plus
> "needfree" + "extra") threshold.
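>
> The check on that side is (roughly) the following:
>
>         if (freemem < lotsfree + needfree + extra)
>                 return (1);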
>
> As an example, here is what a server of mine that is currently
> exhibiting this behavior shows:
>  20202500 wire
>   1414052 act
>   2323280 inact
>    110340 cache
>    414484 free
>   1694896 buf
>
> Of that "wired" mem 15.7G of it is ARC cache (with a target of 15.81, 
> so it's essentially right up against it.)
>
> That "free" number would be ok if it didn't result in the system 
> having trashy performance -- but it does on occasion. Incidentally the 
> allocated swap is about 195k blocks (~200 Megabytes) which isn't much 
> all-in, but it's enough to force actual fetches of recently-used 
> programs (e.g. your shell!) from paged-off space.  The thing is that
> if the test in the code (75% of available kmem consumed) were actually
> tracking "free" memory, the system should be aggressively trying to
> free up ARC cache.  It clearly is not; the test shown above calls this:
>
> uint64_t
> kmem_used(void)
> {
>
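>         /* Bytes allocated from the kmem arena, not free physical RAM. */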
>         return (vmem_size(kmem_arena, VMEM_ALLOC));
> }
>
> I need to dig around and see exactly what that's measuring, because 
> what's quite clear is that the system _*thinks*_ it has plenty of free 
> memory when it very clearly is essentially out!  In fact free memory
> at the moment (~400MB) is 1.7% of the total, _*not*_ 25%.  From this I 
> surmise that the "vmem_size" call is not returning the sum of all the 
> above "in use" sizes (except perhaps "inact"); were it to do so that 
> would be essentially 100% of installed RAM and the ARC cache should be 
> actively under shrinkage, but it clearly is not.
>
> I'll keep this one on my "to-do" list somewhere and if I get the 
> chance, see if I can come up with a better test.  What might be
> interesting is to change the test to something like "pare if (free
> space - (pagefile space in use + some modest margin)) < 0".
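>
> As a rough sketch of that idea (every identifier below is invented for
> illustration; the real work is mapping them onto the right VM
> counters):
>
>         /*
>          * Hypothetical replacement test: shrink the ARC whenever free
>          * memory no longer covers what has already been pushed out to
>          * swap, plus a modest safety margin.
>          */
>         static int
>         arc_reclaim_needed_alt(void)
>         {
>                 uint64_t free_bytes = free_memory_bytes();    /* invented helper */
>                 uint64_t swap_bytes = swap_in_use_bytes();    /* invented helper */
>                 uint64_t margin = 64 * 1024 * 1024;           /* example margin */
>
>                 return (free_bytes < swap_bytes + margin);
>         }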
>
> Fixing this tidbit of code could potentially be pretty significant in 
> terms of resolving the occasional but very annoying "freeze" problems 
> that people sometimes run into, along with some mildly-pathological 
> but very-significant behavior in terms of how the ARC cache 
> auto-scales and its impact on performance.  I'm nowhere near 
> up-to-speed enough on the internals of the kernel when it comes to 
> figuring out what it has committed (e.g. how much swap is out, etc) 
> and thus there's going to be a lot of code-reading involved before I 
> can attempt something useful.
>

In the context of the above, here's a fix.  Enjoy.

http://www.freebsd.org/cgi/query-pr.cgi?pr=187572

> Category:       kern
> Responsible:    freebsd-bugs
> Synopsis:       ZFS ARC cache code does not properly handle low memory
> Arrival-Date:   Fri Mar 14 11:20:00 UTC 2014

-- 
-- Karl
karl at denninger.net

