kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Karl Denninger
karl at denninger.net
Mon Mar 24 11:50:02 UTC 2014
The following reply was made to PR kern/187594; it has been noted by GNATS.
From: Karl Denninger <karl at denninger.net>
To: bug-followup at FreeBSD.org, karl at fs.denninger.net
Cc:
Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Date: Mon, 24 Mar 2014 06:41:16 -0500
Update:
1. Patch is still good against the latest arc.c change (associated with
the new flags on the pool).
2. The default low-memory warning level for the ARC is now
cnt.v_free_target, with no additional margin.  This appears to provide the
best performance and does not cause problems with inactive pages or other
misbehavior on my test systems.
3. Expose the return flag (arc_shrink_needed) so that, if you care to
watch it for some reason, you can.
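
For anyone who wants to experiment once the patch is applied, the knobs
read and set like any other ZFS sysctl.  The numeric values below are
placeholders for illustration only, not recommendations:

   sysctl vfs.zfs.arc_shrink_needed         # reads 1 while ARC shrinkage is being requested
   sysctl vfs.zfs.arc_freepages=65536       # raise the free-page floor (0 re-inits from v_free_target)
   sysctl vfs.zfs.arc_freepage_percent=1    # additionally reserve 1% of total RAM

Both tunables may also be set from /boot/loader.conf, since the patch
declares them with TUNABLE_INT.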
*** arc.c.original Sun Mar 23 14:56:01 2014
--- arc.c Sun Mar 23 15:12:15 2014
***************
*** 18,23 ****
--- 18,95 ----
*
* CDDL HEADER END
*/
+
+ /* Karl Denninger (karl at denninger.net), 3/20/2014, FreeBSD-specific
+ *
+ * If "NEWRECLAIM" is defined, change the "low memory" warning that causes
+ * the ARC cache to be pared down.  The reason for the change is that the
+ * apparent attempted algorithm is to start evicting ARC cache when free
+ * pages fall below 25% of installed RAM.  This maps reasonably well to how
+ * Solaris is documented to behave; when "lotsfree" is invaded ZFS is told
+ * to pare down.
+ *
+ * The problem is that on FreeBSD machines the system doesn't appear to be
+ * getting what the authors of the original code thought they were looking at
+ * with its test -- or at least not what Solaris did -- and as a result that
+ * test never triggers.  That leaves the only reclaim trigger as the "paging
+ * needed" status flag, and by the time that trips the system is already
+ * in low-memory trouble.  This can lead to severe pathological behavior
+ * under the following scenario:
+ * - The system starts to page and ARC is evicted.
+ * - The system stops paging as ARC's eviction drops wired RAM a bit.
+ * - ARC starts increasing its allocation again, and wired memory grows.
+ * - A new image is activated, and the system once again attempts to page.
+ * - ARC starts to be evicted again.
+ * - Back to #2
+ *
+ * Note that ZFS's ARC default (unless you override it in /boot/loader.conf)
+ * is to allow the ARC cache to grab nearly all of free RAM, provided nobody
+ * else needs it.  That would be ok if we evicted cache when required.
+ *
+ * Unfortunately the system can get into a state where it never
+ * manages to page anything of materiality back in, because if there is active
+ * I/O the ARC will start grabbing space once again as soon as the memory
+ * contention state drops.  For this reason the "paging is occurring" flag
+ * should be the **last resort** condition for ARC eviction; you want to
+ * (as Solaris does) start when there is material free RAM left BUT the
+ * vm system thinks it needs to be active to steal pages back, in the attempt
+ * to never get into the condition where you're potentially paging off
+ * executables in favor of leaving disk cache allocated.
+ *
+ * To fix this we change how we look at low memory, declaring two new
+ * runtime tunables and one status.
+ *
+ * The new sysctls are:
+ * vfs.zfs.arc_freepages (free pages required to call RAM "sufficient")
+ * vfs.zfs.arc_freepage_percent (additional reservation percentage, default 0)
+ * vfs.zfs.arc_shrink_needed (shows "1" if we're asking for shrinking the ARC)
+ *
+ * vfs.zfs.arc_freepages is initialized from vm.v_free_target.
+ * This should ensure that we allow the VM system to steal pages,
+ * but pare the cache before we suspend processes attempting to get more
+ * memory, thereby avoiding "stalls."  You can set this higher if you wish,
+ * or force a specific percentage reservation as well, but doing so may
+ * cause the cache to pare back while the VM system remains willing to
+ * allow "inactive" pages to accumulate.  The challenge is that image
+ * activation can force things into the page space on a repeated basis
+ * if you allow this level to be too small (the above pathological
+ * behavior); the defaults should avoid that behavior but the sysctls
+ * are exposed should your workload require adjustment.
+ *
+ * If we're using this check for low memory we are replacing the previous
+ * ones, including the oddball "random" reclaim that appears to fire far
+ * more often than it should.  We still trigger if the system pages.
+ *
+ * If you turn on NEWRECLAIM_DEBUG then the kernel will print status
+ * messages on the console when the reclaim status trips on and off, along
+ * with the page count aggregate that triggered it (and the free space) for
+ * each event.
+ */
+
+ #define NEWRECLAIM
+ #undef NEWRECLAIM_DEBUG
+
+
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates.  All rights reserved.
* Copyright (c) 2013 by Delphix. All rights reserved.
***************
*** 139,144 ****
--- 211,223 ----
 
#include <vm/vm_pageout.h>
 
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ #include <sys/sysctl.h>
+ #include <sys/vmmeter.h>
+ #endif
+ #endif /* NEWRECLAIM */
+
#ifdef illumos
#ifndef _KERNEL
/* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
***************
*** 203,218 ****
--- 282,320 ----
int zfs_arc_shrink_shift = 0;
int zfs_arc_p_min_shift = 0;
int zfs_disable_dup_eviction = 0;
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ static int freepages = 0;	/* This much memory is considered critical */
+ static int percent_target = 0;	/* Additionally reserve "X" percent free RAM */
+ static int shrink_needed = 0;	/* Shrinkage of ARC cache needed? */
+ #endif /* __FreeBSD__ */
+ #endif /* NEWRECLAIM */
 
TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ TUNABLE_INT("vfs.zfs.arc_freepages", &freepages);
+ TUNABLE_INT("vfs.zfs.arc_freepage_percent", &percent_target);
+ TUNABLE_INT("vfs.zfs.arc_shrink_needed", &shrink_needed);
+ #endif /* __FreeBSD__ */
+ #endif /* NEWRECLAIM */
+
SYSCTL_DECL(_vfs_zfs);
SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
    "Maximum ARC size");
SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
    "Minimum ARC size");
 
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepages, CTLFLAG_RWTUN, &freepages, 0, "ARC Free RAM Pages Required");
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent, CTLFLAG_RWTUN, &percent_target, 0, "ARC Free RAM Target percentage");
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_shrink_needed, CTLFLAG_RD, &shrink_needed, 0, "ARC Memory Constrained (0 = no, 1 = yes)");
+ #endif /* __FreeBSD__ */
+ #endif /* NEWRECLAIM */
+
/*
* Note that buffers can be in one of 6 states:
* ARC_anon - anonymous (discussed below)
***************
*** 2438,2443 ****
--- 2540,2550 ----
{
 
#ifdef _KERNEL
+ #ifdef NEWRECLAIM_DEBUG
+ static int xval = -1;
+ static int oldpercent = 0;
+ static int oldfreepages = 0;
+ #endif /* NEWRECLAIM_DEBUG */
 
if (needfree)
return (1);
***************
*** 2476,2481 ****
--- 2583,2589 ----
return (1);
 
#if defined(__i386)
+
/*
* If we're on an i386 platform, it's possible that we'll exhaust the
* kernel heap space before we ever run out of available physical
***************
*** 2492,2502 ****
return (1);
#endif
#else /* !sun */
if (kmem_used() > (kmem_size() * 3) / 4)
return (1);
#endif /* sun */
 
- #else
if (spa_get_random(100) == 0)
return (1);
#endif
--- 2600,2664 ----
return (1);
#endif
#else /* !sun */
+
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ /*
+ * Implement the new tunable free RAM algorithm.  We check the free pages
+ * against the minimum specified target and the percentage that should be
+ * free.  If we're low we ask for ARC cache shrinkage.  If this is defined
+ * on a FreeBSD system the older checks are not performed.
+ *
+ * Check first to see if we need to init freepages, then test.
+ */
+ if (!freepages) {		/* If zero then (re)init */
+ 	freepages = cnt.v_free_target;
+ #ifdef NEWRECLAIM_DEBUG
+ 	printf("ZFS ARC: Default vfs.zfs.arc_freepages to [%u]\n", freepages);
+ #endif /* NEWRECLAIM_DEBUG */
+ }
+ #ifdef NEWRECLAIM_DEBUG
+ if (percent_target != oldpercent) {
+ 	printf("ZFS ARC: Reservation percent change to [%d], [%d] pages, [%d] free\n", percent_target, cnt.v_page_count, cnt.v_free_count);
+ 	oldpercent = percent_target;
+ }
+ if (freepages != oldfreepages) {
+ 	printf("ZFS ARC: Low RAM page change to [%d], [%d] pages, [%d] free\n", freepages, cnt.v_page_count, cnt.v_free_count);
+ 	oldfreepages = freepages;
+ }
+ #endif /* NEWRECLAIM_DEBUG */
+ /*
+ * Now figure out how much free RAM we require to call the ARC cache status
+ * "ok".  Add the percentage specified of the total to the base requirement.
+ */
+
+ if (cnt.v_free_count < (freepages + ((cnt.v_page_count / 100) * percent_target))) {
+ #ifdef NEWRECLAIM_DEBUG
+ 	if (xval != 1) {
+ 		printf("ZFS ARC: RECLAIM total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", cnt.v_page_count, cnt.v_free_count, ((cnt.v_free_count * 100) / cnt.v_page_count), freepages, percent_target);
+ 		xval = 1;
+ 	}
+ #endif /* NEWRECLAIM_DEBUG */
+ 	shrink_needed = 1;
+ 	return (1);
+ } else {
+ #ifdef NEWRECLAIM_DEBUG
+ 	if (xval != 0) {
+ 		printf("ZFS ARC: NORMAL total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", cnt.v_page_count, cnt.v_free_count, ((cnt.v_free_count * 100) / cnt.v_page_count), freepages, percent_target);
+ 		xval = 0;
+ 	}
+ #endif /* NEWRECLAIM_DEBUG */
+ 	shrink_needed = 0;
+ 	return (0);
+ }
+
+ #endif /* __FreeBSD__ */
+ #endif /* NEWRECLAIM */
+
if (kmem_used() > (kmem_size() * 3) / 4)
return (1);
#endif /* sun */
 
if (spa_get_random(100) == 0)
return (1);
#endif
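
To make the new check concrete, here is the arithmetic with illustrative
numbers (the v_free_target figure below is an assumed round value, not a
measurement): on a 16 GB machine with 4 KiB pages, cnt.v_page_count is
4,194,304.  If vfs.zfs.arc_freepages initializes from a v_free_target of
87,000 pages (~340 MB) and vfs.zfs.arc_freepage_percent is set to 1, the
trigger point is 87,000 + (4,194,304 / 100) * 1 = 128,943 pages (~504 MB);
whenever cnt.v_free_count drops below that, the routine returns 1 and
vfs.zfs.arc_shrink_needed reads 1 until free pages recover.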
-- 
-- Karl
karl at denninger.net