The out-of-swap killer makes poor choices
Mark Millard
marklmi at yahoo.com
Wed Feb 24 23:53:42 UTC 2021
On 2021-Feb-24, at 11:59, Mark Millard <marklmi at yahoo.com> wrote:
> On 2021-Feb-24, at 10:36, Konstantin Belousov <kostikbel at gmail.com> wrote:
>
>> On Wed, Feb 24, 2021 at 10:34:23AM -0700, Alan Somers wrote:
>>> There's another silly problem that I didn't mention in my original post.
>>> The old rule of thumb is that the swap partition's size should be twice as
>>> large as the amount of RAM. However, that's no longer possible in many
>>> cases. The kernel imposes a hard limit of 64 GiB (on amd64 at least) on
>>> the usable size of any swap partition, and many servers now have far more
>>> than 64 GiB of RAM. So the advice needs to change with the times. I don't
>> I do not think so. The usable size of the swap is determined by the
>> amount of swap metadata we pre-configure at boot time. Usually it is
>> sized proportionally to the available physical memory, but you can
>> override swap zones size manually with the knob.
>
> There was a period of time when the 128 GiByte RAM ThreadRipper
> had its previous 192 GiByte swap partition use rejected and I
> had to split it into 3 64 GiByte ones. Later I saw a checkin that
> was a correction to some calculation (vague memory) and I retried
> having one 192 GiByte swap partition and it was again allowed.
>
> The ability to dump to a swap partition when there was a
> 64 GiByte limitation with 128 GiByte of RAM had implications
> for the configuration. I actually arranged having a partition
> that was only used for dump's potential use. That took some
> rearrangement to form a large enough space, making other
> tradeoffs to do so.
>
>
> (I'm not sure if I can find the commit that led to me switching
> back to more than 64 GiByte for a swap file on the large memory
> machine. I do not remember details any more.)
The 64 GiByte size limit (as seen in my environment) was
replaced in:
https://cgit.freebsd.org/src/commit/sys/vm/swap_pager.c?id=00fd73d2dabdee2638203dd1145f007787f05be9
a.k.a.:
https://svnweb.freebsd.org/base?view=revision&revision=363532
QUOTE
author Doug Moore <dougm at FreeBSD.org> 2020-07-25 18:29:10 +0000
committer Doug Moore <dougm at FreeBSD.org> 2020-07-25 18:29:10 +0000
. . .
Fix an overflow bug in the blist allocator that needlessly capped max
swap size by dividing a value, which was always a multiple of 64, by
64. Remove the code that reduced max swap size down to that cap.
Eliminate the distinction between BLIST_BMAP_RADIX and
BLIST_META_RADIX. Call them both BLIST_RADIX.
Make improvments to the blist self-test code to silence compiler
warnings and to test larger blists.
Reported by: jmallett
Reviewed by: alc
Discussed with: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D25736
Notes:
svn path=/head/; revision=363532
END QUOTE
Evidence sequence leading me there:
Establish a large swap partition on a device with
an old snapshot of my ThreadRipper environment,
resulting in:
# gpart show -pl nvd1
=>       40  937703008    nvd1  GPT  (447G)
         40       1024  nvd1p1  FBSDFSSDboot  (512K)
       1064  746586112  nvd1p2  FBSDFSSDroot  (356G)
  746587176  191115872  nvd1p3  FBSDFSSDswap  (91G)
I got a kernel from the ci.freebsd.org artifacts and put
it in place on the old snapshot of my ThreadRipper environment
(which could no longer even boot, because of ACPI
incompatibilities), replacing the old failing kernel but leaving
the rest unchanged:
# uname -apKU
FreeBSD FBSDFSSD 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r358314: Tue Feb 25 18:08:20 UTC 2020 root at FreeBSD-head-amd64-build.jail.ci.FreeBSD.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 amd64 1300081 1300037
So: old head (13) environment booted on the 128 GiByte
ThreadRipper:
From /var/log/messages:
WARNING: reducing swap size to maximum of 65536MB per unit
# swapinfo
Device                 1K-blocks Used    Avail Capacity
/dev/gpt/FBSDFSSDswap   67108864    0 67108864     0%
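For comparing the gpart and swapinfo figures, here is a small sketch of the unit conversions. It assumes gpart is reporting 512-byte sectors on this device (consistent with the 91G it shows for 191115872 sectors); the helper names are mine and are not from any FreeBSD source.

```c
#include <stdint.h>

/* Assumption: gpart reports 512-byte sectors on this device. */
#define SECTOR_BYTES 512ULL

/* Convert a gpart sector count to the 1K-blocks unit that swapinfo prints. */
static uint64_t
sectors_to_kib(uint64_t sectors)
{
	return (sectors * SECTOR_BYTES / 1024ULL);
}

/* Convert 1K-blocks to whole GiBytes, rounding down. */
static uint64_t
kib_to_gib_floor(uint64_t kib)
{
	return (kib / (1024ULL * 1024ULL));
}
```

With these, the 191115872-sector partition works out to 95557936 1K-blocks (about 91 GiByte), while the 67108864 1K-blocks that swapinfo reports is exactly 64 GiByte, so roughly 27 GiByte of the partition was left unused by the old kernel.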
The code that produced the message and limited
the size was in sys/vm/swap_pager.c back in that
time frame:
static void
swaponsomething(struct vnode *vp, void *id, u_long nblks,
    sw_strategy_t *strategy, sw_close_t *close, dev_t dev, int flags)
{
	struct swdevt *sp, *tsp;
	swblk_t dvbase;
	u_long mblocks;

	/*
	 * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
	 * First chop nblks off to page-align it, then convert.
	 *
	 * sw->sw_nblks is in page-sized chunks now too.
	 */
	nblks &= ~(ctodb(1) - 1);
	nblks = dbtoc(nblks);

	/*
	 * If we go beyond this, we get overflows in the radix
	 * tree bitmap code.
	 */
	mblocks = 0x40000000 / BLIST_META_RADIX;
	if (nblks > mblocks) {
		printf(
    "WARNING: reducing swap size to maximum of %luMB per unit\n",
		    mblocks / 1024 / 1024 * PAGE_SIZE);
		nblks = mblocks;
	}
. . .
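As a cross-check of the arithmetic in that excerpt, the following sketch reproduces the cap calculation outside the kernel. It assumes PAGE_SIZE is 4096 (amd64) and that BLIST_META_RADIX was 64 in that time frame; the 64 is inferred from the logged 65536MB and from the commit message's "by 64", and I did not read it from blist.h. The ASSUMED_* names and the helper functions are mine.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions, not read from the kernel headers of that era: */
#define ASSUMED_PAGE_SIZE	4096ULL	/* amd64 page size */
#define ASSUMED_META_RADIX	64ULL	/* inferred from the logged 65536MB */

/* Same expression as the excerpt: mblocks = 0x40000000 / BLIST_META_RADIX. */
static uint64_t
swap_cap_pages(uint64_t radix)
{
	assert(radix != 0);
	return (0x40000000ULL / radix);
}

/* Same expression as the printf argument: mblocks / 1024 / 1024 * PAGE_SIZE. */
static uint64_t
swap_cap_mb(uint64_t pages, uint64_t page_size)
{
	return (pages / 1024 / 1024 * page_size);
}

/* The cap expressed in the 1K-blocks unit that swapinfo prints. */
static uint64_t
swap_cap_kib(uint64_t pages, uint64_t page_size)
{
	return (pages * (page_size / 1024));
}
```

Under those assumptions the cap is 16777216 pages, which gives 65536 for the MB figure in the warning and 67108864 1K-blocks, matching both the /var/log/messages line and the swapinfo output.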
Then I used blame to find the fix in git via looking at:
https://cgit.freebsd.org/src/blame/sys/vm/swap_pager.c
>>> know what the best size would be for a modern server, but I would guess
>>> that it must be at least several times the RSS of your largest process, and
>>> also at least one tenth of RAM (for use as a dump device with compressed
>>> core dumps).
===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)