The out-of-swap killer makes poor choices

Wed Feb 24 23:53:42 UTC 2021

On 2021-Feb-24, at 11:59, Mark Millard <marklmi at yahoo.com> wrote:

> On 2021-Feb-24, at 10:36, Konstantin Belousov <kostikbel T gmail.com> wrote:
> 
>> On Wed, Feb 24, 2021 at 10:34:23AM -0700, Alan Somers wrote:
>>> There's another silly problem that I didn't mention in my original post.
>>> The old rule of thumb is that the swap partition's size should be twice as
>>> large as the amount of RAM.  However, that's no longer possible in many
>>> cases.  The kernel imposes a hard limit of 64 GiB (on amd64 at least) on
>>> the usable size of any swap partition, and many servers now have far more
>>> than 64 GiB of RAM.  So the advice needs to change with the times.  I don't
>> I do not think so. The usable size of the swap is determined by the
>> amount of swap metadata we pre-configure at boot time. Usually it is
>> sized proportionally to the available physical memory, but you can
>> override swap zones size manually with the knob.
> 
> There was a period of time when the 128 GiByte RAM ThreadRipper
> had its previous 192 GiByte swap partition use rejected and I
> had to split it into 3 64 GiByte ones. Later I saw a checkin that
> was a correction to some calculation (vague memory) and I retried
> having one 192 GiByte swap partition and it was again allowed.
> 
> The ability to dump to a swap partition when there was a
> 64 GiByte limitation with 128 GiByte of RAM had implications
> for the configuration. I actually arranged having a partition
> that was only used for dump's potential use. That took some
> rearrangement to form a large enough space, making other
> tradeoffs to do so.
> 
> 
> (I'm not sure if I can find the commit that led to me switching
> back to more than 64 GiByte for a swap file on the large memory
> machine. I do not remember details any more.)

The 64 GiByte size limit (as seen in my environment) was
removed in:

https://cgit.freebsd.org/src/commit/sys/vm/swap_pager.c?id=00fd73d2dabdee2638203dd1145f007787f05be9
a.k.a.:
https://svnweb.freebsd.org/base?view=revision&revision=363532

QUOTE
author	Doug Moore <dougm at FreeBSD.org>	2020-07-25 18:29:10 +0000
committer	Doug Moore <dougm at FreeBSD.org>	2020-07-25 18:29:10 +0000
. . .

Fix an overflow bug in the blist allocator that needlessly capped max
swap size by dividing a value, which was always a multiple of 64, by
64.  Remove the code that reduced max swap size down to that cap.

Eliminate the distinction between BLIST_BMAP_RADIX and
BLIST_META_RADIX.  Call them both BLIST_RADIX.

Make improvments to the blist self-test code to silence compiler
warnings and to test larger blists.

Reported by:	jmallett
Reviewed by:	alc
Discussed with:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D25736

Notes:
    svn path=/head/; revision=363532
END QUOTE

The sequence of evidence that led me there:

I established a large swap partition on a device holding
an old snapshot of my ThreadRipper environment,
resulting in:

# gpart show -pl nvd1
=>       40  937703008    nvd1  GPT  (447G)
         40       1024  nvd1p1  FBSDFSSDboot  (512K)
       1064  746586112  nvd1p2  FBSDFSSDroot  (356G)
  746587176  191115872  nvd1p3  FBSDFSSDswap  (91G)

I got a kernel from the ci.freebsd.org artifacts and put
it in place on the old snapshot of my ThreadRipper environment
(which could no longer even boot, because of ACPI incompatibilities),
so replacing the old failing kernel but leaving the rest unchanged:

# uname -apKU
FreeBSD FBSDFSSD 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r358314: Tue Feb 25 18:08:20 UTC 2020     root at FreeBSD-head-amd64-build.jail.ci.FreeBSD.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64 amd64 1300081 1300037

So: old head (13) environment booted on the 128 GiByte
ThreadRipper:

From /var/log/messages:

WARNING: reducing swap size to maximum of 65536MB per unit

# swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/gpt/FBSDFSSDswap  67108864        0 67108864     0%

The code that produced the message and limited
the size was in sys/vm/swap_pager.c back in that
time frame:

static void
swaponsomething(struct vnode *vp, void *id, u_long nblks,
    sw_strategy_t *strategy, sw_close_t *close, dev_t dev, int flags)
{
        struct swdevt *sp, *tsp;
        swblk_t dvbase;
        u_long mblocks;

        /*
         * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
         * First chop nblks off to page-align it, then convert.
         *
         * sw->sw_nblks is in page-sized chunks now too.
         */
        nblks &= ~(ctodb(1) - 1);
        nblks = dbtoc(nblks);

        /*
         * If we go beyond this, we get overflows in the radix
         * tree bitmap code.
         */
        mblocks = 0x40000000 / BLIST_META_RADIX;
        if (nblks > mblocks) {
                printf(
    "WARNING: reducing swap size to maximum of %luMB per unit\n",
                    mblocks / 1024 / 1024 * PAGE_SIZE);
                nblks = mblocks;
        }
. . .

Then I used blame to find the fix in git by looking at:

https://cgit.freebsd.org/src/blame/sys/vm/swap_pager.c

>>> know what the best size would be for a modern server, but I would guess
>>> that it must be at least several times the RSS of your largest process, and
>>> also at least one tenth of RAM (for use as a dump device with compressed
>>> core dumps).

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)