Re: curious crashes when under memory pressure

From: Peter 'PMc' Much <pmc_at_citylink.dinoex.sub.org>
Date: Sat, 04 Jan 2025 17:29:31 UTC
On Sat, Jan 04, 2025 at 08:27:06AM -0800, Chris Torek wrote:
! On Sat, Jan 4, 2025 at 7:01 AM Peter 'PMc' Much
! <pmc@citylink.dinoex.sub.org> wrote:
! >> I'm swapping to a zfs mirror
! >
! > Well, You shouldn't do that.
! 
! Why not? Swapping to a *file* on zfs has obvious issues, but swapping
! to a mirrored swap partition seems like it should be entirely safe. A

A "mirrored swap partition" - that would be a zfs volume inside a zfs
pool which runs on some vdevs which happen to be mirrored, right?
I don't know of zfs itself having any notion of "partitions". It
supports volumes, and these have almost all the same features as
filesystems: checksumming, compression, txg buffering, logging,
snapshotting, etc.

So I tend to doubt that this is safe. I can't give You logical proof
(it's more than ten years since I looked deeper into the zfs source),
but my gut feeling says there are so many creepy things going on
in the zfs layer nowadays (and very likely a bunch of undiscovered
bugs also), that one should avoid such a stack.
Also, the idea of paging into zfs got popular at about the same time
that it got popular to normally not use swap at all, as lots of memory
became available. And while running a system with serious paging (into
tens of GB) is practical, it is probably not the use case where we
would page into zfs.

A zfs vdev is logically just a fixed-length file - aka a raw partition.
Then above that thing is the zfs logic, with lots of caches. There
is not only the ARC that data must go through, there is other dbuf
handling, there is more handling on the vdev layer, and all of that
needs some memory. (I looked into these various buffers when I patched
things so zfs gets a bit more NUMA-friendly - many of them use the
UMA allocator scheme, which again has its own mechanics.)
Then, above all this memory-consuming stuff, finally comes the kernel
that wants to page out, and would expect the pageout to go directly
onto a fixed-length file, aka a raw partition.
That doesn't look very sane to me, so what I am saying is: before
You spend time hunting this bug, give it a try with direct
raw-partition paging. At least then we know whether it happens there
also, or not - and that helps narrow the search.
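
For illustration, a minimal dry-run sketch of that test. It only prints
the commands and executes nothing. Both device names are assumptions
(a zvol at /dev/zvol/zroot/swap and a spare freebsd-swap partition at
/dev/ada0p3) and need to be replaced with whatever the machine
actually has; plan_swap_test is just a hypothetical helper name.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for moving swap from a zvol to a
# raw partition. Nothing is executed. Both device names are hypothetical.

ZVOL_SWAP="${ZVOL_SWAP:-/dev/zvol/zroot/swap}"  # assumed current zvol swap
RAW_SWAP="${RAW_SWAP:-/dev/ada0p3}"             # assumed spare swap partition

plan_swap_test() {
    # $1 = swap device to retire, $2 = raw partition to page to instead
    echo "swapinfo"      # show what is in use right now
    echo "swapon $2"     # add the raw partition first ...
    echo "swapoff $1"    # ... then retire the zvol-backed swap
    echo "swapinfo"      # confirm only the raw partition is left
}

plan_swap_test "$ZVOL_SWAP" "$RAW_SWAP"
```

The raw partition is added before the zvol is removed, so that swapoff
has somewhere to move pages that are currently swapped out. Once the
printed commands look right, they can be run by hand as root and the
workload that triggered the crash repeated.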

! bit slow (double writes) but I spent $ on RAM rather than M.2 drives
! on the theory that I can add those later as needed.

It doesn't need a superfast SSD, at least not for testing. Pageout
happens async, and while pagein stalls the concerned process, that is
a read, and reads should be faster.

cheerio,
PMc