Re: Chasing OOM Issues - good sysctl metrics to use?

From: Mark Millard <marklmi_at_yahoo.com>
Date: Wed, 11 May 2022 19:52:28 UTC
On 2022-May-10, at 20:31, Mark Millard <marklmi@yahoo.com> wrote:

> On 2022-May-10, at 17:49, Mark Millard <marklmi@yahoo.com> wrote:
> 
>> On 2022-May-10, at 11:49, Mark Millard <marklmi@yahoo.com> wrote:
>> 
>>> On 2022-May-10, at 08:47, Jan Mikkelsen <janm@transactionware.com> wrote:
>>> 
>>>> On 10 May 2022, at 10:01, Mark Millard <marklmi@yahoo.com> wrote:
>>>>> 
>>>>> On 2022-Apr-29, at 13:57, Mark Millard <marklmi@yahoo.com> wrote:
>>>>> 
>>>>>> On 2022-Apr-29, at 13:41, Pete Wright <pete@nomadlogic.org> wrote:
>>>>>>> 
>>>>>>>> . . .
>>>>>>> 
>>>>>>> d'oh - went out for lunch and workstation locked up.  i *knew* i shouldn't have said anything lol.
>>>>>> 
>>>>>> Any interesting console messages ( or dmesg -a or /var/log/messages )?
>>>>>> 
>>>>> 
>>>>> I've been doing some testing of a patch by tijl at FreeBSD.org
>>>>> and have reproduced both hang-ups (ZFS/ARC context) and kills
>>>>> (UFS/noARC and ZFS/ARC) for "was killed: failed to reclaim
>>>>> memory", both with and without the patch. This is with only a
>>>>> tiny fraction of the swap partition(s) enabled being put to
>>>>> use. So far, the testing was deliberately with
>>>>> vm.pageout_oom_seq=12 (the default value). My testing has been
>>>>> with main [so: 14].
>>>>> 
>>>>> But I also learned how to avoid the hang-ups that I got --but
>>>>> it costs making kills more likely/quicker, other things being
>>>>> equal.
>>>>> 
>>>>> I discovered that the hang-ups that I got came from all the
>>>>> processes that I interact with the system through ending up
>>>>> with their kernel stacks swapped out and never being swapped
>>>>> back in (including sshd, so no new ssh connections). In some
>>>>> contexts I only had escaping into the kernel debugger
>>>>> available; not even ^T would work. Other times ^T did work.
>>>>> 
>>>>> So, when I'm willing to risk kills in order to maintain
>>>>> the ability to interact normally, I now use in
>>>>> /etc/sysctl.conf :
>>>>> 
>>>>> vm.swap_enabled=0
>>>> 
>>>> I have been looking at an OOM related issue. Ignoring the actual leak, the problem leads to a process being killed because the system was out of memory. This is fine. After that, however, the system console was black with a single block cursor and the console keyboard was unresponsive. Caps lock and num lock didn’t toggle their lights when pressed.
>>>> 
>>>> Using an ssh session, the system looked fine. USB events for the keyboard being disconnected and reconnected appeared but the keyboard stayed unresponsive.
>>>> 
>>>> Setting vm.swap_enabled=0, as you did above, resolved this problem. After the process was killed a perfectly normal console returned.
>>>> 
>>>> The interesting thing is that this test system is configured with no swap space.
>>>> 
>>>> This is on 13.1-RC5.
>>>> 
>>>>> This disables swapping out of process kernel stacks. It
>>>>> is just that, with that option for gaining free RAM removed,
>>>>> there are fewer options tried before a kill is initiated. It
>>>>> is not a loader-time tunable but is writable at runtime, thus
>>>>> the /etc/sysctl.conf placement.
>>>> 
>>>> Is that really what it does? From a quick look at the code in vm/vm_swapout.c, it seems a little more complex.
>>> 
>>> I was going by its description:
>>> 
>>> # sysctl -d vm.swap_enabled
>>> vm.swap_enabled: Enable entire process swapout
>>> 
>>> Based on the below, it appears that the description
>>> presumes vm.swap_idle_enabled==0 (the default). In
>>> my context vm.swap_idle_enabled==0 . Looks like I
>>> should also list:
>>> 
>>> vm.swap_idle_enabled=0
>>> 
>>> in my /etc/sysctl.conf with a reminder comment that the
>>> pair of =0's are required for avoiding the observed
>>> hang-ups.
>>> 
>>> 
>>> The analysis goes like this . . .
>>> 
>>> I see in the code that vm.swap_enabled !=0 causes
>>> VM_SWAP_NORMAL :
>>> 
>>> void
>>> vm_swapout_run(void)
>>> {
>>> 
>>>      if (vm_swap_enabled)
>>>              vm_req_vmdaemon(VM_SWAP_NORMAL);
>>> }
>>> 
>>> and that in turn leads vm_daemon to:
>>> 
>>>              if (swapout_flags != 0) {
>>>                      /*
>>>                       * Drain the per-CPU page queue batches as a deadlock
>>>                       * avoidance measure.
>>>                       */
>>>                      if ((swapout_flags & VM_SWAP_NORMAL) != 0)
>>>                              vm_page_pqbatch_drain();
>>>                      swapout_procs(swapout_flags);
>>>              }
>>> 
>>> Note: vm.swap_idle_enabled==0 && vm.swap_enabled==0 ends
>>> up with swapout_flags==0. vm.swap_idle. . . defaults seem
>>> to be (in my context):
>>> 
>>> # sysctl -a | grep swap_idle
>>> vm.swap_idle_threshold2: 10
>>> vm.swap_idle_threshold1: 2
>>> vm.swap_idle_enabled: 0
>>> 
>>> For reference:
>>> 
>>> /*
>>> * Idle process swapout -- run once per second when pagedaemons are
>>> * reclaiming pages.
>>> */
>>> void
>>> vm_swapout_run_idle(void)
>>> {
>>>      static long lsec;
>>> 
>>>      if (!vm_swap_idle_enabled || time_second == lsec)
>>>              return;
>>>      vm_req_vmdaemon(VM_SWAP_IDLE);
>>>      lsec = time_second;
>>> }
>>> 
>>> [So vm.swap_idle_enabled==0 avoids VM_SWAP_IDLE status.]
>>> 
>>> static void
>>> vm_req_vmdaemon(int req)
>>> {
>>>      static int lastrun = 0;
>>> 
>>>      mtx_lock(&vm_daemon_mtx);
>>>      vm_pageout_req_swapout |= req;
>>>      if ((ticks > (lastrun + hz)) || (ticks < lastrun)) {
>>>              wakeup(&vm_daemon_needed);
>>>              lastrun = ticks;
>>>      }
>>>      mtx_unlock(&vm_daemon_mtx);
>>> }
>>> 
>>> [So VM_SWAP_IDLE and VM_SWAP_NORMAL are independent bits
>>> in vm_pageout_req_swapout.]
>>> 
>>> vm_daemon does:
>>> 
>>>              mtx_lock(&vm_daemon_mtx);
>>>              msleep(&vm_daemon_needed, &vm_daemon_mtx, PPAUSE, "psleep",
>>>                  vm_daemon_timeout);
>>>              swapout_flags = vm_pageout_req_swapout;
>>>              vm_pageout_req_swapout = 0;
>>>              mtx_unlock(&vm_daemon_mtx);
>>> 
>>> So vm_pageout_req_swapout is regenerated after that each
>>> time.
>>> 
>>> I'll not show the code for vm.swap_idle_enabled!=0 .
>>> 
>> 
>> Well, with continued experiments I got an example of
>> a hangup for which looking via the db> prompt did not
>> show any swapping out of process kernel stacks
>> ( vm.swap_enabled=0 was the context, so expected ).
>> The environment was ZFS (so with ARC).
>> 
>> But this was testing with vm.pageout_oom_seq=120 instead
>> of the default vm.pageout_oom_seq=12 . It may be that, if
>> left to sit long enough, things would have unhung (from an
>> external perspective).
>> 
>> It is part of what I'm experimenting with so we will see.
>> 
> 
> Looks like I might have overreacted, in that for my
> current tests there can be brief periods of delayed
> response, but things respond in a little bit.
> Definitely not like the hang-ups I was getting with
> vm.swap_enabled=1 .
> 

The following is based on using vm.pageout_oom_seq=120,
which greatly delays kills. (I've never waited long enough
to see one.) vm.pageout_oom_seq=12 tends to get a kill
fairly quickly, making the behavior below hard to observe.
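
(For reference, vm.pageout_oom_seq is an ordinary runtime
sysctl, so, assuming a root shell, the larger value can be
set for a test with something like:

# sysctl vm.pageout_oom_seq=120

or made persistent via a vm.pageout_oom_seq=120 line in
/etc/sysctl.conf .)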

More testing has shown that it can still hang up with
vm.swap_enabled=0 and vm.swap_idle_enabled=0 in use --but
the details I've observed suggest a livelock rather than
a deadlock. It appears that the likes of the following
(extracted from output at the db> prompt):

1171  1168  1168     0  R+      CPU 2                       stress
1170  1168  1168     0  R+      CPU 0                       stress
and:
 18     0     0     0  RL      (threaded)                  [pagedaemon]
100120                   Run     CPU 1                       [dom0]
100132                   D       launds  0xffff000000f1dc44  [laundry: dom0]
100133                   D       umarcl  0xffff0000007d8424  [uma]

stay busy, using power much as when I have just those
significantly active and the system is not hung up
(roughly the 30.6W..30.8W range, where idle is more like
26W, and more general activity being involved ends up
with the power jumping around over a wider range, for
example).

I have observed non-hung-up tests where the 2 stress
processes using the memory were getting around 99% in
top and [pagedaemon{dom0}] was getting around 90%, but a
grep was getting more like 0.04%. That looks like a
near-livelock, and it is what prompted checking whether
the evidence also suggested a livelock for a hang-up.

Looking via the db> prompt has always looked like the
above. (Sometimes I've used 3 memory-using stress
processes, but now usually 2, typically leaving one CPU
idle.)

That in turn led to monitoring the power draw, with the
results mentioned above.

I have also observed hang-up-like cases where the top
that had been running would sometimes get individual
screen updates many minutes apart. Combined with the
power-usage pattern, that again looks like a (near)
livelock.


Relative to avoiding hang-ups, so far it seems that use
of vm.swap_enabled=0 with vm.swap_idle_enabled=0 makes
hang-ups less likely/less frequent/harder to produce
examples of, but it is no guarantee against a hang-up.
It does change the cause of the hang-up (in that it
avoids involving processes whose kernel stacks have been
swapped out).
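
For reference, the pair ends up in my /etc/sysctl.conf
along the following lines (a sketch; the comment wording
is just illustrative):

# Both =0 settings are needed to avoid the hang-ups tied to
# process kernel stacks being swapped out and not swapped
# back in:
vm.swap_enabled=0
vm.swap_idle_enabled=0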

What I do to avoid rebooting after a hang-up that I am
done with is to kill the memory-using stress processes at
the db> prompt and then c out of the kernel debugger
(i.e., continue). So far the system has always returned
to normal in response.
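
As a rough sketch (assuming ddb's "kill sig pid" syntax
and using the example PIDs 1170 and 1171 from the
extraction above), that amounts to something like:

db> kill 9 1171
db> kill 9 1170
db> c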

===
Mark Millard
marklmi at yahoo.com