Re: Chasing OOM Issues - good sysctl metrics to use?
Date: Wed, 11 May 2022 19:52:28 UTC
On 2022-May-10, at 20:31, Mark Millard <marklmi@yahoo.com> wrote:

> On 2022-May-10, at 17:49, Mark Millard <marklmi@yahoo.com> wrote:
>
>> On 2022-May-10, at 11:49, Mark Millard <marklmi@yahoo.com> wrote:
>>
>>> On 2022-May-10, at 08:47, Jan Mikkelsen <janm@transactionware.com> wrote:
>>>
>>>> On 10 May 2022, at 10:01, Mark Millard <marklmi@yahoo.com> wrote:
>>>>>
>>>>> On 2022-Apr-29, at 13:57, Mark Millard <marklmi@yahoo.com> wrote:
>>>>>
>>>>>> On 2022-Apr-29, at 13:41, Pete Wright <pete@nomadlogic.org> wrote:
>>>>>>>
>>>>>>>> . . .
>>>>>>>
>>>>>>> d'oh - went out for lunch and workstation locked up. i *knew* i shouldn't have said anything lol.
>>>>>>
>>>>>> Any interesting console messages ( or dmesg -a or /var/log/messages )?
>>>>>>
>>>>>
>>>>> I've been doing some testing of a patch by tijl at FreeBSD.org
>>>>> and have reproduced both hang-ups (ZFS/ARC context) and kills
>>>>> (UFS/noARC and ZFS/ARC) for "was killed: failed to reclaim
>>>>> memory", both with and without the patch. This is with only a
>>>>> tiny fraction of the enabled swap partition(s) being put to
>>>>> use. So far, the testing was deliberately with
>>>>> vm.pageout_oom_seq=12 (the default value). My testing has been
>>>>> with main [so: 14].
>>>>>
>>>>> But I also learned how to avoid the hang-ups that I got -- but
>>>>> it costs making kills more likely/quicker, other things being
>>>>> equal.
>>>>>
>>>>> I discovered that the hang-ups that I got were from all the
>>>>> processes that I interact with the system via ending up with
>>>>> their kernel threads swapped out and not being swapped back
>>>>> in (including sshd, so no new ssh connections). In some
>>>>> contexts I only had escaping into the kernel debugger
>>>>> available; not even ^T would work. Other times ^T did work.
>>>>>
>>>>> So, when I'm willing to risk kills in order to maintain
>>>>> the ability to interact normally, I now use in
>>>>> /etc/sysctl.conf :
>>>>>
>>>>> vm.swap_enabled=0
>>>>
>>>> I have been looking at an OOM related issue. Ignoring the
>>>> actual leak, the problem leads to a process being killed
>>>> because the system was out of memory. This is fine. After
>>>> that, however, the system console was black with a single
>>>> block cursor and the console keyboard was unresponsive. Caps
>>>> lock and num lock didn't toggle their lights when pressed.
>>>>
>>>> Using an ssh session, the system looked fine. USB events for
>>>> the keyboard being disconnected and reconnected appeared but
>>>> the keyboard stayed unresponsive.
>>>>
>>>> Setting vm.swap_enabled=0, as you did above, resolved this
>>>> problem. After the process was killed a perfectly normal
>>>> console returned.
>>>>
>>>> The interesting thing is that this test system is configured
>>>> with no swap space.
>>>>
>>>> This is on 13.1-RC5.
>>>>
>>>>> This disables swapping out of process kernel stacks. It is
>>>>> just that with that option removed for gaining free RAM,
>>>>> there are fewer options tried before a kill is initiated.
>>>>> It is not a loader-time tunable but is writable, thus the
>>>>> /etc/sysctl.conf placement.
>>>>
>>>> Is that really what it does? From a quick look at the code in
>>>> vm/vm_swapout.c, it seems a little more complex.
>>>
>>> I was going by its description:
>>>
>>> # sysctl -d vm.swap_enabled
>>> vm.swap_enabled: Enable entire process swapout
>>>
>>> Based on the below, it appears that the description
>>> presumes vm.swap_idle_enabled==0 (the default). In
>>> my context vm.swap_idle_enabled==0 . Looks like I
>>> should also list:
>>>
>>> vm.swap_idle_enabled=0
>>>
>>> in my /etc/sysctl.conf with a reminder comment that the
>>> pair of =0's is required for avoiding the observed
>>> hang-ups.
>>>
>>>
>>> The analysis goes like . . .
>>>
>>> I see in the code that vm.swap_enabled != 0 causes
>>> VM_SWAP_NORMAL :
>>>
>>> void
>>> vm_swapout_run(void)
>>> {
>>>
>>>         if (vm_swap_enabled)
>>>                 vm_req_vmdaemon(VM_SWAP_NORMAL);
>>> }
>>>
>>> and that in turn leads vm_daemon to:
>>>
>>>         if (swapout_flags != 0) {
>>>                 /*
>>>                  * Drain the per-CPU page queue batches as a deadlock
>>>                  * avoidance measure.
>>>                  */
>>>                 if ((swapout_flags & VM_SWAP_NORMAL) != 0)
>>>                         vm_page_pqbatch_drain();
>>>                 swapout_procs(swapout_flags);
>>>         }
>>>
>>> Note: vm.swap_idle_enabled==0 && vm.swap_enabled==0 ends
>>> up with swapout_flags==0. The vm.swap_idle.* defaults
>>> seem to be (in my context):
>>>
>>> # sysctl -a | grep swap_idle
>>> vm.swap_idle_threshold2: 10
>>> vm.swap_idle_threshold1: 2
>>> vm.swap_idle_enabled: 0
>>>
>>> For reference:
>>>
>>> /*
>>>  * Idle process swapout -- run once per second when pagedaemons are
>>>  * reclaiming pages.
>>>  */
>>> void
>>> vm_swapout_run_idle(void)
>>> {
>>>         static long lsec;
>>>
>>>         if (!vm_swap_idle_enabled || time_second == lsec)
>>>                 return;
>>>         vm_req_vmdaemon(VM_SWAP_IDLE);
>>>         lsec = time_second;
>>> }
>>>
>>> [So vm.swap_idle_enabled==0 avoids VM_SWAP_IDLE status.]
>>>
>>> static void
>>> vm_req_vmdaemon(int req)
>>> {
>>>         static int lastrun = 0;
>>>
>>>         mtx_lock(&vm_daemon_mtx);
>>>         vm_pageout_req_swapout |= req;
>>>         if ((ticks > (lastrun + hz)) || (ticks < lastrun)) {
>>>                 wakeup(&vm_daemon_needed);
>>>                 lastrun = ticks;
>>>         }
>>>         mtx_unlock(&vm_daemon_mtx);
>>> }
>>>
>>> [So VM_SWAP_IDLE and VM_SWAP_NORMAL are independent bits
>>> in vm_pageout_req_swapout.]
>>>
>>> vm_daemon does:
>>>
>>>         mtx_lock(&vm_daemon_mtx);
>>>         msleep(&vm_daemon_needed, &vm_daemon_mtx, PPAUSE, "psleep",
>>>             vm_daemon_timeout);
>>>         swapout_flags = vm_pageout_req_swapout;
>>>         vm_pageout_req_swapout = 0;
>>>         mtx_unlock(&vm_daemon_mtx);
>>>
>>> So vm_pageout_req_swapout is cleared and then regenerated
>>> each time.
>>>
>>> I'll not show the code for vm.swap_idle_enabled!=0 .
>>>
>>
>> Well, with continued experiments I got an example of
>> a hang-up for which looking via the db> prompt did not
>> show any swapping out of process kernel stacks
>> ( vm.swap_enabled=0 was the context, so this was expected ).
>> The environment was ZFS (so with ARC).
>>
>> But this was testing with vm.pageout_oom_seq=120 instead
>> of the default vm.pageout_oom_seq=12 . It may be that,
>> left to sit long enough, things would have unhung (from
>> an external perspective).
>>
>> It is part of what I'm experimenting with, so we will see.
>>
>
> Looks like I might have overreacted, in that for my
> current tests there can be brief periods of delayed
> response, but things respond in a little bit.
> Definitely not like the hang-ups I was getting with
> vm.swap_enabled=1 .
>
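To pull the pair together, my /etc/sysctl.conf now carries
something like the below (the comment wording is merely
illustrative):

# Both must be 0 to avoid the hang-ups tied to process
# kernel stacks being swapped out; the tradeoff is that
# kills become more likely/quicker, other things being equal.
vm.swap_enabled=0
vm.swap_idle_enabled=0

As with vm.swap_enabled, vm.swap_idle_enabled appears to be
runtime-writable rather than a loader tunable, hence the
/etc/sysctl.conf placement instead of /boot/loader.conf.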
The following is based on using vm.pageout_oom_seq=120, which
greatly delays kills. (I've never waited long enough to see
one.) vm.pageout_oom_seq=12 tends to get a kill fairly
quickly, making the behavior below hard to observe.

More testing has shown that hang-ups can still occur with
vm.swap_enabled=0 and vm.swap_idle_enabled=0 -- but the
details I've observed suggest a livelock rather than a
deadlock.
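For anyone wanting to set up a similar test, the general
shape is along these lines (a sketch: the worker count and
allocation size here are illustrative, not exactly what I
ran):

# Delay OOM kills so the pre-kill behavior can be observed
# (runtime-writable; the default is 12):
sysctl vm.pageout_oom_seq=120
# Drive memory use with sysutils/stress: 2 workers that each
# allocate and keep redirtying more anonymous memory than
# free RAM can hold:
stress --vm 2 --vm-bytes 6g --vm-keep

The db> extractions below came from this sort of load.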
It appears that the likes of (extractions from db> ps output):

 1171 1168 1168    0  R+  CPU 2                      stress
 1170 1168 1168    0  R+  CPU 0                      stress

and:

   18    0    0    0  RL  (threaded)                 [pagedaemon]
100120                    Run     CPU 1              [dom0]
100132                    D  launds  0xffff000000f1dc44  [laundry: dom0]
100133                    D  umarcl  0xffff0000007d8424  [uma]

stay busy, using power much as when I have just those
significantly active and the system is not hung up (a
30.6W..30.8W or so range, where idle is more like 26W; more
general activity ends up with the power jumping around over a
wider range, for example).

I have observed non-hung-up tests where the 2 stress
processes using the memory were getting around 99% in top and
[pagedaemon{dom0}] was getting around 90%, but a grep was
getting more like 0.04%. This looks like a near-livelock, and
it is what inspired looking at whether more of the same
suggested a livelock for a hang-up. Looking via db> has
always looked like the above. (Sometimes I've used 3
memory-using stress processes, but now usually 2, typically
leaving one CPU idle.) That in turn led to monitoring the
power, ending up as mentioned above.

I have also observed hang-up-like cases where the top that
had been running would sometimes get individual screen
updates many minutes apart. With the power usage pattern it
again seems like a (near) livelock.

Relative to avoiding hang-ups: so far it seems that use of
vm.swap_enabled=0 with vm.swap_idle_enabled=0 makes hang-ups
less likely/less frequent/harder to produce examples of, but
it is no guarantee of a lack of hang-ups. It does change the
cause of any hang-up (in that it avoids involving processes
with swapped-out kernel stacks).

What I do to avoid rebooting after a hang-up I'm done with is
to kill the memory-using stress processes via the db> prompt
and then c out of the kernel debugger (i.e., continue). So
far the system has always returned to normal in response.

===
Mark Millard
marklmi at yahoo.com