Re: Chasing OOM Issues - good sysctl metrics to use?

From: Mark Millard <marklmi_at_yahoo.com>
Date: Fri, 29 Apr 2022 20:57:56 UTC
On 2022-Apr-29, at 13:41, Pete Wright <pete@nomadlogic.org> wrote:
> 
> On 4/29/22 11:38, Mark Millard wrote:
>> On 2022-Apr-29, at 11:08, Pete Wright <pete@nomadlogic.org> wrote:
>> 
>>> On 4/23/22 19:20, Pete Wright wrote:
>>>>> The Developers' Handbook has a section on debugging deadlocks that he
>>>>> referenced in a response to another report (on freebsd-hackers):
>>>>> 
>>>>> https://docs.freebsd.org/en/books/developers-handbook/kerneldebug/#kerneldebug-deadlocks
>>>> d'oh - thanks for the correction!
>>>> 
>>>> -pete
>>>> 
>>>> 
>>> Hello, I just wanted to provide an update on this issue.  The good news is that by removing the file-backed swap, the deadlocks have indeed gone away!  Thanks for sorting me out on that front, Mark!
>> Glad it helped.
> 
> d'oh - went out for lunch and my workstation locked up.  I *knew* I shouldn't have said anything, lol.

Any interesting console messages (or output from dmesg -a, or entries in /var/log/messages)?
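
If something was killed by the kernel, the reason should show up
there; a grep along these lines ought to find it (the "failed to
reclaim memory" wording is what I'd expect for this kind of kill):

$ dmesg -a | grep -i killed
$ grep -i killed /var/log/messages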

>>> I am still seeing a memory leak with either Firefox or Chrome (or maybe both, combining into a voltron of memory leaks?).  This morning Firefox and Chrome had been killed by the time I first logged in.  Fortunately the system has remained responsive for several hours, which was not the case previously.
>>> 
>>> When looking at my metrics I see vm.domain.0.stats.inactive take a nosedive from around 9 GB to 0 over the course of 1 minute.  The timing seems to align with roughly when Firefox crashed, and it is preceded by a large spike in vm.domain.0.stats.active from ~1 GB to 7 GB about 40 minutes before the apps crashed.  After the binaries were killed the memory metrics seem to have recovered (the laundry size grew, and the inactive size grew by several gigs, for example).
>> Since the form of kill here is tied to sustained low free memory
>> ("failed to reclaim memory"), you might want to report the
>> vm.domain.0.stats.free_count figures from various time frames as
>> well:
>> 
>> vm.domain.0.stats.free_count: Free pages
>> 
>> (It seems you are converting pages to byte counts in your report;
>> I'm not really worried about the units so long as they are
>> obvious.)
>> 
>> There are also figures possibly tied to the handling of the kill
>> activity, though some are more like thresholds than usage figures,
>> such as:
>> 
>> vm.domain.0.stats.free_severe: Severe free pages
>> vm.domain.0.stats.free_min: Minimum free pages
>> vm.domain.0.stats.free_reserved: Reserved free pages
>> vm.domain.0.stats.free_target: Target free pages
>> vm.domain.0.stats.inactive_target: Target inactive pages
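
For reference: something like the following should grab all of those
in one shot.  The figures are in pages, so multiply by hw.pagesize
(normally 4096) if you want bytes.

$ sysctl hw.pagesize \
    vm.domain.0.stats.free_count \
    vm.domain.0.stats.free_severe \
    vm.domain.0.stats.free_min \
    vm.domain.0.stats.free_reserved \
    vm.domain.0.stats.free_target \
    vm.domain.0.stats.inactive_target
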
> ok, thanks Mark - based on this input, and the fact that I did manage to lock up my system, I'm going to get some metrics up on my website and share them publicly when I have time.  I'll definitely take your input into account when sharing this info.
> 
>> 
>> Also, what value were you using for:
>> 
>> vm.pageout_oom_seq
> $ sysctl vm.pageout_oom_seq
> vm.pageout_oom_seq: 120
> $

Without knowing vm.domain.0.stats.free_count it is hard to
tell, but you might try, say, sysctl vm.pageout_oom_seq=12000
in hopes of getting notably more time with
vm.domain.0.stats.free_count staying small. That may give
you more time to notice the low free RAM (if you are checking
periodically, rather than just waiting for a failure to make
it obvious).
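
If you do end up checking periodically, a simple loop is enough to
watch it, something along the lines of (interval to taste):

$ while true; do date; sysctl vm.domain.0.stats.free_count; sleep 60; done

(A vm.pageout_oom_seq=12000 line in /etc/sysctl.conf should also make
the larger setting survive a reboot.)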


===
Mark Millard
marklmi at yahoo.com