Re: ... was killed: a thread waited too long to allocate a page [actually: was killed: failed to reclaim memory problem]
Date: Thu, 01 Feb 2024 18:26:48 UTC
On Thu, 1 Feb 2024 08:30:19 -0800
Mark Millard <marklmi@yahoo.com> wrote:

> Karl Pielorz <kpielorz_lst_at_tdx.co.uk> wrote on
> Date: Thu, 01 Feb 2024 14:47:44 UTC :
>
> > --On 28 December 2023 11:38 +0200 Daniel Braniss <danny@cs.huji.ac.il>
> > wrote:
> >
> > > hi,
> > > I'm running 13.2 Stable on this particular host, which has about 200TB of
> > > zfs storage the host also has some 132Gb of memory,
> > > lately, mountd is getting killed:
> > > kernel: pid 3212 (mountd), jid 0, uid 0, was killed: a thread waited
> > > too long to allocate a page
> > >
> > > rpcinfo shows it's still there, but
> > >     service mountd restart
> > > fails.
> > >
> > > only solution is to reboot.
> > > BTW, the only 'heavy' stuff that I can see are several rsync
> > > processes.
> >
> > Hi,
> >
> > I seem to have run into something similar. I recently upgraded a 12.4 box
> > to 13.2p9. The box has 32G of RAM, and runs ZFS. We do a lot of rsync work
> > to it monthly - the first month we've done this with 13.2p9 we get a lot of
> > processes killed, all with a similar (but not identical) message, e.g.
> >
> > pid 11103 (ssh), jid 0, uid 0, was killed: failed to reclaim memory
> > pid 10972 (local-unbound), jid 0, uid 59, was killed: failed to reclaim memory
> > pid 3223 (snmpd), jid 0, uid 0, was killed: failed to reclaim memory
> > pid 3243 (mountd), jid 0, uid 0, was killed: failed to reclaim memory
> > pid 3251 (nfsd), jid 0, uid 0, was killed: failed to reclaim memory
> > pid 10996 (sshd), jid 0, uid 0, was killed: failed to reclaim memory
> > pid 3257 (sendmail), jid 0, uid 0, was killed: failed to reclaim memory
> > pid 8562 (csh), jid 0, uid 0, was killed: failed to reclaim memory
> > pid 3363 (smartd), jid 0, uid 0, was killed: failed to reclaim memory
> > pid 8558 (csh), jid 0, uid 0, was killed: failed to reclaim memory
> > pid 3179 (ntpd), jid 0, uid 0, was killed: failed to reclaim memory
> > pid 8555 (tcsh), jid 0, uid 1001, was killed: failed to reclaim memory
> > pid 3260 (sendmail), jid 0, uid 25, was killed: failed to reclaim memory
> > pid 2806 (devd), jid 0, uid 0, was killed: failed to reclaim memory
> > pid 3156 (rpcbind), jid 0, uid 0, was killed: failed to reclaim memory
> > pid 3252 (nfsd), jid 0, uid 0, was killed: failed to reclaim memory
> > pid 3377 (getty), jid 0, uid 0, was killed: failed to reclaim memory
> >
> > This 'looks' like 'out of RAM' type situation - but at the time, top showed:
> >
> > last pid: 12622;  load averages:  0.10, 0.24, 0.13
> >
> > 7 processes:  1 running, 6 sleeping
> > CPU:  0.1% user,  0.0% nice,  0.2% system,  0.0% interrupt, 99.7% idle
> > Mem: 4324K Active, 8856K Inact, 244K Laundry, 24G Wired, 648M Buf, 7430M Free
> > ARC: 20G Total, 8771M MFU, 10G MRU, 2432K Anon, 161M Header, 920M Other
> >      15G Compressed, 23G Uncompressed, 1.59:1 Ratio
> > Swap: 8192M Total, 5296K Used, 8187M Free
> >
> > Rebooting it recovers it, and it completed the rsync after the reboot -
> > which left us with:
> >
> > last pid: 12570;  load averages:  0.07, 0.14, 0.17
> > up 0+00:15:06  14:43:56
> > 26 processes:  1 running, 25 sleeping
> > CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 100% idle
> > Mem: 39M Active, 5640K Inact, 17G Wired, 42M Buf, 14G Free
> > ARC: 15G Total, 33M MFU, 15G MRU, 130K Anon, 32M Header, 138M Other
> >      14G Compressed, 15G Uncompressed, 1.03:1 Ratio
> > Swap: 8192M Total, 8192M Free
> >
> > I've not seen any bug reports along this line, in fact very little coverage
> > at all of the specific error.
> >
> > My only thought is to set a sysctl to limit ZFS ARC usage, i.e. to leave
> > more free RAM floating around the system. During the rsync it was
> > 'swapping' occasionally (few K in, few K out) - but it never ran out of
> > swap that I saw - and it certainly didn't look like a complete out of
> > memory scenario/box (which is what it felt like with everything getting
> > killed).
>
> One direction of control is . . .
>
> What do you have for ( copied from my /boot/loader.conf ):
>
> #
> # Delay when persistent low free RAM leads to
> # Out Of Memory killing of processes:
> vm.pageout_oom_seq=120
>
> The default is 12 (last I knew, anyway).
>
> The 120 figure has allowed me and others to do buildworld,
> buildkernel, and poudriere bulk runs on small arm boards
> using all cores that otherwise got "failed to reclaim
> memory" (to use the modern, improved [not misleading]
> message text). Similarly for others that had other kinds
> of contexts that got the message.
>
> (The units for the 120 are not time units: more like a
> number of (re)tries to gain at least a target amount of
> Free RAM before failure handling starts. The comment
> wording is based on a consequence of the assignment.)
>
> The 120 is not a maximum, just a figure that has proved
> useful in various contexts.
>
> But see the notes below as well.
>
> Notes:
>
> "failed to reclaim memory" can happen even with swap
> space enabled but no swap in use: sufficiently active
> pages are just not paged out to swap space, so if most
> non-wired pages are classified as active, the kills
> can start.
>
> (There are some other parameters of possible use for some
> other modern "was killed" reason texts.)
>
> Wired pages are pages that can not be swapped out, even
> if classified as inactive.
>
> Your report indicates: 24G Wired with 20G of that being
> from ARC use. This likely was after some processes had
> already been killed. So likely more was wired and less
> was free at the start of the kills.
>
> That 24G+ of wired meant that only 8GiBytes- were
> available for everything else. Avoiding that by limiting
> the ARC (tuning ZFS), or adjusting how the work load
> is spread over time, or some combination, also looks
> appropriate.
>
> I've no clue why ARC use would be significantly
> different for 12.4 vs. 13.2p9 .
>
> ===
> Mark Millard
> marklmi at yahoo.com

Possibly not related, but the ZFS codebase in FreeBSD was switched from
the legacy one (aka ZoF) to OpenZFS in 13.0.

-- 
Tomoaki AOKI    <junchoon@dec.sakura.ne.jp>
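
A minimal sketch of how the two mitigations discussed above might be
combined in /boot/loader.conf, assuming a 32G box like the one reported.
The 16 GiB ARC cap is only an example figure, not a value suggested by
any of the posters, and under OpenZFS in 13.x the same tunable is also
spelled vfs.zfs.arc.max:

# /boot/loader.conf (example values; tune for the actual workload)

# Give the page daemon more retries before "failed to reclaim memory"
# kills begin (default is 12; 120 is the figure suggested above):
vm.pageout_oom_seq=120

# Cap the ZFS ARC, here at 16 GiB, so more RAM stays free for
# processes (example value only):
vfs.zfs.arc_max=17179869184

Both should also be adjustable on a running system via sysctl(8), e.g.
sysctl vm.pageout_oom_seq=120, though an ARC that has already grown may
take some time to shrink back under a newly lowered cap.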