Re: ... was killed: a thread waited too long to allocate a page [actually: was killed: failed to reclaim memory problem]
Date: Thu, 01 Feb 2024 16:30:19 UTC
Karl Pielorz <kpielorz_lst_at_tdx.co.uk> wrote on Thu, 01 Feb 2024 14:47:44 UTC:

> --On 28 December 2023 11:38 +0200 Daniel Braniss <danny@cs.huji.ac.il> wrote:
>
> > hi,
> > I'm running 13.2 Stable on this particular host, which has about 200TB of
> > zfs storage the host also has some 132Gb of memory,
> > lately, mountd is getting killed:
> > kernel: pid 3212 (mountd), jid 0, uid 0, was killed: a thread waited
> > too long to allocate a page
> >
> > rpcinfo shows it's still there, but
> > service mountd restart
> > fails.
> >
> > only solution is to reboot.
> > BTW, the only 'heavy' stuff that I can see are several rsync
> > processes.
>
> Hi,
>
> I seem to have run into something similar. I recently upgraded a 12.4 box
> to 13.2p9. The box has 32G of RAM, and runs ZFS. We do a lot of rsync work
> to it monthly - the first month we've done this with 13.2p9 we get a lot of
> processes killed, all with a similar (but not identical) message, e.g.
>
> pid 11103 (ssh), jid 0, uid 0, was killed: failed to reclaim memory
> pid 10972 (local-unbound), jid 0, uid 59, was killed: failed to reclaim memory
> pid 3223 (snmpd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3243 (mountd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3251 (nfsd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 10996 (sshd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3257 (sendmail), jid 0, uid 0, was killed: failed to reclaim memory
> pid 8562 (csh), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3363 (smartd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 8558 (csh), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3179 (ntpd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 8555 (tcsh), jid 0, uid 1001, was killed: failed to reclaim memory
> pid 3260 (sendmail), jid 0, uid 25, was killed: failed to reclaim memory
> pid 2806 (devd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3156 (rpcbind), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3252 (nfsd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3377 (getty), jid 0, uid 0, was killed: failed to reclaim memory
>
> This 'looks' like 'out of RAM' type situation - but at the time, top showed:
>
> last pid: 12622;  load averages: 0.10, 0.24, 0.13
> 7 processes: 1 running, 6 sleeping
> CPU:  0.1% user, 0.0% nice, 0.2% system, 0.0% interrupt, 99.7% idle
> Mem: 4324K Active, 8856K Inact, 244K Laundry, 24G Wired, 648M Buf, 7430M Free
> ARC: 20G Total, 8771M MFU, 10G MRU, 2432K Anon, 161M Header, 920M Other
>      15G Compressed, 23G Uncompressed, 1.59:1 Ratio
> Swap: 8192M Total, 5296K Used, 8187M Free
>
> Rebooting it recovers it, and it completed the rsync after the reboot -
> which left us with:
>
> last pid: 12570;  load averages: 0.07, 0.14, 0.17   up 0+00:15:06  14:43:56
> 26 processes: 1 running, 25 sleeping
> CPU:  0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
> Mem: 39M Active, 5640K Inact, 17G Wired, 42M Buf, 14G Free
> ARC: 15G Total, 33M MFU, 15G MRU, 130K Anon, 32M Header, 138M Other
>      14G Compressed, 15G Uncompressed, 1.03:1 Ratio
> Swap: 8192M Total, 8192M Free
>
> I've not seen any bug reports along this line, in fact very little coverage
> at all of the specific error.
>
> My only thought is to set a sysctl to limit ZFS ARC usage, i.e. to leave
> more free RAM floating around the system.
> During the rsync it was 'swapping' occasionally (few K in, few K out) - but
> it never ran out of swap that I saw - and it certainly didn't look like a
> complete out of memory scenario/box (which is what it felt like with
> everything getting killed).

One direction of control is shown below. What do you have for the following
(copied from my /boot/loader.conf)?

#
# Delay when persistent low free RAM leads to
# Out Of Memory killing of processes:
vm.pageout_oom_seq=120

The default is 12 (last I knew, anyway). The 120 figure has allowed me and
others to do buildworld, buildkernel, and poudriere bulk runs on small arm
boards using all cores, where such runs otherwise got "failed to reclaim
memory" (to use the modern, improved [not misleading] message text). The same
has held for others whose contexts of other kinds got the message.

(The units for the 120 are not time units: it is more like a number of
(re)tries to gain at least a target amount of Free RAM before failure handling
starts. The comment wording is based on a consequence of the assignment.)

The 120 is not a maximum, just a figure that has proved useful in various
contexts. But see the notes below as well.

Notes:

"failed to reclaim memory" can happen even with swap space enabled but no swap
in use: sufficiently active pages are just not paged out to swap space, so if
most non-wired pages are classified as active, the kills can start. (There are
some other parameters of possible use for some of the other modern "was
killed" reason texts.)

Wired pages are pages that cannot be swapped out, even if classified as
inactive. Your report indicates 24G Wired, with 20G of that being from ARC
use. This was likely after some processes had already been killed, so likely
more was wired and less was free at the start of the kills. That 24G+ of wired
meant that only 8 GiBytes- were available for everything else.

Avoiding that by limiting the ARC (tuning ZFS), or adjusting how the workload
is spread over time, or some combination, also looks appropriate.

I've no clue why ARC use would be significantly different for 12.4 vs. 13.2p9.

===
Mark Millard
marklmi at yahoo.com
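A minimal sketch of how the two knobs discussed above could be combined on a
13.x box such as the 32G one reported here; the 16G ARC cap is purely an
illustrative figure, not a recommendation:

# /boot/loader.conf -- takes effect at the next boot:
#
# Delay when persistent low free RAM leads to
# Out Of Memory killing of processes (default is 12):
vm.pageout_oom_seq=120
#
# Illustrative ARC cap, keeping more RAM unwired for everything else:
vfs.zfs.arc_max="16G"

# Both can also be tried at runtime, e.g.:
#   sysctl vm.pageout_oom_seq=120
#   sysctl vfs.zfs.arc_max=17179869184

Capping the ARC trades some cache hit rate for reclaimable memory headroom;
spreading the rsync work over time, as noted above, is the alternative that
avoids that trade.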