Belated out of swap kill on rpi3 at r359216
bob prohaska
fbsd at www.zefox.net
Wed Mar 25 01:56:35 UTC 2020
[this thread needs a haircut, new comments start near line 288]
On Tue, Mar 24, 2020 at 05:21:47PM -0700, Mark Millard wrote:
> On 2020-Mar-24, at 15:46, bob prohaska <fbsd at www.zefox.net> wrote:
>
> > The logfile being discussed is at
> > http://www.zefox.net/~fbsd/rpi3/swaptests/r359216/swapscript.osupdate.log
> > for convenient reference. More replies below.
> >
> >
> > On Tue, Mar 24, 2020 at 01:21:42PM -0700, Mark Millard wrote:
> >>
> >>
> >> On 2020-Mar-24, at 11:55, bob prohaska <fbsd at www.zefox.net> wrote:
> >>
> >>> On Tue, Mar 24, 2020 at 10:09:28AM -0700, Mark Millard wrote:
> >>>>
> >>>>
> >>>> On 2020-Mar-24, at 08:57, bob prohaska <fbsd at www.zefox.net> wrote:
> >>>>
> >>>>> An attempt to buildworld on an rpi3 running r359216 stopped with an
> >>>>> OOMA kill. Sources were at 359264, loader.conf contained
> >>>>> vm.pageout_oom_seq="4096" .
> >>>>
> >>>> What of the value of vm.pfault_oom_attempts ?
> >>>>
> >>>> If vm.pfault_oom_attempts was not -1,
> >>>> what of the value of vm.pfault_oom_wait ?
> >>>>
> >>> bob at www:/usr/src % sysctl vm.pfault_oom_wait
> >>> vm.pfault_oom_wait: 10
> >>> I didn't change it, this must be the default.
> >>> Would it be useful to add something like
> >>> vm.pfault_oom_wait="20"
> >>> to /boot/loader.conf?
> >>
> >> I assume that this answer means vm.pfault_oom_attempts
> >> was not -1. But you did not list what it was.
> >>
> >
> > Sorry, the variable names are running together in my head.
> >
> > bob at www:/usr/src % sysctl vm.pfault_oom_attempts
> > vm.pfault_oom_attempts: 3
> > It's now reset to -1, hopefully it'll work better than last time 8-)
> >
> >> vm.pfault_oom_attempts=-1 is the means of preventing
> >> the vm.pfault_oom_attempts/vm.pfault_oom_wait
> >> pair from causing OOM at all. (vm.pfault_oom_wait
> >> becomes irrelevant.)
> >>
> >
> >> When vm.pfault_oom_attempts != -1 , then there are
> >> potential time-outs that overall involve:
> >>
> >> vm.pfault_oom_attempts * vm.pfault_oom_wait
> >>
> >> but there are tradeoffs in the partitioning between
> >> the two factors:
> >>
> >> # sysctl -d vm.pfault_oom_attempts
> >> vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling
> >>
> >> # sysctl -d vm.pfault_oom_wait
> >> vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler
> >>
> >> As I understand, the following cases could each have
> >> very different results depending on what the latencies
> >> are like and such:
> >>
> >> vm.pfault_oom_attempts==20 && vm.pfault_oom_wait==1
> >> vs.
> >> vm.pfault_oom_attempts==1 && vm.pfault_oom_wait==20
> >> vs.
> >> vm.pfault_oom_attempts==4 && vm.pfault_oom_wait==5
> >>
> >> As I remember, Konstantin Belousov once asked someone
> >> that was having a repeatable problem to try some
> >> alternatives that explored this but, as I remember,
> >> he got no reply to the request.
> >>
> > Konstantin wrote to both me and the list in a thread on
> > Re: panic: non-current pmap. If it's relevant please indicate.
> >
> > He also wrote to Don Lewis, in a thread on Re: spurious out of swap kills
> > but that was on a
> > " 16-thread Ryzen machine, with 64 GB of RAM and 40 GB of swap"
> > which is surely unlike my predicament 8-)
> >
> >> The person might have just used: vm.pfault_oom_attempts=-1
> >> and continued with their primary activity, for all I know.
> >>
> >> vm.pfault_oom_attempts=-1 is only recommended when one
> >> is sure that they will not run out of swap/paging space,
> >> if I understand right.
> >>
> >> For reference, for other settings:
> >>
> >> # sysctl -d vm.pageout_oom_seq
> >> vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM
> >>
> >> There is also:
> >>
> >> # sysctl -d vm.oom_pf_secs
> >> vm.oom_pf_secs:
> >>
> >> but it seems to have an empty description. May be
> >> it reports vm.pfault_oom_attempts * vm.pfault_oom_wait
> >> when vm.pfault_oom_attempts != -1 . (I've not looked.)
> >>
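For reference, a minimal sketch of how these knobs can be applied,
using only the sysctl names quoted above (the values are the ones
already discussed, not a recommendation, and the loader.conf form
assumes the pfault settings are accepted as boot-time tunables the
same way vm.pageout_oom_seq is):

# in /boot/loader.conf, takes effect at next boot;
# -1 disables the page-fault OOM path entirely:
vm.pageout_oom_seq="4096"
vm.pfault_oom_attempts="-1"

# or on the running system, keep the page-fault OOM path but allow
# roughly attempts * wait = 3 * 10 = 30 seconds of retries first:
sysctl vm.pfault_oom_attempts=3
sysctl vm.pfault_oom_wait=10
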
> >>>>> A snippet of the gstat log suggests the worst congestion in the storage I/O
> >>>>> happened at Tue Mar 24 04:52:26 PDT 2020 with an L(q) of 37, but the
> >>>>> OOMA kill happened at Tue Mar 24 04:53:04 PDT 2020, about half a minute
> >>>>> later, by which time the L(q) had dropped to one.
> >>>>>
> >>>>> Is the delay in OOMA action to be expected?
> >>>>>
> >>>>> Here's the relevant part of the log, I hope the columns display readably:
> >>>>>
> >>>>> 0/2/2/19177 mbuf clusters in use (current/cache/total/max)
> >>>>> procs memory page disks faults cpu
> >>>>> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
> >>>>> 0 0 29 1897320 47312 12851 9 4 5 13330 1669 0 0 14172 8020 3034 65 29 6
> >>>>> dT: 1.056s w: 1.000s
> >>>>> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
> >>>>> 37 367 323 5463 6.4 45 1511 76.8 0 0 0.0 91.7 mmcsd0
> >>>>> 37 367 323 5463 6.5 45 1511 76.9 0 0 0.0 93.3 mmcsd0s2
> >>>>> 34 133 111 3209 7.6 22 697 134.7 0 0 0.0 74.0 mmcsd0s2a
> >>>>> 3 235 212 2254 5.9 23 814 21.5 0 0 0.0 70.0 mmcsd0s2b
> >>>>> 34 133 111 3209 7.6 22 697 134.7 0 0 0.0 74.2 ufs/rootfs
> >>>>> Tue Mar 24 04:52:26 PDT 2020
> >>>>> Device 1K-blocks Used Avail Capacity
> >>>>> /dev/mmcsd0s2b 4404252 224484 4179768 5%
> >>>>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
> >>>>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096
> >>>>
> >>>> How bad were things back when the "indefinite wait buffer" notices
> >>>> were generated (Mar 24 04:20:50 time frame)?
> >>>>
> >>> It looks like _new_ indefinite wait messages started coming at around Mon Mar 23 23:00:28 PDT 2020:
> >>> procs memory page disks faults cpu
> >>> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
> >>> 2 0 18 1588824 197676 14108 2 1 0 14759 238 0 0 14055 9263 2668 62 32 5
> >>> dT: 1.027s w: 1.000s
> >>> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
> >>> 9 1 0 0 0.0 1 31 13451 0 0 0.0 326.9 mmcsd0
> >>> 9 1 0 0 0.0 1 31 13451 0 0 0.0 326.9 mmcsd0s2
> >>> 6 1 0 0 0.0 1 31 13451 0 0 0.0 326.9 mmcsd0s2a
> >>> 6 1 0 0 0.0 1 31 13451 0 0 0.0 326.9 ufs/rootfs
> >>> Mon Mar 23 23:00:28 PDT 2020
> >>> Device 1K-blocks Used Avail Capacity
> >>> /dev/mmcsd0s2b 4404252 245288 4158964 6%
> >>> Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 280324, size: 8192
> >>> Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 285903, size: 16384
> >>>
> >>> Admittedly, 300% busy looks pretty bad. Still the machine didn't quit.....
> >>> On the next sample %busy went down to 18% for swap, less for all else.
> >>
> >> As I understand, the ms/w value 13451 means that there was
> >> for a time (a mean of ?) somewhat under 14 seconds from a
> >> write being queued to it being completed.
> >>
> >> If a read could be blocked for such time frames by such
> >> writes, it suggests that vm.pfault_oom_wait might need to
> >> be larger than 13 if vm.pfault_oom_attempts=-1 is not in
> >> use.
> >>
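(If the product reading above is right, the settings in force at the
time, vm.pfault_oom_attempts=3 and vm.pfault_oom_wait=10, give the
fault handler roughly 3 * 10 = 30 seconds of failed retries before it
declares OOM, so a single ~13 second write stall should not trip it by
itself, though a few back-to-back stalls of that size could.)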
> > Ok, I'm starting to get it. On this machine it's 10.
> >
> > But the 13 second delay appeared at Mon Mar 23 23:00:28 PDT 2020.
> > The kill occurred around Tue Mar 24 04:53:08 PDT 2020
> > under what look like much more benign circumstances.
>
> I had intended to ask about the text next-to/just-above,
> not much earlier [thus, the "(Mar 24 04:20:50 time frame)"].
> So I was asking about:
>
> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096
>
> and did not originally notice that you replied with
> information for "Mon Mar 23 23:00:28 PDT 2020".
>
> Other than showing examples of what the delays can be,
> the "Mar 24 04:20:50" is of more direct relevance for
> the OOM kill that happened.
>
> I'll show the text around/before 04:20:50 later in this
> note. It proves somewhat interesting.
>
> >> Or vm.pfault_oom_attempts=-1 could be used to avoid large
> >> time frames from directly leading to OOM activity.
> >>
> > Understood. I simply forgot to restore the setting after
> > the initial troubles with it.
> >
> >>>>
> >>>> Below I show code changes to be more explicit in the
> >>>> output messaging about what contributes to initiating
> >>>> OOM kills, without needing boot verbose or the like.
> >>>> There are also some messages from Mark J.'s old code
> >>>> for reporting on things related to initiating OOM
> >>>> kills, or some minor variations of them.
> >>>>
> >>>
> >>>> You may want to try such changes to provide more
> >>>> context for your OOM kills when they happen. Below
> >>>> the 4 reasons reported are (showing the most
> >>>> detailed of the related messages from different
> >>>> stages):
> >>>>
> >>>> swp_pager_meta_build: swap blk uma zone exhausted
> >>>>
> >>>> swp_pager_meta_build: swap pctrie uma zone exhausted
> >>>>
> >>>> vm_fault_allocate: proc %d (%s) failed to alloc page on fault, starting OOM
> >>>>
> >>>> vm_pageout_mightbe_oom: kill context: v_free_count: %u, v_inactive_count: %u
> >>>>
> >>>>
> >>>
> >>> Could inquiries instead be added to the logging script?
> >>> Right now it's
> >>> #!/bin/sh
> >>> while true
> >>> do vmstat ; gstat -abd -I 1s ; date ; swapinfo ; tail -n 2 /var/log/messages ; netstat -m | grep "mbuf clusters"
> >>> done
> >>>
> >>> I'd much rather tamper with the logging script than the kernel 8-)
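A sketch of one way to extend the logging script without touching the
kernel: record the OOM-related sysctl values and a timestamp on every
pass, so the settings in force are captured alongside each sample
(same commands as above, just reordered, and untested here):

#!/bin/sh
# log vm/disk state plus the OOM-related tunables once per pass
while true
do
    date
    sysctl vm.pageout_oom_seq vm.pfault_oom_attempts vm.pfault_oom_wait
    vmstat
    gstat -abd -I 1s
    swapinfo
    tail -n 2 /var/log/messages
    netstat -m | grep "mbuf clusters"
done
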
> >>
> >> Unfortunately, the reason I made the kernel messaging
> >> changes is that, as far as I know, the normal kernel
> >> simply does not publish the information anywhere when
> >> the internal criteria leads to OOM activity: No
> >> inquiry available without kernel changes.
> >
> > If all else fails I'll try to apply your patches to the kernel
> > and recompile.
>
> Up to you, of course.
>
>
> The text before "Mar 24 04:20:50" messages, with notes
> and blank lines mixed in to group things:
>
> . . .
> Tue Mar 24 04:20:38 PDT 2020
> Device 1K-blocks Used Avail Capacity
> /dev/mmcsd0s2b 4404252 403784 4000468 9%
> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
> 3/773/776/19177 mbuf clusters in use (current/cache/total/max)
>
> procs memory page disks faults cpu
> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
> 0 0 30 2034696 54180 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
> dT: 1.031s w: 1.000s
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
> 18 1 0 0 0.0 1 31 26686 0 0 0.0 287.1 mmcsd0
> 18 1 0 0 0.0 1 31 26686 0 0 0.0 287.1 mmcsd0s2
> 10 1 0 0 0.0 1 31 26686 0 0 0.0 287.1 mmcsd0s2a
> 10 1 0 0 0.0 1 31 26686 0 0 0.0 287.1 ufs/rootfs
> Tue Mar 24 04:20:40 PDT 2020
> Device 1K-blocks Used Avail Capacity
> /dev/mmcsd0s2b 4404252 403784 4000468 9%
> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
> 6/770/776/19177 mbuf clusters in use (current/cache/total/max)
>
> Note that the above has 26sec+ ms/w figures. No reads.
> No mmcsd0s2b.
>
>
> procs memory page disks faults cpu
> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
> 0 0 30 2034696 54252 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
> dT: 1.064s w: 1.000s
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
> Tue Mar 24 04:20:42 PDT 2020
> Device 1K-blocks Used Avail Capacity
> /dev/mmcsd0s2b 4404252 403784 4000468 9%
> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
> 9/1023/1032/19177 mbuf clusters in use (current/cache/total/max)
>
> For the above, note the lack of any lines with ms/r or ms/w figures,
> just the title rows show.
>
The same thing happens when gstat -a is run on an idle machine.
I thought it was normal, since the CPU had all the data it needed.
>
> procs memory page disks faults cpu
> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
> 0 0 30 2034696 53668 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
> dT: 1.046s w: 1.000s
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
> 17 1 0 0 0.0 1 31 30042 0 0 0.0 320.7 mmcsd0
> 17 1 0 0 0.0 1 31 30042 0 0 0.0 320.7 mmcsd0s2
> 9 1 0 0 0.0 1 31 30042 0 0 0.0 320.7 mmcsd0s2a
> 9 1 0 0 0.0 1 31 30042 0 0 0.0 320.7 ufs/rootfs
> Tue Mar 24 04:20:43 PDT 2020
> Device 1K-blocks Used Avail Capacity
> /dev/mmcsd0s2b 4404252 403784 4000468 9%
> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
> 9/1023/1032/19177 mbuf clusters in use (current/cache/total/max)
>
> Note that the above has 30sec+ ms/w figures. No reads.
>
> Also note that the L(q) figures dropped by only 1 over
> the about 04:20:40 to 04:20:43 interval (18->17 and
> 10->9). (I only take these deltas as suggestive. They
> might be accidental near matches.) No mmcsd0s2b.
>
>
> procs memory page disks faults cpu
> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
> 0 0 30 2034696 53624 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
> dT: 1.003s w: 1.000s
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
> Tue Mar 24 04:20:45 PDT 2020
> Device 1K-blocks Used Avail Capacity
> /dev/mmcsd0s2b 4404252 403784 4000468 9%
> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
> 12/1020/1032/19177 mbuf clusters in use (current/cache/total/max)
>
> For the above, note the lack of any lines with ms/r or ms/w figures,
> just the title rows show.
>
It's tempting, but maybe not correct, to think the I/O caught up.
>
> procs memory page disks faults cpu
> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
> 0 0 30 2034696 53600 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
> dT: 1.063s w: 1.000s
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
> Tue Mar 24 04:20:47 PDT 2020
> Device 1K-blocks Used Avail Capacity
> /dev/mmcsd0s2b 4404252 403784 4000468 9%
> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
> 12/1020/1032/19177 mbuf clusters in use (current/cache/total/max)
>
> For the above, note the lack of any lines with ms/r or ms/w figures,
> just the title rows show.
>
Can't that be interpreted as gstat -a having nothing to report?
Just _why_ it might have nothing to report never crossed my mind....
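(For what it's worth, gstat's -a flag only displays providers that
were at least slightly busy during the sample, so an empty table most
likely means essentially no I/O completed in that second; dropping -a
in the script, e.g.

gstat -bd -I 1s

would keep the idle rows visible and make "truly idle" easier to tell
apart from "rows suppressed".)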
>
>
> procs memory page disks faults cpu
> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
> 0 0 30 2034696 53592 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
> dT: 1.006s w: 1.000s
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
> Tue Mar 24 04:20:48 PDT 2020
> Device 1K-blocks Used Avail Capacity
> /dev/mmcsd0s2b 4404252 403784 4000468 9%
> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
> 15/1017/1032/19177 mbuf clusters in use (current/cache/total/max)
>
> For the above, note the lack of any lines with ms/r or ms/w figures,
> just the title rows show.
>
>
>
> procs memory page disks faults cpu
> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
> 0 0 30 2034696 53672 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
> dT: 1.002s w: 1.000s
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
> Tue Mar 24 04:20:50 PDT 2020
> Device 1K-blocks Used Avail Capacity
> /dev/mmcsd0s2b 4404252 403784 4000468 9%
> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
> 15/1017/1032/19177 mbuf clusters in use (current/cache/total/max)
>
> For the above, note the lack of any lines with ms/r or ms/w figures,
> just the title rows show.
>
>
>
> procs memory page disks faults cpu
> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
> 2 0 30 2045028 54160 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
> dT: 1.001s w: 1.000s
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
> 2 329 236 3839 13.0 93 1262 18.6 0 0 0.0 99.4 mmcsd0
> 2 329 236 3839 13.2 93 1262 18.7 0 0 0.0 101.1 mmcsd0s2
> 1 153 152 2816 12.5 1 32 366.5 0 0 0.0 95.2 mmcsd0s2a
> 1 176 84 1023 14.5 92 1230 14.9 0 0 0.0 66.8 mmcsd0s2b
> 1 153 152 2816 12.5 1 32 366.6 0 0 0.0 95.3 ufs/rootfs
> Tue Mar 24 04:20:51 PDT 2020
> Device 1K-blocks Used Avail Capacity
> /dev/mmcsd0s2b 4404252 296976 4107276 7%
> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096
> 0/1032/1032/19177 mbuf clusters in use (current/cache/total/max)
>
> For the above, the lines with ms/r and ms/w figures are back,
> but the indefinite wait buffer messages for "Mar 24 04:20:50"
> happened too, before the "Tue Mar 24 04:20:51 PDT 2020" above. I'd
> guess they refer to events from before the interval the gstat output covers.
>
> Also there are lots of reads and some writes in the above
> gstat output. mmcsd0s2b is again showing as well.
>
> Note: It looks like ufs/rootfs might be a label
> identifying mmcsd0s2a, so they track up to slight
> time-of-information differences. I'm guessing that
> mmcsd0s2b is just a swap/paging partition.
>
That's correct.
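(For anyone reproducing this, the label-to-partition mapping can be
confirmed with glabel status, which should list ufs/rootfs against
mmcsd0s2a on this setup.)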
> So it looks like activity for ufs/rootfs may have
> blocked/delayed activity for paging/swapping
> (mmcsd0s2b), at least for a time, even if it is
> not directly the cause of the specific OOM
> activity.
>
How would one distinguish slow swap i/o from slow
filesystem i/o?
Would moving swap to a USB device avoid blocking
by writes to microSD? That's not hard to try, but
it puts a whole new kettle of fish on the fire.
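One rough way to separate the two, given that ufs/rootfs (mmcsd0s2a)
carries only the filesystem and mmcsd0s2b only swap: compare ms/r and
ms/w for those two partitions within the same gstat sample. As for the
USB experiment, something along these lines should move swap without a
reinstall (the da0 partition name is hypothetical):

swapoff /dev/mmcsd0s2b    # may fail if the pages in use don't fit back in RAM
swapon /dev/da0s1b        # hypothetical swap partition on the USB drive
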
It should be added that the original buildworld has been restarted with
vm.pfault_oom_attempts=-1 and so far, though it still generates the
"indefinite wait" warnings, the ms/w times look better. It's not done yet,
though. How much overhead comes with OOMA?
Many thanks for your insights,
bob prohaska