Re: swap_pager: cannot allocate bio
- Reply: Chris Ross : "Re: swap_pager: cannot allocate bio"
- Reply: Mark Johnston : "Re: swap_pager: cannot allocate bio"
- In reply to: Mark Johnston : "Re: swap_pager: cannot allocate bio"
Date: Mon, 15 Nov 2021 15:08:29 UTC
On 15/11/2021 16:50, Mark Johnston wrote:
> On Mon, Nov 15, 2021 at 04:20:26PM +0200, Andriy Gapon wrote:
>> On 15/11/2021 05:26, Chris Ross wrote:
>>> A procstat -kka output is available (208kb of text, 1441 lines) at
>>> https://pastebin.com/SvDcvRvb
>>
>>    67 100542 pagedaemon       dom0             mi_switch+0xc1
>> _cv_wait+0xf2 arc_wait_for_eviction+0x1df arc_lowmem+0xca
>> vm_pageout_worker+0x3c4 vm_pageout+0x1d7 fork_exit+0x8a fork_trampoline+0xe
>>
>> I have always been of the opinion that waiting for ARC reclaim in
>> arc_lowmem() was wrong.  This is an example of why.
>>
>>> A top command run over ssh completed and shows:
>>>
>>> last pid: 91551;  load averages: 0.00, 0.02, 0.30  up 2+00:19:33  22:23:15
>>> 40 processes: 1 running, 38 sleeping, 1 zombie
>>> CPU: 3.9% user, 0.0% nice, 0.9% system, 0.0% interrupt, 95.2% idle
>>> Mem: 58G Active, 210M Inact, 1989M Laundry, 52G Wired, 1427M Buf, 12G Free
>>
>> To me it looks like there is still plenty of free memory.
>>
>> I am not sure why vm_wait_domain (called by vm_page_alloc_noobj_domain)
>> is not waking up.
>
> It's a deadlock: the page daemon is sleeping on the arc evict thread,
> and the arc evict thread is waiting for memory:

My point was that waiting for free memory was not strictly needed yet,
given 12G free, but that's kind of obvious.

>  2561 100722 zfskern          arc_evict
> mi_switch+0xc1 _sleep+0x1cb vm_wait_doms+0xe2 vm_wait_domain+0x51
> vm_page_alloc_noobj_domain+0x184 uma_small_alloc+0x62 keg_alloc_slab+0xb0
> zone_import+0xee zone_alloc_item+0x6f arc_evict_state+0x81 arc_evict_cb+0x483
> zthr_procedure+0xba fork_exit+0x8a fork_trampoline+0xe
>
> I presume this is from the marker allocations in arc_evict_state().
>
> The second problem is that UMA is refusing to try to allocate from the
> "wrong" NUMA domain, but that policy seems overly strict.  Fixing that
> alone would make the problem harder to hit, but I think it wouldn't
> solve it completely.

Yes, I propose to remove the wait for ARC evictions from arc_lowmem().

Another thing that may help a bit is greater "slack" between the
threshold where the page daemon starts paging out and the threshold
where memory allocations start to wait (via vm_wait_domain).

Also, I think that for a long time we had a problem (I am not sure
whether it is still present) where allocations succeeded without
waiting until free memory dropped below a certain threshold M, but once
a thread started waiting in vm_wait it would not be woken up until free
memory rose above another threshold N.  The problem was that N >> M: a
lot of memory had to be freed (and not grabbed by other threads) before
the waiting thread would be woken up.

>> Perhaps this is some sort of NUMA-related issue where one memory domain
>> is exhausted while other(s) still have a lot of memory.
>> Or maybe it's something else, but it must be some sort of bug.
>>
>>> ARC: 48G Total, 10G MFU, 38G MRU, 128K Anon, 106M Header, 23M Other
>>>      46G Compressed, 46G Uncompressed, 1.00:1 Ratio
>>> Swap: 425G Total, 3487M Used, 422G Free

--
Andriy Gapon
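
To make the circular wait concrete, here is a minimal userland
reduction of it (a sketch only: pthreads instead of kernel primitives,
and all names are hypothetical, not the actual kernel code).  One
thread plays the page daemon, blocking arc_lowmem()-style until an
eviction completes; the other plays the evict thread, blocking on a
marker-like allocation that can only succeed after pages are freed:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  evict_done = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  pages_freed = PTHREAD_COND_INITIALIZER;
    static int free_pages;  /* 0: the domain is exhausted */
    static int evicted;

    /* Plays the page daemon: wait until the ARC has evicted something. */
    static void *pagedaemon(void *arg)
    {
        pthread_mutex_lock(&lock);
        while (!evicted)
            pthread_cond_wait(&evict_done, &lock);  /* sleeps forever */
        pthread_mutex_unlock(&lock);
        return (NULL);
    }

    /* Plays arc_evict: the marker allocation waits for free memory. */
    static void *arc_evictor(void *arg)
    {
        pthread_mutex_lock(&lock);
        while (free_pages == 0)
            pthread_cond_wait(&pages_freed, &lock); /* sleeps forever */
        /* Never reached: this is where eviction would happen. */
        evicted = 1;
        pthread_cond_signal(&evict_done);
        pthread_mutex_unlock(&lock);
        return (NULL);
    }

    int main(void)
    {
        pthread_t a, b;

        pthread_create(&a, NULL, pagedaemon, NULL);
        pthread_create(&b, NULL, arc_evictor, NULL);
        pthread_join(a, NULL);  /* never returns: circular wait */
        return (0);
    }

Neither condition variable is ever signalled, so both threads sleep
forever, which is exactly the pair of stacks quoted above.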
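
As for the overly strict domain policy, a more forgiving allocator
would try the other domains before going to sleep.  A toy sketch of
that idea (try_alloc_domain() and everything else here is invented for
illustration; these are not UMA interfaces):

    #include <stdlib.h>

    #define NDOM 2
    static long dom_free[NDOM] = { 0, 100000 };  /* domain 0 exhausted */

    /* Hypothetical helper: hand out a "page" from one domain. */
    static void *try_alloc_domain(int dom)
    {
        if (dom_free[dom] == 0)
            return (NULL);
        dom_free[dom]--;
        return (malloc(4096));
    }

    /* Prefer one domain, but fall back to the rest before failing. */
    static void *alloc_page_fallback(int preferred)
    {
        for (int i = 0; i < NDOM; i++) {
            void *p = try_alloc_domain((preferred + i) % NDOM);
            if (p != NULL)
                return (p);
        }
        return (NULL);  /* all domains empty: only now wait or fail */
    }

    int main(void)
    {
        /* A strict policy would sleep here; the fallback succeeds
         * from domain 1 even though domain 0 is empty. */
        return (alloc_page_fallback(0) != NULL ? 0 : 1);
    }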
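
And to make the M/N problem concrete, here is a toy model of the
wait/wakeup logic (the thresholds and names are invented).  Allocators
start sleeping once free memory drops below min_threshold (M), but the
free path only signals them after free memory climbs above
wake_threshold (N), so with N >> M a lot of pages must accumulate,
untouched by other threads, before any waiter runs again:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  wait_cv = PTHREAD_COND_INITIALIZER;
    static long free_pages;
    static const long min_threshold = 1000;    /* M: sleep below this */
    static const long wake_threshold = 16000;  /* N: wake above this */

    /* An allocator that found free memory below M. */
    static void vm_wait_sketch(void)
    {
        pthread_mutex_lock(&lock);
        while (free_pages < min_threshold)
            pthread_cond_wait(&wait_cv, &lock);
        pthread_mutex_unlock(&lock);
    }

    /* The page-freeing path. */
    static void vm_free_sketch(long n)
    {
        pthread_mutex_lock(&lock);
        free_pages += n;
        /*
         * The problem: waiters are only signalled once free_pages
         * exceeds N, so N - M pages must pile up (and not be grabbed
         * by non-sleeping threads) before anyone blocked in
         * vm_wait_sketch() is released.
         */
        if (free_pages >= wake_threshold)
            pthread_cond_broadcast(&wait_cv);
        pthread_mutex_unlock(&lock);
    }

    int main(void)
    {
        vm_free_sketch(wake_threshold);  /* cross N: waiters would wake */
        vm_wait_sketch();                /* returns immediately now */
        return (0);
    }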