Date: Sat, 20 Nov 2021 13:09:46 -0500
From: Mark Johnston
To: Andriy Gapon
Cc: Chris Ross, freebsd-fs
Subject: Re: swap_pager: cannot allocate bio
List-Archive: https://lists.freebsd.org/archives/freebsd-fs

On Mon, Nov 15, 2021 at 05:08:29PM +0200, Andriy Gapon wrote:
> On 15/11/2021 16:50, Mark Johnston wrote:
> > On Mon, Nov 15, 2021 at 04:20:26PM +0200, Andriy Gapon wrote:
> >> On 15/11/2021 05:26, Chris Ross wrote:
> >>> A procstat -kka output is available (208kb of text, 1441 lines) at
> >>> https://pastebin.com/SvDcvRvb
> >>
> >>    67 100542 pagedaemon dom0 mi_switch+0xc1
> >> _cv_wait+0xf2 arc_wait_for_eviction+0x1df arc_lowmem+0xca
> >> vm_pageout_worker+0x3c4 vm_pageout+0x1d7 fork_exit+0x8a fork_trampoline+0xe
> >>
> >> I was always of an opinion that waiting for the ARC reclaim in arc_lowmem was
> >> wrong.  This shows an example of why it is so.
> >>
> >>> An ssh of a top command completed and shows:
> >>>
> >>> last pid: 91551;  load averages: 0.00, 0.02, 0.30  up 2+00:19:33  22:23:15
> >>> 40 processes: 1 running, 38 sleeping, 1 zombie
> >>> CPU: 3.9% user, 0.0% nice, 0.9% system, 0.0% interrupt, 95.2% idle
> >>> Mem: 58G Active, 210M Inact, 1989M Laundry, 52G Wired, 1427M Buf, 12G Free
> >>
> >> To me it looks like there is still plenty of free memory.
> >>
> >> I am not sure why vm_wait_domain (called by vm_page_alloc_noobj_domain) is not
> >> waking up.
> >
> > It's a deadlock: the page daemon is sleeping on the arc evict thread,
> > and the arc evict thread is waiting for memory:
>
> My point was that waiting for the free memory was not strictly needed yet given
> 12G free, but that's kind of obvious.
>
> >  2561 100722 zfskern arc_evict
> > mi_switch+0xc1 _sleep+0x1cb vm_wait_doms+0xe2 vm_wait_domain+0x51
> > vm_page_alloc_noobj_domain+0x184 uma_small_alloc+0x62 keg_alloc_slab+0xb0
> > zone_import+0xee zone_alloc_item+0x6f arc_evict_state+0x81 arc_evict_cb+0x483
> > zthr_procedure+0xba fork_exit+0x8a fork_trampoline+0xe
> >
> > I presume this is from the marker allocations in arc_evict_state().
> >
> > The second problem is that UMA is refusing to try to allocate from the
> > "wrong" NUMA domain, but that policy seems overly strict.  Fixing that
> > alone would make the problem harder to hit, but I think it wouldn't
> > solve it completely.
>
> Yes, I propose to remove the wait for ARC evictions from arc_lowmem().

The problem with this is that the page daemon won't account for ARC
evictions when reclaiming memory from the page queues.  We need to also
generalize the vm_lowmem eventhandler so that either
- caches can promise to free N pages and then shrink themselves
  asynchronously, or
- the page daemon can ask the cache to free N pages, where N is derived
  from the ratio of the cache size to the total amount of RAM, and the
  cache can be shrunk asynchronously.
I'm not sure how easy it is to get this information from the ARC.

> Another thing that may help a bit is having a greater "slack" between a
> threshold where the page daemon starts paging out and a threshold where memory
> allocations start to wait (via vm_wait_domain).
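
To make the second option in the list above concrete, here is a minimal
userland C sketch of how N could be derived.  It is only an illustration
under stated assumptions: lowmem_cache_share() and its parameters are
hypothetical names, not an existing kernel interface, and the simple
proportional formula is one possible reading of "the ratio of the cache
size to the total amount of RAM".

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a real kernel function): given the page
 * daemon's current shortage in pages, return how many pages a cache
 * would be asked to free, in proportion to the cache's share of RAM.
 *
 * Assumes page counts are small enough that shortage * cache_pages
 * does not overflow 64 bits.
 */
static uint64_t
lowmem_cache_share(uint64_t shortage, uint64_t cache_pages,
    uint64_t total_pages)
{
	/* A cache cannot be larger than the RAM that backs it. */
	assert(cache_pages <= total_pages);

	if (total_pages == 0)
		return (0);

	/* N = shortage * (cache size / total RAM), rounded down. */
	return (shortage * cache_pages / total_pages);
}
```

For example, with a shortage of 1000 pages and a cache holding 50 out of
125 units of RAM, lowmem_cache_share(1000, 50, 125) asks the cache for
400 pages and leaves the rest to the page queues.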
>
> Also, I think that for a long time we had a problem (but not sure if it's still
> present) where allocations succeeded without waiting until the free memory went
> below certain threshold M, but once a thread started waiting in vm_wait it would
> not be woken up until the free memory went above another threshold N.  And the
> problem was that N >> M.  In other words, a lot of memory had to be freed (and
> not grabbed by other threads) before the waiting thread would be woken up.

This is perhaps still an issue, though maybe not as noticeable now that
the page daemon runs more frequently and will set its target based on
the recent history of the current page shortage, rather than using
static high/low watermark thresholds.
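
The M/N behaviour described in the quoted text can be modelled in a few
lines of userland C.  This is a toy sketch, not kernel code: struct
waiter_model and both functions are hypothetical names, and the only
things taken from the discussion are the two thresholds, with a thread
going to sleep once free memory drops below M and staying asleep until
free memory rises above N.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one allocating thread; not a kernel structure. */
struct waiter_model {
	uint64_t wait_thresh;	/* M: allocations start to sleep below this */
	uint64_t wake_thresh;	/* N: a sleeper is woken only above this */
	bool	 sleeping;
};

static void
waiter_init(struct waiter_model *w, uint64_t m, uint64_t n)
{
	assert(m <= n);
	w->wait_thresh = m;
	w->wake_thresh = n;
	w->sleeping = false;
}

/*
 * Feed the model a new free-page count.  Returns true if the thread is
 * asleep after the update.
 */
static bool
waiter_update(struct waiter_model *w, uint64_t free_pages)
{
	if (!w->sleeping) {
		/* Allocations succeed until free memory drops below M. */
		if (free_pages < w->wait_thresh)
			w->sleeping = true;
	} else if (free_pages > w->wake_thresh) {
		/* Once asleep, only crossing N wakes the thread up. */
		w->sleeping = false;
	}
	return (w->sleeping);
}
```

With M = 100 and N = 1000, a thread that went to sleep at 90 free pages
is still asleep at 500 free pages, even though a new allocation at that
level would not have had to wait at all.  The larger N is relative to M,
the wider that window becomes.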