From: Alan Somers <asomers@gmail.com>
Date: Wed, 19 May 2021 14:28:51 -0600
Subject: Re: The pagedaemon evicts ARC before scanning the inactive page list
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Mark Johnston, FreeBSD Hackers <freebsd-hackers@freebsd.org>

On Tue, May 18, 2021 at 10:17 PM Konstantin Belousov <kostikbel@gmail.com> wrote:
> On Tue, May 18, 2021 at 09:55:25PM -0600, Alan Somers wrote:
> > On Tue, May 18, 2021 at 9:25 PM Konstantin Belousov <kostikbel@gmail.com> wrote:
> > > Is your machine ZFS-only?  If yes, then the typical sources of
> > > inactive memory can be of two kinds:
> >
> > No, there is also FUSE.  But there is typically < 1GB of Buf memory,
> > so I didn't mention it.
>
> As Mark mentioned, buffers use the page cache as a second-level cache.
> More precisely, there is a relatively limited number of buffers in the
> system, and they are just headers that describe a set of pages.  When a
> buffer is recycled, its pages are put on the inactive queue.
>
> This is why I asked whether your machine is ZFS-only: I/O on
> bufcache-using filesystems typically adds to the inactive queue.
>
> > > - anonymous memory that apps allocate with facilities like malloc(3).
> > >   If inactive is shrinkable then it is probably not this, because
> > >   dirty pages from anon objects must go through the laundry->swap
> > >   route to get evicted, and you did not mention swapping.
> >
> > No, there's no appreciable amount of swapping going on.  Nor is the
> > laundry list typically more than a few hundred MB.
> >
> > > - double-copy pages cached in the v_objects of ZFS vnodes, clean or
> > >   dirty.  If unmapped, these are mostly a waste.  Even if mapped,
> > >   the source of truth for the data is the ARC, AFAIU, so they can
> > >   be dropped as well, since the inactive state means that their
> > >   content is not hot.
> >
> > So if a process mmap()'s a file on ZFS and reads from it but never
> > writes to it, will those pages show up as inactive?
>
> It depends on the workload, and it does not matter much whether the
> pages are clean or dirty.  Right after mapping, or under an intense
> access pattern, they sit on the active list.  If not touched for long
> enough, or if cycled through the buffer cache for I/O (but ZFS pages
> do not go through the buffer cache), they are moved to inactive.
>
> > > You can try to inspect the most outstanding objects adding to the
> > > inactive queue with 'vmstat -o' to see where most of the inactive
> > > pages come from.
> >
> > Wow, that did it!  About 99% of the inactive pages come from just a
> > few vnodes which are used by the FUSE servers.  But I also see a few
> > large entries like
> >   1105308 333933 771375   1   0 WB  df
> > what does that signify?
>
> These are anonymous memory.
>
> > > If indeed they are double-copy, then perhaps ZFS could react
> > > somewhat differently even to the current primitive vm_lowmem
> > > signal.  First, it could do a pass over its vnodes and
> > > - free clean unmapped pages
> > > - if some targets are not met after that, launder dirty pages,
> > >   then return to freeing clean unmapped pages
> > > all of that before ever touching its own cache (the ARC).

Follow-up:
All of the big inactive-memory consumers were files on FUSE file systems
that were being exported as CTL LUNs.  ZFS files exported by CTL do not
use any resident or inactive memory.  I didn't test UFS.  Curiously,
removing the LUN does not free the memory, but shutting down the FUSE
daemon does.  A valid workaround is to set the vfs.fusefs.data_cache_mode
sysctl to 0; that prevents the kernel from caching any data from the FUSE
file system (rough commands are in the P.S. below).  I've tested this on
both FreeBSD 12.2 and 13.0.  Should the kernel do a better job of
reclaiming inactive memory before the ARC?  Yes, but in my case it's
better not to create so much inactive memory in the first place.  Thanks
for everybody's help, especially kib's tip about "vmstat -o".
-Alan
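
P.S.  For anyone who wants to check their own system, here is roughly the
procedure, sketched from memory rather than copied from a terminal.  I'm
assuming the inactive count is the third field of the vmstat -o output,
as in the entry quoted above; double-check against vmstat(8) on your
release before trusting the sort key.

    # find the VM objects holding the most inactive pages
    vmstat -o | sort -rn -k3 | head

    # the workaround: disable data caching for all FUSE file systems
    # (tested on FreeBSD 12.2 and 13.0)
    sysctl vfs.fusefs.data_cache_mode=0

    # make it persist across reboots
    echo 'vfs.fusefs.data_cache_mode=0' >> /etc/sysctl.conf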
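
P.P.S.  For the archives: the vm_lowmem signal kib refers to is delivered
through EVENTHANDLER(9), and ZFS's ARC already hooks it today (which is
why the ARC gets evicted on memory pressure).  Below is a minimal,
untested sketch of a kernel module that subscribes to the same event,
just to show the shape of the interface.  The "vm_lowmem_demo" names are
made up; the vm_lowmem event and the EVENTHANDLER_REGISTER/DEREGISTER
macros are the real KPI, but I haven't compiled this.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/module.h>
#include <sys/eventhandler.h>

static eventhandler_tag demo_lowmem_tag;

/* Called by the VM system when it is short of pages. */
static void
demo_lowmem(void *arg __unused, int flags)
{
	printf("vm_lowmem fired, flags=%d\n", flags);
	/* A real consumer (like ZFS's arc_lowmem()) would free memory here. */
}

static int
demo_modevent(module_t mod __unused, int type, void *data __unused)
{
	switch (type) {
	case MOD_LOAD:
		demo_lowmem_tag = EVENTHANDLER_REGISTER(vm_lowmem,
		    demo_lowmem, NULL, EVENTHANDLER_PRI_FIRST);
		return (0);
	case MOD_UNLOAD:
		EVENTHANDLER_DEREGISTER(vm_lowmem, demo_lowmem_tag);
		return (0);
	default:
		return (EOPNOTSUPP);
	}
}

static moduledata_t demo_mod = {
	"vm_lowmem_demo", demo_modevent, NULL
};
DECLARE_MODULE(vm_lowmem_demo, demo_mod, SI_SUB_DRIVERS, SI_ORDER_MIDDLE);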