[Bug 280671] Memory leak on FreeBSD 13.3 and 14.1

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 07 Aug 2024 15:20:25 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=280671

            Bug ID: 280671
           Summary: Memory leak on FreeBSD 13.3 and 14.1
           Product: Base System
           Version: 14.1-RELEASE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: sre@truespeed.com

Created attachment 252589
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=252589&action=edit
chart of memory usage

Good afternoon,

We recently upgraded the operating system on one of our servers from FreeBSD
12.4 to 13.3.
This server uses the GENERIC kernel and a mirrored ZFS pool, and hosts a few
jails on FreeBSD 12.0 that run standard applications (Java-based web
services, PostgreSQL, RabbitMQ).
The server has 96GB of RAM and was not experiencing memory shortages prior to
upgrading.

We followed the standard upgrade process described here:
https://docs.freebsd.org/en/books/handbook/cutting-edge/#freebsdupdate-upgrade
We then upgraded our packages and our zpools. We have not upgraded our jails.

After upgrading, we began to experience what appears to be a memory leak on
the server. Over time, Inactive memory grows and then moves gigabytes at a time
into Laundry, which is never cleaned. Eventually the server runs out of Free
memory and begins thrashing. At this point we lose access to the server, and
the services it is running become unresponsive. We resolve this by power
cycling the server, and it returns to normal on reboot.

We are currently rebooting the server every few days before it enters the
thrashing state, but this is not a feasible long-term solution, and we believe
a memory leak is causing this situation.

As part of debugging the memory issue, we have tried to recover memory by
shutting down existing jails (shown on the attached chart as a large dip in
memory usage around 1am), but the memory is rapidly consumed again. Also, when
the server was close to entering the thrashing state, we shut down every jail
and service (except for a few critical ones, such as sshd) to see how much
memory was being “lost”. It was approximately 39GB, with 9GB used by ARC. The
memory summary at that point was:
Mem: 63M Active, 8656M Inact, 18G Laundry, 17G Wired, 50G Free
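
For anyone cross-checking the figures, the summary line above can be converted
to a common unit and totalled, which makes it easier to compare against the
96GB installed. This is only an illustrative sketch: the function name
parse_mem_summary is hypothetical, and it assumes the M and G suffixes are
binary units (MiB and GiB).

```python
import re

# Multipliers expressed in MiB. Assumption: K/M/G in the summary line
# mean KiB/MiB/GiB.
_UNIT_MIB = {"K": 1 / 1024, "M": 1, "G": 1024}


def parse_mem_summary(line):
    """Return {category: size in MiB} for a 'Mem: 63M Active, ...' line."""
    sizes = {}
    for value, unit, name in re.findall(r"(\d+)([KMG])\s+(\w+)", line):
        sizes[name] = int(value) * _UNIT_MIB[unit]
    return sizes


summary = "Mem: 63M Active, 8656M Inact, 18G Laundry, 17G Wired, 50G Free"
sizes = parse_mem_summary(summary)
total_mib = sum(sizes.values())
not_free_mib = total_mib - sizes["Free"]

for name, mib in sizes.items():
    print(f"{name:8s} {mib / 1024:6.2f} GiB")
print(f"Total    {total_mib / 1024:6.2f} GiB")
print(f"Not free {not_free_mib / 1024:6.2f} GiB")
```

With the rounded values in that line, the categories add up to roughly
93.5 GiB, of which about 43.5 GiB is not Free.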

We have also limited ARC usage with the vfs.zfs.arc_max sysctl, but that
hasn’t had any meaningful impact.
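
One way to confirm that the ARC limit is actually being honoured is to compare
the current ARC size with the configured limit. The sketch below is a minimal
example under stated assumptions: it assumes the byte-valued sysctls
vfs.zfs.arc_max and kstat.zfs.misc.arcstats.size are available, the helper
names are hypothetical, and the 16 GiB limit in the demo is a made-up value
(the limit that was set is not stated in this report).

```python
import subprocess


def read_sysctl(name):
    """Return the integer value of a sysctl by running `sysctl -n NAME`."""
    out = subprocess.run(
        ["sysctl", "-n", name], capture_output=True, text=True, check=True
    )
    return int(out.stdout.strip())


def arc_status(read=read_sysctl):
    """Compare the current ARC size with the configured limit (bytes)."""
    limit = read("vfs.zfs.arc_max")
    size = read("kstat.zfs.misc.arcstats.size")
    return {
        "limit": limit,
        "size": size,
        # Assumption: a limit of 0 means no explicit limit has been set.
        "within_limit": limit == 0 or size <= limit,
    }


# Demo with hypothetical values standing in for a live system:
# a 16 GiB limit and the 9 GiB of ARC usage mentioned in the report.
fake = {
    "vfs.zfs.arc_max": 16 * 1024**3,
    "kstat.zfs.misc.arcstats.size": 9 * 1024**3,
}
status = arc_status(fake.__getitem__)
print(status)
```

On a live system, calling arc_status() with no arguments reads the real
values through sysctl(8).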

Nothing else stands out on the server: there is no unusual CPU utilisation or
network traffic, the cron jobs are the same as before the upgrades, and we
haven’t deployed any additional jails or services to the server.

We then upgraded from 13.3 to 14.1, as the 14.1 release notes list an erratum
for a ZFS memory leak:
https://www.freebsd.org/security/advisories/FreeBSD-EN-24:10.zfs.asc
That hasn’t resolved our issue.
The attached charts show the memory usage pattern we are dealing with. The
data is pulled from sysctl by the node_exporter for Prometheus.
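
To reproduce the charted values without Prometheus, the per-queue page counts
can be read from sysctl and multiplied by the page size. This is a sketch
only, not node_exporter's actual implementation: it assumes the
vm.stats.vm.v_*_count sysctls and hw.pagesize, the helper names are
hypothetical, and the demo values are made up.

```python
import subprocess

# Page-count sysctl for each memory queue (assumed names).
QUEUES = {
    "active": "vm.stats.vm.v_active_count",
    "inactive": "vm.stats.vm.v_inactive_count",
    "laundry": "vm.stats.vm.v_laundry_count",
    "wired": "vm.stats.vm.v_wire_count",
    "free": "vm.stats.vm.v_free_count",
}


def read_sysctl(name):
    """Return the integer value of a sysctl by running `sysctl -n NAME`."""
    out = subprocess.run(
        ["sysctl", "-n", name], capture_output=True, text=True, check=True
    )
    return int(out.stdout.strip())


def memory_bytes(read=read_sysctl):
    """Return {queue: bytes} by multiplying page counts by the page size."""
    page_size = read("hw.pagesize")
    return {queue: read(name) * page_size for queue, name in QUEUES.items()}


# Demo with hypothetical page counts standing in for a live system.
fake = {name: 0 for name in QUEUES.values()}
fake["hw.pagesize"] = 4096
fake["vm.stats.vm.v_laundry_count"] = 262144  # 262144 pages * 4096 B = 1 GiB
usage = memory_bytes(fake.__getitem__)
for queue, size in usage.items():
    print(f"{queue:9s} {size / 1024**3:6.2f} GiB")
```

On a live system, calling memory_bytes() with no arguments reads the real
counters, and sampling it periodically gives the same kind of series as the
attached charts.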

The attached chart_1 shows the memory usage of the server over the last week.

The attached chart_2 shows the final hour before the server begins thrashing.
Laundry grows to 62GB and Inactive and Free Memory are both reduced to <1GB.

The server has 4GB of swap, but it does not appear to be used heavily enough
to justify increasing the swap space. We have also temporarily disabled swap
entirely, and that hasn’t made any difference.

If you require any additional information, please let us know.

Kind regards,
Truespeed
