[Bug 279021] Random phantom files by g_new_bio() failure

From: <bugzilla-noreply_at_freebsd.org>
Date: Thu, 16 May 2024 06:59:03 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=279021

            Bug ID: 279021
           Summary: Random phantom files by g_new_bio() failure
           Product: Base System
           Version: 14.0-STABLE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: seigo.tanimura@gmail.com

A bug in g_new_bio() is suspected to cause random phantom files, often
silently.  It was exposed during the poudriere-bulk(8) test on bug #275594,
comment #147.

* Test Environment: Hypervisor
- CPU: Intel Core i7-13700KF 3.4GHz (24 threads)
- RAM: 128 GB
- OS: Windows 10
- Storage: NVMe and SATA HDDs
- Hypervisor: VMware Workstation 17.5

* Test Environment: VM & OS
- vCPUs: 16
- RAM: 16 GB
- Swap: 128 GB on NVMe
- OS: FreeBSD 14.1-BETA2
  - All of the releng/14.1 fixes in bug #275594, comment #147 applied.
- Storage & Filesystems: ZFS mainly
  - Main pool: 1.5G on SATA HDD
  - ZIL: 16 GB on NVMe
  - L2ARC: 64 GB on NVMe

* Application
- poudriere
  - Number of ports to build: 2325 (including dependencies)
  - Major configurations for port building
    - poudriere.conf
      - #NO_ZFS=yes (ZFS enabled)
      - USE_PORTLINT=no
      - USE_TMPFS="wrkdir data localbase"
      - TMPFS_LIMIT=32
      - DISTFILES_CACHE=(configured in ZFS)
      - CCACHE_DIR=(configured in ZFS)
        - The cache is cleared in advance.
      - CCACHE_STATIC_PREFIX=/usr/local
      - PARALLEL_JOBS=16 (actually given via "poudriere bulk -J")
    - make.conf
      - MAKE_JOBS_NUMBER=4

* Steps
1. Remove the package output directory, so that all packages are built.
2. Clear the ccache contents by "ccache -C".
3. Run 'poudriere bulk' to start the parallel build.
4. Observe the system and build progress by top(1), poudriere web UI,
cmdwatch(1) + sysctl(8), etc.

* Expected results
- All of the ports are built successfully.

* Observed behaviors during building
- After about 2 hours, the RAM was exhausted and the kernel started swapping
out pages.
- The bulk port build failed at random.
  + A header file or a library provided via the dependency was often missing.
- The kernel occasionally logged "swap_pager: cannot allocate bio".
- vm.uma.g_bio.stats.fails increased up to ~5000.

* Analysis
g_new_bio(), the kernel function that allocates a new bio in a non-blocking
manner, returns NULL if the g_bio uma(9) zone has no free items.  While this
is regarded as a rare error with an ordinary HDD, nvme(4) storage is likely to
trigger it because of its high capacity for parallel I/O operations.

Although not confirmed precisely, the effects of this issue seem to include
phantom files, i.e., newly created files that do not become visible
immediately.  Under poudriere-bulk(8), it is suspected that the files installed
during build-depends and lib-depends are not detected as expected.  The problem
happens at random; it depends on the state of the g_bio zone.

g_new_bio() emits no log on an allocation failure.  The only exception is the
swap pager, which logs "swap_pager: cannot allocate bio".  Otherwise, the
increase of vm.uma.g_bio.stats.fails is the sole record of the errors.
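
The failure mode described above can be sketched with a small userland model.
This is not the uma(9) API or the GEOM code; the names model_zone,
model_alloc_nowait(), model_free() and model_burst() are hypothetical and only
illustrate how a non-blocking allocator drops requests when no item is free,
leaving a counter as the only trace.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland model of an item zone; NOT the uma(9) API. */
struct model_zone {
	size_t avail;	/* items the zone can hand out without waiting */
	size_t in_use;	/* items currently allocated */
	size_t fails;	/* failed allocations, cf. vm.uma.g_bio.stats.fails */
};

/* Non-blocking allocation: 1 on success, 0 (think NULL) on failure. */
static int
model_alloc_nowait(struct model_zone *z)
{
	if (z->in_use >= z->avail) {
		z->fails++;	/* the only trace the failure leaves */
		return (0);
	}
	z->in_use++;
	return (1);
}

/* Return one item to the zone. */
static void
model_free(struct model_zone *z)
{
	assert(z->in_use > 0);
	z->in_use--;
}

/*
 * Issue nreq requests back to back without completing any of them.
 * Returns the number of requests that found no free item.
 */
static size_t
model_burst(struct model_zone *z, size_t nreq)
{
	size_t dropped = 0;

	for (size_t i = 0; i < nreq; i++)
		if (!model_alloc_nowait(z))
			dropped++;
	return (dropped);
}
```

In this model, a burst larger than the number of available items fails for
every request past the limit, which is the situation a highly parallel device
is more likely to create than a single HDD.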

* Proposed Fix and Test Results
Reserve some bios for the non-blocking allocation.  uma(9) supports item
reservation, which can be used to implement the fix.  Note that, in practice,
the item reservation of uma(9) can be configured at boot time only.
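
The idea of the reservation can be sketched with another userland model.
uma(9) documents uma_zone_reserve() and the M_USE_RESERVE allocation flag for
this purpose; whether the proposed patch uses exactly these is an assumption
here, and the code below does not call them.  The names resv_zone,
resv_alloc_waitok(), resv_alloc_nowait() and resv_free() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland model of a zone with reserved items. */
struct resv_zone {
	size_t capacity;	/* total items the zone can hold */
	size_t free_items;	/* items currently free */
	size_t reserve;		/* free items held back from ordinary callers */
	size_t fails;		/* failed non-blocking allocations */
};

/*
 * Blocking-style caller: may not touch the reserve.
 * Returns 1 on success, 0 when the caller would have to wait.
 */
static int
resv_alloc_waitok(struct resv_zone *z)
{
	if (z->free_items <= z->reserve)
		return (0);
	z->free_items--;
	return (1);
}

/*
 * Non-blocking caller: may consume the reserved items.
 * Returns 1 on success, 0 (think NULL) when nothing at all is free.
 */
static int
resv_alloc_nowait(struct resv_zone *z)
{
	if (z->free_items == 0) {
		z->fails++;
		return (0);
	}
	z->free_items--;
	return (1);
}

/* Return one item to the zone. */
static void
resv_free(struct resv_zone *z)
{
	assert(z->free_items < z->capacity);
	z->free_items++;
}
```

With 8 items and a reserve of 3, the blocking-style callers stop after 5
allocations, and the next 3 non-blocking allocations still succeed.  A
non-blocking failure is counted only once the reserve itself is used up.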

The proposed fix has been committed to the submitter's GitHub repository and
made public.

New Loader Tunable:
- kern.geom.reserved_new_bios
  The number of bios reserved for the non-blocking allocation.  (Default:
  65536)  Zero means no bios are reserved.  Due to a limitation of the uma(9)
  zone, this setting cannot be changed on a running host.

All of the sources are under
https://github.com/altimeter-130ft/freebsd-freebsd-src.

            |                                   | Git Commit Hash
Base Branch | Fix Branch                        | Base            | Fix
============+===================================+=================+============
main        | topic-bio-reservation             | c1ebd76c3f      | c784b64b8a
------------+-----------------------------------+-----------------+------------
stable/14   | stable/14-topic-bio-reservation   | 3c414a8c2f      | aeaac96a7a
------------+-----------------------------------+-----------------+------------
releng/14.1 | releng/14.1-topic-bio-reservation | e3e57ae30c      | 8f0281d20d
------------+-----------------------------------+-----------------+------------
releng/14.0 | releng/14.0-topic-bio-reservation | d338712beb      | 6f8fed52ee
------------+-----------------------------------+-----------------+------------
stable/13   | stable/13-topic-bio-reservation   | 85e63d952d      | 64b9962cec
------------+-----------------------------------+-----------------+------------
releng/13.3 | releng/13.3-topic-bio-reservation | be4f1894ef      | 4d233d7419
------------+-----------------------------------+-----------------+------------
releng/13.2 | releng/13.2-topic-bio-reservation | f5ac4e174f      | 7b156cbac8

Poudriere-bulk(8) has been tested with the releng/14.1-topic-bio-reservation
branch (and the ZFS fix on bug #275594, comment #147), with the following
results proving the fix:
- vm.uma.g_bio.stats.fails did not increase at all.
- "swap_pager: cannot allocate bio" did not appear in the log at all.
- The random build errors disappeared completely.
  + Only one port (graphics/gimp-app) failed, but due to a separate problem
    (an internal error of clang).
