[Bug 279021] Random phantom files by g_new_bio() failure
Date: Thu, 16 May 2024 06:59:03 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=279021

            Bug ID: 279021
           Summary: Random phantom files by g_new_bio() failure
           Product: Base System
           Version: 14.0-STABLE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: seigo.tanimura@gmail.com

A bug in g_new_bio() is suspected of causing random phantom files, often
silently. It was exposed during the poudriere-bulk(8) test on bug #275594,
comment #147.

* Test Environment: Hypervisor
  - CPU: Intel Core i7-13700KF 3.4GHz (24 threads)
  - RAM: 128 GB
  - OS: Windows 10
  - Storage: NVMe and SATA HDDs
  - Hypervisor: VMware Workstation 17.5

* Test Environment: VM & OS
  - vCPUs: 16
  - RAM: 16 GB
  - Swap: 128 GB on NVMe
  - OS: FreeBSD 14.1-BETA2
    - All of the releng/14.1 fixes in bug #275594, comment #147 applied.
  - Storage & Filesystems: ZFS mainly
    - Main pool: 1.5G on SATA HDD
    - ZIL: 16 GB on NVMe
    - L2ARC: 64 GB on NVMe

* Application
  - poudriere
    - Number of ports to build: 2325 (including dependencies)
  - Major configurations for port building
    - poudriere.conf
      - #NO_ZFS=yes (ZFS enabled)
      - USE_PORTLINT=no
      - USE_TMPFS="wrkdir data localbase"
      - TMPFS_LIMIT=32
      - DISTFILES_CACHE=(configured in ZFS)
      - CCACHE_DIR=(configured in ZFS)
        - The cache is cleared in advance.
      - CCACHE_STATIC_PREFIX=/usr/local
      - PARALLEL_JOBS=16 (actually given via "poudriere bulk -J")
    - make.conf
      - MAKE_JOBS_NUMBER=4

* Steps
  1. Remove the package output directory, so that all packages are built.
  2. Clear the ccache contents by "ccache -C".
  3. Run 'poudriere bulk' to start the parallel build.
  4. Observe the system and build progress by top(1), poudriere web UI,
     cmdwatch(1) + sysctl(8), etc.

* Expected results
  - All of the ports are built successfully.

* Observed behaviors during building
  - In about 2 hours, RAM ran out and the kernel started swapping out pages.
  - The bulk port build failed at random.
    + A header file or a library provided via the dependency was often
      missing.
  - The kernel occasionally logged "swap_pager: cannot allocate bio".
  - vm.uma.g_bio.stats.fails increased up to ~5000.

* Analysis

g_new_bio(), the kernel function that allocates a new bio in a non-blocking
manner, returns NULL if the g_bio uma(9) zone has no free items. While such
a case is regarded as a rare error with an ordinary HDD, nvme(4) storage is
likely to trigger the issue because of its high capacity for parallel I/O
operations.

Although not confirmed precisely, the effects of this issue seem to include
phantom files, i.e., newly created files that do not become visible
immediately. Under poudriere-bulk(8), it is suspected that the files
installed during build-depends and lib-depends are not detected as expected.

The problem happens at random; it depends on the state of the g_bio zone.

g_new_bio() emits no log on an allocation failure. The exception is the
swap pager, which logs "swap_pager: cannot allocate bio". Otherwise, the
increase of vm.uma.g_bio.stats.fails is the sole record of the errors.

* Proposed Fix and Test Results

Reserve some bios for the non-blocking allocation. uma(9) supports item
reservation, which can be used to implement the fix. NB: in practice, the
item reservation of uma(9) can be configured at boot time only.

The proposed fix has been committed to the submitter's GitHub repository
and made public.

New Loader Tunable:
- kern.geom.reserved_new_bios
  The number of bios reserved for the non-blocking allocation.
  (Default: 65536)
  Zero means no bios are reserved.
  Due to the limitation of the uma(9) zone, this setting cannot be changed
  on a running host.

All of the sources are under
https://github.com/altimeter-130ft/freebsd-freebsd-src.

            |                                   |       Git Commit Hash
Base Branch | Fix Branch                        | Base            | Fix
============+===================================+=================+============
main        | topic-bio-reservation             | c1ebd76c3f      | c784b64b8a
------------+-----------------------------------+-----------------+------------
stable/14   | stable/14-topic-bio-reservation   | 3c414a8c2f      | aeaac96a7a
------------+-----------------------------------+-----------------+------------
releng/14.1 | releng/14.1-topic-bio-reservation | e3e57ae30c      | 8f0281d20d
------------+-----------------------------------+-----------------+------------
releng/14.0 | releng/14.0-topic-bio-reservation | d338712beb      | 6f8fed52ee
------------+-----------------------------------+-----------------+------------
stable/13   | stable/13-topic-bio-reservation   | 85e63d952d      | 64b9962cec
------------+-----------------------------------+-----------------+------------
releng/13.3 | releng/13.3-topic-bio-reservation | be4f1894ef      | 4d233d7419
------------+-----------------------------------+-----------------+------------
releng/13.2 | releng/13.2-topic-bio-reservation | f5ac4e174f      | 7b156cbac8

Poudriere-bulk(8) has been tested with the releng/14.1-topic-bio-reservation
branch (and the ZFS fix on bug #275594, comment #147), with the following
results proving the fix:

- vm.uma.g_bio.stats.fails did not increase at all.
- "swap_pager: cannot allocate bio" did not appear in the log at all.
- The build error disappeared completely.
  + Only one port (graphics/gimp-app) failed, but due to a separate problem.
    (An internal error of clang.)
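
The reservation idea in the proposed fix can be illustrated with a small
userspace model. This is only a sketch, not the code in the fix branches:
the class BioPool, its methods, and the helper nowait_failures() are
hypothetical names made up for illustration, and the pool is a plain
counter rather than a real uma(9) zone. The assumption modeled here is that
blocking allocations may not consume the last "reserve" free items, while
non-blocking allocations may.

```python
class BioPool:
    """Toy userspace model of a bounded item pool with a reserve.

    This is NOT FreeBSD kernel code; all names here are illustrative.
    """

    def __init__(self, capacity, reserve=0):
        if not 0 <= reserve <= capacity:
            raise ValueError("reserve must be within [0, capacity]")
        self.free = capacity      # items currently available
        self.reserve = reserve    # items held back for non-blocking callers

    def alloc_blocking(self):
        """Model of a blocking allocation.

        Returns True if an item was taken, or False where a real blocking
        caller would sleep until an item is freed. It never dips into the
        reserve.
        """
        if self.free <= self.reserve:
            return False
        self.free -= 1
        return True

    def alloc_nowait(self):
        """Model of a non-blocking allocation that may use the reserve.

        Returns True on success, or False where the real call would
        return NULL.
        """
        if self.free == 0:
            return False
        self.free -= 1
        return True


def nowait_failures(capacity, reserve, blocking_demand, nowait_demand):
    """Count failed non-blocking allocations after a burst of blocking ones."""
    pool = BioPool(capacity, reserve)
    for _ in range(blocking_demand):
        pool.alloc_blocking()
    return sum(not pool.alloc_nowait() for _ in range(nowait_demand))


if __name__ == "__main__":
    print("no reserve:  ", nowait_failures(8, 0, 8, 3), "failures")
    print("reserve of 3:", nowait_failures(8, 3, 8, 3), "failures")
```

In this model, a pool of 8 items with no reserve is drained by 8 blocking
requests, so all 3 later non-blocking requests fail. With 3 items reserved,
the blocking requests stall once 3 items remain and the non-blocking
requests all succeed. The model says nothing about how large the reserve
should be on a real system.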