[Bug 280216] UFS deadly hangs while removing snapshot

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 10 Jul 2024 11:31:48 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=280216

            Bug ID: 280216
           Summary: UFS deadly hangs while removing snapshot
           Product: Base System
           Version: Unspecified
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: ant_mail@inbox.ru

I have a very unfortunate situation with a production server that forces me to
interrupt my weekends.

The server hangs on some Friday nights and has to be brought back to life by
physically powering it off and on. This began in autumn '23.

It appears as a filesystem hang: the server responds to ping, but every I/O
operation hangs.

I'm running 12-STABLE, and there may be some relation to commits made
during July-October '23.

It has been hard to investigate because this is a production server and the
total number of incidents is only about 7-8. Here is what I've found.

I'm using the 'snapshot' utility (package freebsd-snapshot) to make periodic
snapshots. It contains the following lines of code:

                logger -p daemon.notice \
                    "snapshot: removing $fs_dir/.snap/$fs_tag.$"
                system rm -f $fs_dir/.snap/$fs_tag.$i

The last messages logged by the system are:

Jun 28 22:10:06 serv root[52374]: snapshot: rotating snapshots
Jun 28 22:10:06 serv root[52375]: snapshot: rm /data/office/.snap/weekly.3
Jun 29 09:47:28 serv syslogd: kernel boot file is /boot/kernel/kernel
Jun 29 09:47:28 serv kernel: ---<<BOOT>>---

There is no evidence that the system performed any successful UFS reads or
writes after 'rm' was invoked.

After the power cycle, fsck found errors on some partitions, but the problematic
partition (/data/office) had no errors. There was also no problem removing the
snapshot manually (rm /data/office/.snap/weekly.3).

There are other UFS partitions on this server that are snapshotted the same
way, but they never hang.

UFS parameters of /data/office:

tunefs: POSIX.1e ACLs: (-a)                                enabled
tunefs: NFSv4 ACLs: (-N)                                   disabled
tunefs: MAC multilabel: (-l)                               disabled
tunefs: soft updates: (-n)                                 disabled
tunefs: soft update journaling: (-j)                       disabled
tunefs: gjournal: (-J)                                     enabled
tunefs: trim: (-t)                                         disabled
tunefs: maximum blocks per file in a cylinder group: (-e)  4096
tunefs: average file size: (-f)                            512000
tunefs: average number of files in a directory: (-s)       64
tunefs: minimum percentage of free space: (-m)             12%
tunefs: space to hold for metadata blocks: (-k)            6408
tunefs: optimization preference: (-o)                      time

What was tried:

Creating a new, larger partition, running newfs on it, and dumping and
restoring the data to the new partition. After a couple of months the server
hung again.

I suppose the problem arises when the snapshot grows large. This would explain
why it hangs only on some Fridays: removing the oldest snapshot means removing
the largest one, and when its size exceeds some threshold, the removal hangs.
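
One way to test the size hypothesis would be to log how much disk space the
snapshot holds just before it is removed, so that a later hang can be
correlated with the size. The following is only a sketch: the function name
snapshot_size_msg is hypothetical and not part of freebsd-snapshot, and it
assumes that du -k on a snapshot file reports the space the snapshot actually
occupies rather than its logical length.

```shell
#!/bin/sh
# Hypothetical helper (not part of freebsd-snapshot): print a log message
# that includes the disk space currently allocated to a snapshot file.
snapshot_size_msg() {
    snap="$1"
    # du -k prints allocated space in KB followed by the path; keep the number.
    # Assumption: for a UFS snapshot file this reflects the blocks it holds.
    size_kb=$(du -k "$snap" 2>/dev/null | awk '{ print $1 }')
    # An empty result means du could not read the file.
    [ -n "$size_kb" ] || return 1
    echo "snapshot: removing $snap (${size_kb} KB on disk)"
}

# Possible use in place of the plain logger call (variables as in the
# utility's own code quoted earlier in this report):
#   logger -p daemon.notice "$(snapshot_size_msg "$fs_dir/.snap/$fs_tag.$i")"
```

If the last line in the log before a hang then always shows a size above a
certain value, and the removals that complete normally are always below it,
that would support the threshold idea.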

Currently I have these snapshot sizes:
/data/office/    ufs    464GB   40.0%     44GB    3.8%  weekly.2        2024-06-07T22:11
/data/office/    ufs    464GB   40.0%     22GB    1.9%  weekly.1        2024-06-14T22:10
/data/office/    ufs    464GB   40.0%     18GB    1.5%  weekly.0        2024-06-21T22:11
/data/office/    ufs    464GB   40.0%      9GB    0.8%  daily.2         2024-07-08T00:03
/data/office/    ufs    464GB   40.0%    741MB    0.1%  daily.1         2024-07-09T00:03
/data/office/    ufs    464GB   40.0%    784MB    0.1%  hourly.1        2024-07-09T16:01
/data/office/    ufs    464GB   40.0%    594MB    0.0%  daily.0         2024-07-10T00:03
/data/office/    ufs    464GB   40.0%    590MB    0.0%  hourly.0        2024-07-10T12:01
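
To check which snapshot is currently the largest without reading the listing
by eye, a small filter like the one below could be used. This is a sketch under
assumptions: the function name largest_snapshot is hypothetical, the listing is
expected on standard input with one entry per line, the fifth column is taken
to be the snapshot size with a GB, MB or KB suffix, and the seventh column the
snapshot name, as in the listing above.

```shell
#!/bin/sh
# Hypothetical filter: read a snapshot listing on stdin and print the name
# of the largest snapshot followed by its size in MB.
largest_snapshot() {
    awk '
    # Convert a size such as "44GB" or "741MB" to megabytes.
    function to_mb(s,    n) {
        n = s + 0                      # leading number, suffix is ignored
        if (s ~ /GB$/) return n * 1024
        if (s ~ /KB$/) return n / 1024
        return n                       # anything else is assumed to be MB
    }
    # Only consider lines that have at least the seven expected columns.
    NF >= 7 {
        mb = to_mb($5)
        if (mb > max) { max = mb; tag = $7 }
    }
    END { if (tag != "") printf "%s %d\n", tag, max }
    '
}
```

Fed with the listing above, this should pick weekly.2, the oldest snapshot,
which is consistent with the idea that the oldest snapshot is also the largest.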

Any help is greatly appreciated.
