AWS - UFS corrupted when restoring from AWS Backup service

From: <bogdan-lists_at_neant.ro>
Date: Sat, 23 Jul 2022 08:33:03 UTC
Hello,

TL;DR: We have a bunch of EC2 machines in AWS running FreeBSD (AMI from the AWS Marketplace, file system is UFS). We have the AWS Backup service taking hourly snapshots of these machines (AMI + EBS snapshots, I believe). After a few months of snapshots we had to restore one of them and found that the file system was corrupted and fsck was not able to recover it. We are going to enable sync in fstab and see if that helps, but it's hard to tell, because the problem is hard to reproduce and the details of how everything works are fuzzy to me.

Longer version:

We use FreeBSD on web servers in AWS. Until January we were doing weekly AMI snapshots by running a script that would shut down the machine, create the AMI, then start the machine back up. That worked for a long time, but it is less than ideal, and shutting down production more often than weekly is rude.
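For reference, the old weekly script did roughly this (a sketch; the instance ID and AMI naming are placeholders):

    INSTANCE_ID="i-0123456789abcdef0"   # placeholder
    # Stop the instance so the file system is quiescent on disk
    aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
    aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID"
    # Create the AMI (EBS snapshots included) and wait until it's usable
    AMI_ID=$(aws ec2 create-image --instance-id "$INSTANCE_ID" \
        --name "weekly-$(date +%Y%m%d)" --query ImageId --output text)
    aws ec2 wait image-available --image-ids "$AMI_ID"
    # Bring production back up
    aws ec2 start-instances --instance-ids "$INSTANCE_ID"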

At the start of this year we switched to running AWS Backup hourly. It takes snapshots of a running machine without stopping it; I believe it's the same as creating an AMI with the "No reboot" checkbox checked. I assume it uses the same API call, but I can't confirm that. We ran a few recovery tests, read the docs, and confirmed with support; everything looked like it should work with no issues.
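If my assumption is right, each hourly backup is roughly equivalent to this call against a live machine (again a sketch, not confirmed to be what AWS Backup does internally):

    # Create an AMI of a *running* instance; --no-reboot means the OS
    # is never told, so dirty buffers are not flushed -- the snapshot
    # captures whatever happens to be on disk at that instant
    aws ec2 create-image --instance-id "$INSTANCE_ID" \
        --name "hourly-$(date +%Y%m%d%H%M)" --no-reboot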

A couple of weeks ago the EBS disk on one of the machines failed and we needed to restore it. When we did, it ran fsck on boot (which it hadn't in our previous tests) and failed to recover the file system, so the machine was effectively dead. I know we can mount the disk on a different machine and recover (some) data; that's not the point. We tried a few backups going back two weeks: same issue. We then tried about 5 more instances, and all of them ran fsck on boot. A couple were recovered, but that doesn't matter; it still means backups are not working the way we thought. So now we're effectively running without backups on our EC2 instances.
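For the record, salvaging data means attaching the restored volume to another FreeBSD instance and doing roughly this (device names are illustrative; check gpart show for the real ones):

    # Run fsck by hand on the restored volume's root partition
    # (on our Nitro instances EBS volumes show up as nvd devices)
    fsck -t ufs -y /dev/nvd1p3
    # Mount read-only to copy out whatever survived
    mount -t ufs -o ro /dev/nvd1p3 /mnt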

I'm not sure why this happens. Information is sparse and I'm making a lot of assumptions. Basically, I believe the snapshot process is equivalent to cutting power to the machine, and that happens every hour for months. The docs on UFS soft updates say there is a small chance of data loss after a crash, but when that power-cut-equivalent snapshot happens every hour over a period of months, the chance isn't so small any more. Still, Linux apparently doesn't have this problem, and everything I've read says that recent data might be lost but the file system itself should not be corrupted. And yet fsck isn't always able to recover it.
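One thing worth checking on an affected machine is whether soft updates (and the SU+J journal) are actually enabled; a quick way to see the current settings (device label is illustrative, it depends on the AMI):

    # Print the current tuning of the root file system, including the
    # soft updates and soft update journaling flags
    tunefs -p /dev/gpt/rootfs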

As far as I understand, with soft updates and "noasync" in fstab (the default), data is flushed to disk roughly every 30 seconds (according to the syncer(4) man page), asynchronously, while metadata is written synchronously. I'm thinking that maybe that's the issue and turning on sync in fstab might help. On the other hand, the syncer man page says "It is possible on some systems that a sync(2) occurring simultaneously with a crash may cause file system damage.", which suggests it might even make things worse? I don't know.
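The change itself is a one-word edit to /etc/fstab; something like this (the device label is whatever the AMI uses):

    # Before (roughly what the Marketplace AMI ships):
    /dev/gpt/rootfs   /   ufs   rw        1   1
    # After, forcing all I/O to be synchronous:
    /dev/gpt/rootfs   /   ufs   rw,sync   1   1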

We were not able to reproduce the problem reliably enough to test fixes. I'm not sure if or how anyone can help; I mainly wanted to send this message so that at least some other people are aware that AWS Backup doesn't play nice with FreeBSD on UFS.