sync vs async vs zfs

Paul Kraus paul at kraus-haus.org
Thu Sep 24 19:05:59 UTC 2015


On Sep 24, 2015, at 12:40, Quartz <quartz at sneakertech.com> wrote:

> I'm trying to spec out a new system that looks like it might be very sensitive to sync vs async writes. However, after some research and investigation I've come to realize that I don't think I understand a/sync as well as I thought I did and might be confused about some of the fundamentals.

Very short answer…

Both terms refer to writes only; there is no such thing as a sync or async read.

In the case of an async write, the application code (App) asks the Filesystem (FS) to write some data. The FS is free to do whatever it wants with the data and respond immediately that it has the data and that it _will_ write it to non-volatile (NV) storage (disk).

In the case of a sync write (at least as defined by Posix), the App asks the FS to write some data and not return until the data is committed to NV storage. The FS is required (by Posix) _not_ to acknowledge the write until the data _has_ been committed to NV storage.

So in the first case, the FS can accept the data, put it in its “write cache” (typically RAM), and respond to the App that the write is complete. When the FS has the time it then commits the data to NV storage. If the system crashes after the App has “written” the data but before the FS has committed it to NV storage, that data is lost.

In the second case, the FS _must_not_ respond to the App until the data is committed to NV storage. The App can be certain that the data is safe. This is critical for, among other things, databases that must process transactions in a specific order or at a specific time.
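
If you want to poke at the difference from a shell, FreeBSD ships an fsync(1) utility that forces a file’s dirty data out to stable storage. A rough sketch (the file path is just a placeholder, and ZFS may flush some of the data on its own before the fsync runs):

    # async: dd returns as soon as the data is in the FS write cache
    dd if=/dev/zero of=/pool/testfile bs=1m count=100
    # force anything still cached out to NV storage
    fsync /pool/testfile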

> Can someone point me to a good "newbie's guide" that explains sync vs async from the ground up? one that makes no assumptions about prior knowledge of filesystems and IO. And likewise, another guide specifically for how they relate to zfs pool/vdev configuration?

I don’t know of a basic guide to this; I just learned it from various places over 20 years in the business.

In terms of ZFS, the ARC acts as both write buffer and read cache. You can see this easily when running benchmarks such as iozone with files smaller than the amount of RAM. When making an async write call the FS responds almost immediately, so you are really measuring the efficiency of the ZFS code and memory bandwidth :-) I have seen write performance in the tens of GB/sec on drives that I know do not have that kind of bandwidth. Make the ARC too small to hold the entire file, or make the file too big to fit, and you start seeing the performance of the drives themselves. This is due (in part) to the TXG design of ZFS. You can watch the drives (via iostat -x) and see ZFS committing data in bursts (originally up to 30 seconds apart, now up to 5 seconds apart).
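
If you want to watch those bursts yourself, something like the following works (the device names are examples; use whatever your pool is built on):

    # extended per-device stats, refreshed every second; writes
    # should arrive in bursts as each TXG commits
    iostat -x -w 1 da0 da1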

Now when you issue a sync write to ZFS, in order to adhere to Posix requirements, ZFS _must_ commit the data to NV storage before returning an acknowledgement to the App. So ZFS has the ZIL (ZFS Intent Log). All sync writes are committed to the ZIL immediately and then incorporated into the dataset itself as TXGs commit.

The ZIL is just space stolen from the zpool _unless_ you have a Separate Log Device (SLOG), which is just a special type of vdev (like spare) and is listed as “log” in a zpool status. By having a SLOG you gain two things: 1) ZFS no longer needs to steal space from the dataset for the ZIL, so the dataset will be much less fragmented, and 2) you can use a device which is much faster than the main zpool devices (like a ZeusRAM or fast SSD) and greatly speed up sync writes.
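
Adding a SLOG is a one-liner; the pool and device names below are just examples:

    # dedicate a fast SSD as a log vdev for the pool "tank"
    zpool add tank log ada2
    # the device now shows up under its own "logs" section
    zpool status tank

If the sync writes matter enough to buy a SLOG, they probably matter enough to mirror it (zpool add tank log mirror ada2 ada3), since losing an unmirrored log device at the wrong moment can cost you the most recent sync writes.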

You can see the performance difference between async and sync using iozone with the -o option. From the iozone manpage: “Writes are synchronously written to disk. (O_SYNC). Iozone will open the files with the O_SYNC flag. This forces all writes to the file to go completely to disk before returning to the benchmark.”
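
For example (the sizes and record length are arbitrary; make the file bigger than RAM if you also want to defeat the ARC):

    # async write/rewrite test
    iozone -i 0 -s 512m -r 128k
    # the same test with O_SYNC; expect much lower numbers
    # unless you have a fast SLOG
    iozone -i 0 -o -s 512m -r 128k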

I hope this gets you started …

--
Paul Kraus
paul at kraus-haus.org


