ZFS performance notes

Ganesh Varadarajan ganesh.varadarajan at gmail.com
Thu Sep 23 10:43:36 UTC 2010


[This is extracted from a series of mails Samir and I wrote while
investigating ZFS performance on FreeBSD. The ZFS version doesn't matter
much; we used both v14 and v25 with largely similar results.]

We've finally got some understanding of ZFS performance in the sync-write
case (important for NFS and file server loads) and why it's as low as it is.
Bottom line: It's not ZFS's fault. We need a fast NVRAM-type log device if
we want performance.

This is the iostat output for an 8k sync-write load with iozone on a ZFS
filesystem (FreeBSD 8.1, local, no NFS, pool consisting of just one 7200rpm
SATA disk):
iozone -s 100m -r 8 -i 0 -o -f /tp/fs/f1 (gives 935 KB/sec throughput)

       tty             ad4              ad6             ad10             cpu
 tin  tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   0    39  0.00   0  0.00  12.00 118  1.39   0.00   0  0.00   0  0  0  0 100
   0    39  0.00   0  0.00  12.00 117  1.38   0.00   0  0.00   0  0  0  0 100
   0    39  0.00   0  0.00  69.32 220 14.88   0.00   0  0.00   0  0  0  0 99
   0    39  0.00   0  0.00  12.00 118  1.38   0.00   0  0.00   0  0  1  0 99
   0    39  0.00   0  0.00  12.00 118  1.39   0.00   0  0.00   0  0  0  0 100

Several interesting things here. Note that each transaction is 12K. This is
as we expect: each 8k write is padded with 4k of ZIL overhead. We're able to
do 118 transactions per second (aka IOPS).

The steady-state throughput on disk is 1.39 MB/sec, which correlates well
with the 935 KB/sec observed by the application once the ZIL metadata
overhead is discounted. The 14MB/sec spike corresponds to a flush of log
data to the primary filesystem.
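
As a sanity check: 118 IOPS x 8 KB of application data is ~944 KB/sec,
matching the 935 KB/sec iozone reports, while 118 IOPS x 12 KB on the wire
is ~1.4 MB/sec, matching the iostat figure for ad6.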

Removing the sync option, we try
iozone -s 1000m -r 8 -i 0 -f /tp/fs/f1
and see that ZFS can generate quite a storm of IO.

       tty             ad4              ad6             ad10             cpu
 tin  tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   0    39  8.38   2  0.02  128.00 751 93.82   0.00   0  0.00   0  0 13  1 86
   0    39  0.00   0  0.00  128.00 1020 127.56   0.00   0  0.00   0  0  1  1 98
   0    40  0.00   0  0.00  128.00 1015 126.92   0.00   0  0.00   0  0  2  1 97
   0    40  0.00   0  0.00  124.94 702 85.66   0.00   0  0.00   0  0 14  0 85

Now, the question is: if ZFS can generate 120MB/sec sustained throughput to
the disk, why is the ZIL throttled so low?

And the answer is...

On-disk write cache. <ba-dum chesh!>

By default, FreeBSD leaves the on-disk write cache enabled. That's why
iozone to a raw disk shows very large numbers:
iozone -s 1000m -r 8 -i 0 -o -f /dev/ad10 (gives 80 MB/sec)

       tty             ad4              ad6             ad10             cpu
 tin  tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   0    39  0.00   0  0.00   0.00   0  0.00   8.00 10031 78.36   0  0  9  6 86
   0    40  0.00   0  0.00   0.00   0  0.00   8.00 10019 78.28   0  0  9  6 84
   0    40  0.00   0  0.00   0.00   0  0.00   8.00 10040 78.43   0  0  9  6 85

Note that we're doing 8k IOs as expected, but we're completing 10,000 of
them in a second (10,000 IOPS!) Clearly, this is because the disk is caching
them. They're not going to the platter, and might very well disappear if
there is a power failure. Default UFS on such a disk therefore exhibits good
performance.
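
As an aside, the current setting of the ATA write-cache knob can be read
back with sysctl (1 means the on-disk cache is enabled, the default):

sysctl hw.ata.wc
hw.ata.wc: 1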

At higher IO sizes (64-128K), we max out at 120MB/sec - not bad for a
single 7200 RPM 500GB SATA disk, what?

Random reads of 8k on the raw disk give ~1.4MB/sec (since nothing can be
cached), and this is a true reflection of platter performance. Notice that
this correlates well with the ZFS sync-write numbers we got first. Random
writes are higher (~3MB/sec), but nowhere near the max rate, reflecting that
the disk cache can't absorb them as well as it absorbs sequential writes.
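
(We didn't list the commands for these runs; iozone's random read/write test
is selected with -i 2, so they would have looked something like
iozone -s 1000m -r 8 -i 0 -i 2 -f /dev/ad10.)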

Now if we turn the FreeBSD write cache off (add hw.ata.wc="0" to
/boot/loader.conf and reboot), then for sync IO to a raw device,
iozone -s 10m -r 12 -i 0 -o -f /dev/ad10, we see a throughput of 1.4MB/sec,
exactly the same as the first ZFS case.

       tty             ad4              ad6             ad10             cpu
 tin  tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   0    39  0.00   0  0.00   0.00   0  0.00  12.00 119  1.39   0  0  0  0 100
   0    39  0.00   0  0.00   0.00   0  0.00  12.00 118  1.39   0  0  0  0 100
   0    39  0.00   0  0.00   0.00   0  0.00  12.00 119  1.40   0  0  0  0 100

I chose 12k to match the 8k data + 4k ZIL overhead of the earlier workload.
We see that IOPS drop from ~10,000 to ~120 when we turn off the disk cache.

UFS does *worse* than ZFS if we turn off the write-cache.
iozone -s 10m -r 8 -i 0 -o

   0    39  0.00   0  0.00   0.00   0  0.00  16.00 145  2.27   0  0  0  0 100

It manages to generate 2.2 MB/sec of throughput, but notice that it has
turned every 8k app write into a 16k write. The actual throughput as seen by
iozone is 495K/sec, roughly a fourth of what goes to disk. This is because
of the major inefficiencies in UFS sync writes (8k->16k, metadata updates)
discussed earlier.

When we do async writes (also with the write-cache off), throughput goes up
to ~14MB/sec.
iozone -s 100m -r 8 -i 0

   0    40  0.00   0  0.00   0.00   0  0.00  128.00 107 13.43   0  0  0  0 100
   0    40  0.00   0  0.00   0.00   0  0.00  128.00 108 13.56   0  0  0  0 100

Note that UFS gathers writes into 128k chunks, but we're limited by IOPS
here. The platters can't do more than 110-120 IOPS when the cache is useless
(either turned off or defeated by a random load).

With write-cache turned off, ZFS async writes give exactly the same
performance as UFS - ~14MB/sec, 108 IOPS, 128k IO sizes. Sync writes are the
same old 1.3MB/sec.

So, ZFS is actually being pretty smart. In the normal case, when disk
write-cache is turned on, it uses the cache for all regular writes, but when
it comes to ZIL writes, it makes sure that the bytes hit the platter. So the
ZIL, although in theory sequential, actually degenerates to close to random
write performance, because we cannot gather the IO at any stage. 8k hits the
platter, we go back to the application, which spits out 8k more. Meanwhile,
the disk has spun so the second IO has to wait for the planets to align
again.
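
The ~120 figure is essentially rotational latency: a 7200 rpm disk makes 120
revolutions per second, so if each sync write ends up waiting roughly one
full rotation before it can land, ~120 synchronous IOPS per spindle is about
the best we can expect.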

There is a ZFS tunable which says "don't force a platter write for ZIL,
treat it as normal IO".
Set vfs.zfs.cache_flush_disable="1" in /boot/loader.conf and reboot
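
(The current value should also be readable at runtime via
sysctl vfs.zfs.cache_flush_disable.)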

Now you will see the ZIL unfettered by the ~100 IOPS limit
iozone -s 1000m -r 8 -i 0 -o -f /tp/fs/f1 (gives ~40MB/sec throughput, up
from 1.3!)

   0    39  0.00   0  0.00  12.00 5530 64.80   0.00   0  0.00   0  0 13  3 84
   0    39  0.00   0  0.00  12.00 5506 64.53   0.00   0  0.00   1  0 11  2 86


Note that we go to 5k IOPS and 64MB/sec ZIL throughput.
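
The arithmetic holds up: ~5,500 IOPS x 12 KB is the ~65 MB/sec iostat shows
on the wire, while ~5,500 IOPS x 8 KB of application data is ~43 MB/sec, in
line with the ~40 MB/sec iozone reports.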

To summarize so far:

   1. If we turn off the disk write-cache, we are IOPS-limited. 128k *
      100-120 IOPS is the best we can hope for under any circumstances per
      disk of this type.
   2. ZFS does the sane thing, using the disk cache when possible and
      avoiding it when semantics demand otherwise. In our case that boils
      down to ZIL and platter writes every time, since our NFS client issues
      sync writes.
   3. If we want to honour NFS sync write semantics (and we must) without
      getting bad performance, we need either
      1. some kind of NVRAM-based log accelerator (see the zpool example
         below), or
      2. a large number of spindles in a mirror-stripe config (untested).
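
As a reference point for 3.1, attaching a dedicated log device to an
existing pool is a one-line operation; the pool and device names here are
only placeholders:

zpool add tank log da0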

All this has been for single-threaded loads. Now let's look into logbias and
scaling with multi-threaded loads.

Multi-threading does break through the 100-120 IOPS sound barrier for Real
Sync Writes (TM), if and only if the disk write cache is enabled. We can go
up to ~300 IOPS, delivering an aggregate of 27MB/sec for 8k writes with 64
threads. Per-thread throughput remains low, at around 500K/sec.
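
Runs like these would typically be driven with iozone's throughput mode
(-t, with -F listing one file per thread); the invocation would look roughly
like:

iozone -t 64 -s 10m -r 8 -i 0 -o -F /tp/fs/f1 /tp/fs/f2 ... /tp/fs/f64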

   1. How does this happen? As we saw earlier, ZFS performs best when the
      write cache is enabled. In that case,
      1. ZIL writes are done to the log device, going to the disk's
         write-cache,
      2. then a cache flush is issued,
      3. and on a successful flush (the bits have hit the platter), success
         is returned to the caller.
   2. If another ZIL write happens between 1.1 and 1.2 above (the
      multi-threading case), then one cache flush does for both. This is
      what enables us to break through the IOPS barrier without compromising
      correctness.
   3. If we turn off the write-cache, then we simply cannot bust the 100
      IOPS barrier, no matter how many threads we use. This serves to
      confirm the above.

All tests so far have been with logbias=latency. If we change to
logbias=throughput, a per-filesystem option, ZFS avoids using the separate
log device for that filesystem and writes its ZIL blocks directly to primary
storage. This is useful when there are multiple filesystems in a pool and
the log device is an SSD: instead of flooding the limited SSD with
sync-writes from all filesystems, you may want to use it selectively for a
few (logbias=latency) and not for others (logbias=throughput). Not much use
for us.
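
For completeness, logbias is an ordinary per-filesystem property; the pool
and dataset names below are placeholders:

zfs set logbias=throughput tank/scratch
zfs get logbias tank/scratch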

Multiple disks, no log device: adding more disks to the pool does help; ZFS
nicely distributes data, including sync-writes, across all of them.
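
The mirror-stripe layout mentioned in summary point 3.2 would be built along
these lines (again, disk names are placeholders):

zpool create tank mirror ada1 ada2 mirror ada3 ada4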

A ramdisk (as a substitute for NVRAM) used as the log device improved
performance significantly, as expected, to about 40 MB/sec for a
single-threaded 8k sync-write load.

Ramdisks sometimes cause odd oscillations: the log fills very fast, ZFS
can't (or doesn't) drain the pent-up IO to primary storage fast enough, and
then ZFS switches to the primary disks for writing the log for several
seconds. This shows up as sharp swings in throughput.

Ramdisks also help a lot with RAID-Z performance. Everything gets staged
through the log, which helps gather large striped writes in memory.
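
To reproduce the ramdisk experiment on FreeBSD, mdconfig will do; the size
and names below are illustrative only:

mdconfig -a -t swap -s 2g    (prints the new device name, e.g. md0)
zpool add tank log md0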

Rewrites do worse than writes when the IO size is less than the recordsize.

Dedup has a negligible effect on throughput for single-threaded loads.
Apparently performance falls off a cliff once the dedup table no longer fits
in memory, but we didn't push it that far.

Tentative recommendations:

   1. Use a dedicated NVRAM device for the log. Size should be at least 1
      GB, ideally 2-4 GB. Random write latencies to the NVRAM should be good
      for sizes from 8k to 64k.
   2. Since all writes are sync, increasing RAM size beyond 2x the NVRAM
      size won't help much, so 8 GB of RAM should be enough for a typical
      2-4 GB NVRAM case. We expect reads to be cached at the client.
   3. For dedup, many people recommend keeping the dedup table on a flash
      device capable of fast random-access reads. This avoids the
      performance cliff for typical storage and record sizes without
      requiring extra RAM.
   4. Mirroring is preferred to RAID-Z for raw performance. RAID-Z ain't
      bad though, if we have an NVRAM log.
   5. Set recordsize to 32k - NFS write ops don't exceed this size (this
      needs more study; see the example below).
   6. Leave logbias at its default value, 'latency'.
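
Like logbias, recordsize is a per-dataset property; the dataset name below
is a placeholder:

zfs set recordsize=32k tank/nfs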

More accurate measurements and sizing recommendations can be made if we can
get our hands on a "typical" customer system, with 6 or 12 spindles and an
NVRAM device.

