zfs raidz overhead

Eric A. Borisch eborisch at gmail.com
Tue Feb 21 23:31:34 UTC 2017

On Tue, Feb 21, 2017 at 2:45 AM, Eugene M. Zheganin <emz at norma.perm.ru> wrote:


There's an interesting case described here:

It's a story from a user who found that, in some situations, zfs on
raidz can use up to 200% of the space for a zvol.

I have also seen this. For instance:

[root at san1:~]# zfs get volsize gamestop/reference1
 gamestop/reference1 volsize 2,50T local
 [root at san1:~]# zfs get all gamestop/reference1
 gamestop/reference1 type volume -
 gamestop/reference1 creation Thu Nov 24 9:09 2016 -
 gamestop/reference1 used 4,38T -
 gamestop/reference1 available 1,33T -
 gamestop/reference1 referenced 4,01T -
 gamestop/reference1 compressratio 1.00x -
 gamestop/reference1 reservation none default
 gamestop/reference1 volsize 2,50T local
 gamestop/reference1 volblocksize 8K -
 gamestop/reference1 checksum on default
 gamestop/reference1 compression off default
 gamestop/reference1 readonly off default
 gamestop/reference1 copies 1 default
 gamestop/reference1 refreservation none received
 gamestop/reference1 primarycache all default
 gamestop/reference1 secondarycache all default
 gamestop/reference1 usedbysnapshots 378G -
 gamestop/reference1 usedbydataset 4,01T -
 gamestop/reference1 usedbychildren 0 -
 gamestop/reference1 usedbyrefreservation 0 -
 gamestop/reference1 logbias latency default
 gamestop/reference1 dedup off default
 gamestop/reference1 mlslabel -
 gamestop/reference1 sync standard default
 gamestop/reference1 refcompressratio 1.00x -
 gamestop/reference1 written 4,89G -
 gamestop/reference1 logicalused 2,72T -
 gamestop/reference1 logicalreferenced 2,49T -
 gamestop/reference1 volmode default default
 gamestop/reference1 snapshot_limit none default
 gamestop/reference1 snapshot_count none default
 gamestop/reference1 redundant_metadata all default

[root at san1:~]# zpool status gamestop
 pool: gamestop
 state: ONLINE
 scan: none requested

 gamestop ONLINE 0 0 0
 raidz1-0 ONLINE 0 0 0
 da6 ONLINE 0 0 0
 da7 ONLINE 0 0 0
 da8 ONLINE 0 0 0
 da9 ONLINE 0 0 0
 da11 ONLINE 0 0 0

 errors: No known data errors

Or, on another server (the overhead in this case isn't that big, but it is still there):

[root at san01:~]# zfs get all data/reference1
 data/reference1 type volume -
 data/reference1 creation Fri Jan 6 11:23 2017 -
 data/reference1 used 3.82T -
 data/reference1 available 13.0T -
 data/reference1 referenced 3.22T -
 data/reference1 compressratio 1.00x -
 data/reference1 reservation none default
 data/reference1 volsize 2T local
 data/reference1 volblocksize 8K -
 data/reference1 checksum on default
 data/reference1 compression off default
 data/reference1 readonly off default
 data/reference1 copies 1 default
 data/reference1 refreservation none received
 data/reference1 primarycache all default
 data/reference1 secondarycache all default
 data/reference1 usedbysnapshots 612G -
 data/reference1 usedbydataset 3.22T -
 data/reference1 usedbychildren 0 -
 data/reference1 usedbyrefreservation 0 -
 data/reference1 logbias latency default
 data/reference1 dedup off default
 data/reference1 mlslabel -
 data/reference1 sync standard default
 data/reference1 refcompressratio 1.00x -
 data/reference1 written 498K -
 data/reference1 logicalused 2.37T -
 data/reference1 logicalreferenced 2.00T -
 data/reference1 volmode default default
 data/reference1 snapshot_limit none default
 data/reference1 snapshot_count none default
 data/reference1 redundant_metadata all default
 [root at san01:~]# zpool status gamestop
 pool: data
 state: ONLINE
 scan: none requested

 data ONLINE 0 0 0
 raidz1-0 ONLINE 0 0 0
 da3 ONLINE 0 0 0
 da4 ONLINE 0 0 0
 da5 ONLINE 0 0 0
 da6 ONLINE 0 0 0
 da7 ONLINE 0 0 0
 raidz1-1 ONLINE 0 0 0
 da8 ONLINE 0 0 0
 da9 ONLINE 0 0 0
 da10 ONLINE 0 0 0
 da11 ONLINE 0 0 0
 da12 ONLINE 0 0 0
 raidz1-2 ONLINE 0 0 0
 da13 ONLINE 0 0 0
 da14 ONLINE 0 0 0
 da15 ONLINE 0 0 0
 da16 ONLINE 0 0 0
 da17 ONLINE 0 0 0

 errors: No known data errors

So my question is: how can I avoid it? Right now I'm experimenting with
the volblocksize, setting it to around 64k. I also suspect that such
overhead may be a consequence of various resizing operations, like
extending the volsize of the volume or adding new disks to the pool,
because I have a couple of servers with raidz where the initial
disk/volsize configuration didn't change, and there the referenced and
volsize numbers are pretty close to each other.



It comes down to the zpool's sector size (2^ashift) and the volblocksize --
I'm guessing your old servers are at ashift=9 (512), and the new one is at
12 (4096), likely with 4k drives. The sector size is the smallest (atomic)
unit that ZFS reads from or writes to a drive.

As described in [1]:
 * Allocations need to be a multiple of (p+1) sectors, where p is your
parity level; for raidz1, p==1, and allocations need to be in multiples of
(1+1)=2 sectors, or 8k (for ashift=12; this is the physical size /
alignment on drive).
 * Each stripe row also needs its own parity for failures, so at larger
block/record sizes the overhead also depends [2] on the number of drives.

So considering those requirements, and your zvol with volblocksize=8k and
compression=off, one logical 8k block is always allocated physically as two
(4k) data sectors, one (p=1) parity sector (4k), and one padding sector (4k)
to make the total a multiple of (p+1=) 2 sectors. That is 16k allocated on
disk, hence your observed 2x data size being actually allocated. Each of
these sectors will be on a different drive. This is different from the
sector-level parity in RAID5.
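
The arithmetic above can be sketched in a few lines of Python. This is only a rough model of the rules as summarized here (data sectors, one set of parity sectors per stripe row, padding to a multiple of p+1); the function name raidz_alloc_bytes and its parameters are made up for illustration and are not a ZFS API. The 5-disk raidz1 layout is taken from the zpool status output quoted above.

```python
def raidz_alloc_bytes(block_bytes, ashift, ndisks, nparity):
    """Estimate bytes allocated on a raidz vdev for one uncompressed block.

    Hypothetical helper: a simplified model of the allocation rules
    described in the text, not code taken from ZFS.
    """
    sector = 1 << ashift                          # smallest unit read/written
    data = -(-block_bytes // sector)              # data sectors, rounded up
    rows = -(-data // (ndisks - nparity))         # stripe rows; each needs parity
    total = data + nparity * rows                 # data + parity sectors
    total = -(-total // (nparity + 1)) * (nparity + 1)  # pad to multiple of p+1
    return total * sector


# One 8k block on a 5-disk raidz1 with 4k sectors (ashift=12):
# 2 data + 1 parity + 1 padding sector.
alloc = raidz_alloc_bytes(8192, ashift=12, ndisks=5, nparity=1)
print(alloc, alloc / 8192)   # 16384 2.0
```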

As Matthew Ahrens [1] points out: "Note that setting a small recordsize
with 4KB sector devices results in universally poor space efficiency --
RAIDZ-p is no better than p-way mirrors for recordsize=4K or 8K."

Things you can do:

 * Use ashift=9 (and perhaps 512-byte sector drives). The same layout rules
still apply, but now your 'atomic' size is 512 bytes. You will want to test
performance.
 * Use a larger volblocksize, especially if the filesystem on the zvol uses
a larger block size. If you aren't performance sensitive, use a larger
volblocksize even if the hosted filesystem doesn't. (But test this out to
see how performance sensitive you really are! ;) You'll need to use
something like dd to move data between different block size zvols.
 * Enable compression if the contents are compressible (some likely will be).
 * Use a pool created from mirrors instead of raidz if you need
high-performance small blocks while retaining redundancy.

You don't get efficient (better than mirrors) redundancy, performant small
(as in small multiple of zpool's sector size) block sizes, and zfs's
flexibility all at once.
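
To get a feel for how much the first two options could help, here is a small Python sketch that tabulates the estimated allocated/logical ratio for a few volblocksize values at ashift=9 and ashift=12. It assumes a 5-disk raidz1 like the pools shown above, no compression, and the simplified allocation rules described earlier; raidz_overhead is a hypothetical helper, not part of any ZFS tool, and real pools may report somewhat different numbers.

```python
def raidz_overhead(block_bytes, ashift, ndisks=5, nparity=1):
    """Estimated allocated/logical ratio for one uncompressed block.

    Hypothetical helper: simplified model, defaults assume a 5-disk raidz1.
    """
    sector = 1 << ashift
    data = -(-block_bytes // sector)              # data sectors, rounded up
    rows = -(-data // (ndisks - nparity))         # stripe rows needing parity
    total = data + nparity * rows
    total = -(-total // (nparity + 1)) * (nparity + 1)  # pad to multiple of p+1
    return total * sector / block_bytes


for ashift in (9, 12):
    for kib in (8, 16, 32, 64, 128):
        ratio = raidz_overhead(kib * 1024, ashift)
        print(f"ashift={ashift:2d} volblocksize={kib:3d}K "
              f"allocated/logical={ratio:.2f}")
```

Under these assumptions, 8K at ashift=12 comes out at 2.00 (the same as a 2-way mirror), 16K at 1.50, and 32K and larger at 1.25; at ashift=9 every size listed comes out at 1.25.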

 - Eric

[1] https://www.delphix.com/blog/delphix-engineering/zfs-rai
[2] My spin on Ahrens' spreadsheet: https://docs.google.com/spread

More information about the freebsd-fs mailing list