ZFS vs UFS2 overhead, and maybe a bug?
Pawel Jakub Dawidek
pjd at FreeBSD.org
Thu May 3 20:26:17 UTC 2007
On Tue, May 01, 2007 at 10:22:43PM -0700, Bakul Shah wrote:
> Here is a surprising result for ZFS.
>
> I ran the following script on both ZFS and UFS2 filesystems.
>
> $ dd </dev/zero bs=1m count=10240 >SPACY  # 10G zero bytes allocated
> $ truncate -s 10G HOLEY                   # no space allocated
>
> $ time dd <SPACY >/dev/null bs=1m # A1
> $ time dd <HOLEY >/dev/null bs=1m # A2
> $ time cat SPACY >/dev/null # B1
> $ time cat HOLEY >/dev/null # B2
> $ time md5 SPACY # C1
> $ time md5 HOLEY # C2
>
> I have summarized the results below.
>
> (all times in seconds)
>
>                      ZFS             UFS2
>                Elapsed  System  Elapsed  System   Test
> dd SPACY bs=1m  110.26   22.52   340.38   19.11   A1
> dd HOLEY bs=1m   22.44   22.41    24.24   24.13   A2
>
> cat SPACY       119.64   33.04   342.77   17.30   B1
> cat HOLEY       222.85  222.08    22.91   22.41   B2
>
> md5 SPACY       210.01   77.46   337.51   25.54   C1
> md5 HOLEY       856.39  801.21    82.11   28.31   C2
>
>
> A1, A2:
> Numbers are more or less as expected. When doing large
> reads, reading from "holes" takes far less time than from a
> real disk. We also see that the UFS2 disk is about 3 times
> slower for sequential reads.
>
> B1, B2:
> UFS2 numbers are as expected but ZFS numbers for the HOLEY
> file are much too high. Why should *not* going to a real
> disk cost more? We also see that UFS2 handles holey files 10
> times more efficiently than ZFS!
>
> C1, C2:
> Again, UFS2 numbers and C1 numbers for ZFS are as expected,
> but C2 numbers for ZFS are very high. md5 uses BLKSIZ (==
> 1k) size reads and does hardly any other system calls. For
> ZFS each syscall takes 76.4 microseconds while UFS2 syscalls
> are 2.7 us each! zpool iostat shows there is no IO to the
> real disk so this implies that for the HOLEY case zfs read
> calls have a significantly higher overhead or there is a bug.
>
> Basically C tests just confirm what we find in B tests.
Interesting. There are two problems. The first is that cat(1) uses
st_blksize to find the best size for its I/O requests, and we force it
to PAGE_SIZE, which is very, very wrong for ZFS - it should be equal to
the recordsize. I need to find the discussion about this:
	/*
	 * According to www.opengroup.org, the meaning of st_blksize is
	 * "a filesystem-specific preferred I/O block size for this
	 * object. In some filesystem types, this may vary from file
	 * to file"
	 * Default to PAGE_SIZE after much discussion.
	 * XXX: min(PAGE_SIZE, vp->v_bufobj.bo_bsize) may be more
	 * correct.
	 */
	sb->st_blksize = PAGE_SIZE;
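Just to show the direction a fix could take (a rough sketch only, not a
patch, and assuming the usual vap/vattr naming in vn_stat()): the vattr
filled in by VOP_GETATTR() could drive st_blksize, with PAGE_SIZE kept
only as a fallback, so that ZFS could report its recordsize:

	/*
	 * Sketch only: prefer the preferred I/O size reported by the
	 * filesystem via VOP_GETATTR() (va_blksize) and fall back to
	 * PAGE_SIZE when it is missing or too small.  For ZFS,
	 * va_blksize could then carry the dataset recordsize
	 * (128kB by default).
	 */
	if (vap->va_blksize >= PAGE_SIZE)
		sb->st_blksize = vap->va_blksize;
	else
		sb->st_blksize = PAGE_SIZE;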
Not every utility even consults st_blksize: cp(1), for example, just
uses MAXBSIZE, which is also not really good, but at least MAXBSIZE
(64kB) is much bigger than PAGE_SIZE. So basically what you observed
with cat(1) is the equivalent of running dd(1) with bs=4k.
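To put the buffer sizes in perspective, here is a trivial snippet (just
an illustration, not one of the tests above) that counts how many
read(2) calls a single pass over the 10GB file needs at each buffer
size:

	#include <stdio.h>

	int
	main(void)
	{
		const long long filesize = 10LL * 1024 * 1024 * 1024;	/* the 10GB test file */
		const long bufsize[] = { 1024, 4096, 131072 };	/* md5(1), PAGE_SIZE, ZFS recordsize */
		const char *what[] = { "md5 (1k reads)", "cat/dd bs=4k", "dd bs=128k" };

		for (int i = 0; i < 3; i++)
			printf("%-16s %9lld read(2) calls\n", what[i],
			    filesize / bufsize[i]);
		return (0);
	}

That is roughly 10.5 million calls at 1k, 2.6 million at 4k and only
81920 at 128k, so whatever per-call overhead the ZFS read path has for
holey files gets multiplied accordingly.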
I tested it on Solaris as well, so this is not a FreeBSD-specific
problem; the behavior is the same there. Is there a chance you could
send your observations to zfs-discuss at opensolaris.org, but with just
the comparison between dd(1) with bs=128k and bs=4k (the other tests
might be confusing)?
--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!