ZFS vs UFS2 overhead, and maybe a bug?
Pawel Jakub Dawidek
pjd at FreeBSD.org
Thu May 3 20:26:17 UTC 2007
On Tue, May 01, 2007 at 10:22:43PM -0700, Bakul Shah wrote:
> Here is a surprising result for ZFS.
>
> I ran the following script on both ZFS and UFS2 filesystems.
>
> $ dd </dev/zero bs=1m count=10240 >SPACY  # 10G zero bytes allocated
> $ truncate -s 10G HOLEY                   # no space allocated
>
> $ time dd <SPACY >/dev/null bs=1m # A1
> $ time dd <HOLEY >/dev/null bs=1m # A2
> $ time cat SPACY >/dev/null # B1
> $ time cat HOLEY >/dev/null # B2
> $ time md5 SPACY # C1
> $ time md5 HOLEY # C2
>
> I have summarized the results below.
>
> (all times in seconds)
>
>                      ZFS             UFS2
>                Elapsed  System  Elapsed  System   Test
> dd SPACY bs=1m  110.26   22.52   340.38   19.11   A1
> dd HOLEY bs=1m   22.44   22.41    24.24   24.13   A2
>
> cat SPACY       119.64   33.04   342.77   17.30   B1
> cat HOLEY       222.85  222.08    22.91   22.41   B2
>
> md5 SPACY       210.01   77.46   337.51   25.54   C1
> md5 HOLEY       856.39  801.21    82.11   28.31   C2
>
>
> A1, A2:
> Numbers are more or less as expected. When doing large
> reads, reading from "holes" takes far less time than from a
> real disk. We also see that the UFS2 disk is about 3 times
> slower for sequential reads.
>
> B1, B2:
> UFS2 numbers are as expected but ZFS numbers for the HOLEY
> file are much too high. Why should *not* going to a real
> disk cost more? We also see that UFS2 handles holey files 10
> times more efficiently than ZFS!
>
> C1, C2:
> Again, UFS2 numbers and C1 numbers for ZFS are as expected,
> but C2 numbers for ZFS are very high. md5 uses BLKSIZ (==
> 1k) size reads and does hardly any other system calls. For
> ZFS each syscall takes 76.4 microseconds while UFS2 syscalls
> are 2.7 us each! zpool iostat shows there is no IO to the
> real disk so this implies that for the HOLEY case zfs read
> calls have a significantly higher overhead or there is a bug.
>
> Basically C tests just confirm what we find in B tests.
Interesting. There are two problems. The first is that cat(1) uses
st_blksize to find the best size for its I/O requests, and we force it
to PAGE_SIZE, which is very, very wrong for ZFS - it should be equal to
the recordsize. I need to find the discussion about this:
	/*
	 * According to www.opengroup.org, the meaning of st_blksize is
	 * "a filesystem-specific preferred I/O block size for this
	 * object. In some filesystem types, this may vary from file
	 * to file"
	 * Default to PAGE_SIZE after much discussion.
	 * XXX: min(PAGE_SIZE, vp->v_bufobj.bo_bsize) may be more
	 * correct.
	 */
	sb->st_blksize = PAGE_SIZE;
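Just to show the direction a fix could take (a rough sketch only, not a
patch, and assuming the usual vap/vattr naming in vn_stat()): the vattr
filled in by VOP_GETATTR() could drive st_blksize, with PAGE_SIZE kept
only as a fallback, so that ZFS could report its recordsize:

	/*
	 * Sketch only: prefer the preferred I/O size reported by the
	 * filesystem via VOP_GETATTR() (va_blksize) and fall back to
	 * PAGE_SIZE when it is missing or too small.  For ZFS,
	 * va_blksize could then carry the dataset recordsize
	 * (128kB by default).
	 */
	if (vap->va_blksize >= PAGE_SIZE)
		sb->st_blksize = vap->va_blksize;
	else
		sb->st_blksize = PAGE_SIZE;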
Not every utility even consults st_blksize: cp(1), for example, just
uses MAXBSIZE, which is also not really good, but at least MAXBSIZE
(64kB) is much bigger than PAGE_SIZE. So basically what you observed
with cat(1) is the equivalent of running dd(1) with bs=4k.
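To put the buffer sizes in perspective, here is a trivial snippet (just
an illustration, not one of the tests above) that counts how many
read(2) calls a single pass over the 10GB file needs at each buffer
size:

	#include <stdio.h>

	int
	main(void)
	{
		const long long filesize = 10LL * 1024 * 1024 * 1024;	/* the 10GB test file */
		const long bufsize[] = { 1024, 4096, 131072 };	/* md5(1), PAGE_SIZE, ZFS recordsize */
		const char *what[] = { "md5 (1k reads)", "cat/dd bs=4k", "dd bs=128k" };

		for (int i = 0; i < 3; i++)
			printf("%-16s %9lld read(2) calls\n", what[i],
			    filesize / bufsize[i]);
		return (0);
	}

That is roughly 10.5 million calls at 1k, 2.6 million at 4k and only
81920 at 128k, so whatever per-call overhead the ZFS read path has for
holey files gets multiplied accordingly.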
I tested it on Solaris as well, so this is not a FreeBSD-specific
problem; the behavior is the same there. Is there a chance you could
send your observations to zfs-discuss at opensolaris.org, but with just
the comparison between dd(1) with bs=128k and bs=4k (the other tests
might be confusing)?
--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!