ZFS and large directories - caveat report
Bob Friesenhahn
bfriesen at simple.dallas.tx.us
Thu Jul 21 22:03:15 UTC 2011
On Thu, 21 Jul 2011, Freddie Cash wrote:
> The recordsize property in ZFS is the "max" block size used. It is not the
> only block size used for a dataset. ZFS will use any block size from 0.5 KB
> to $recordsize KB, as determined by the size of the file to be written (it
> tries to find the recordsize that most closely matches the file size to
> use the least number of blocks per write).
Except for tail blocks (the last block in a file), the uncompressed data
block size will always be the "max" block size. When compression is
enabled, each block usually compresses to something smaller than that
"max" size, and zfs then stores it as a correspondingly smaller block
on disk. This approach minimizes the performance impact from
fragmentation, copy on write (COW), and block metadata.
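
To make the compression point concrete, here is a rough Python sketch of how a full-size block can end up smaller on disk. It is only a toy model: zlib stands in for whatever compressor the dataset uses, and the 512-byte sector, the 128 KiB recordsize, and the name on_disk_size are my assumptions for illustration, not zfs internals.

```python
import os
import zlib

SECTOR = 512              # assumed smallest allocation unit
RECORDSIZE = 128 * 1024   # assumed dataset recordsize

def on_disk_size(block: bytes, sector: int = SECTOR) -> int:
    """Bytes allocated for one logical block in this toy model."""
    compressed = len(zlib.compress(block))
    # Round the compressed length up to a whole number of sectors.
    rounded = -(-compressed // sector) * sector
    # Never allocate more than storing the block uncompressed would take.
    return min(rounded, len(block))

pattern = b"all work and no play "
text_like = (pattern * (RECORDSIZE // len(pattern) + 1))[:RECORDSIZE]
random_like = os.urandom(RECORDSIZE)

print("compressible block:  ", on_disk_size(text_like), "bytes on disk")
print("incompressible block:", on_disk_size(random_like), "bytes on disk")
```

The logical block is RECORDSIZE bytes in both cases; only the space allocated on disk differs.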
It would not make sense for zfs to behave as you describe since files
are written starting from scratch and so zfs has no knowledge of the
final file size until it is completely written (and even then, more
data could be written, or the file might be truncated). Zfs could
have knowledge of a file size if the application did a seek to the
ultimate length and wrote something, or used ftruncate to set the
size, but the file size can still be arbitrarily changed.
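
A small Python sketch of the point above, using only ordinary POSIX-style calls: the size the filesystem sees changes with every step, so there is no moment at which it can be treated as final. The function name observed_sizes and the particular sizes are arbitrary choices for illustration.

```python
import os
import tempfile

def observed_sizes():
    """Return the file size seen after each step of an ordinary write sequence."""
    sizes = []
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()

        # 1. Plain sequential write: the file grows as data arrives.
        f.write(b"x" * 1000)
        f.flush()
        sizes.append(os.fstat(fd).st_size)

        # 2. Seek to the intended final length and write one byte.
        f.seek(1_000_000 - 1)
        f.write(b"\0")
        f.flush()
        sizes.append(os.fstat(fd).st_size)

        # 3. ftruncate sets the size directly, here shrinking the file.
        os.ftruncate(fd, 4096)
        sizes.append(os.fstat(fd).st_size)

        # 4. Nothing stops the application from appending again afterwards.
        f.seek(0, os.SEEK_END)
        f.write(b"y" * 10)
        f.flush()
        sizes.append(os.fstat(fd).st_size)
    return sizes

print(observed_sizes())
```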
When raidzN is used, the data block is split into smaller chunks which
are distributed among the disks. When mirroring is used, full blocks
are written to each disk.
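
As a rough illustration of the difference, the Python sketch below splits one block into equal chunks plus a single XOR parity chunk (a simplified stand-in for raidz1) and, for comparison, writes the whole block to every disk of a mirror. The real raidz layout is more involved than this; the function names and the equal-chunk split are assumptions made for the example.

```python
from typing import List

def raidz1_layout(block: bytes, data_disks: int) -> List[bytes]:
    """Split a block into data_disks chunks plus one XOR parity chunk (toy model)."""
    chunk_len = -(-len(block) // data_disks)          # ceiling division
    padded = block.ljust(chunk_len * data_disks, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(data_disks)]
    parity = 0
    for chunk in chunks:
        parity ^= int.from_bytes(chunk, "big")        # XOR of all data chunks
    return chunks + [parity.to_bytes(chunk_len, "big")]

def mirror_layout(block: bytes, disks: int) -> List[bytes]:
    """Every disk in a mirror receives the full block."""
    return [block] * disks

block = bytes(range(256)) * 512                       # one 128 KiB block
print([len(p) for p in raidz1_layout(block, 4)])      # per-disk chunk sizes
print([len(p) for p in mirror_layout(block, 2)])      # per-disk copy sizes
```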
It is important to realize that the zfs block checksum is for the
uncompressed/unsplit original data block and not for some bit of data
which eventually ended up on a disk. For example, when raidz is used,
there is no independent checksum for the data chunks distributed
across the disks. The zfs approach assures end-to-end validation and
avoids having to recompute all data checksums (perhaps incorrectly)
when doing 'zfs send'.
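
The following Python sketch follows that description: one checksum is taken over the whole block before it is split, and verification happens only after the chunks are reassembled. SHA-256 is used purely as a stand-in, and split_block and verify are hypothetical helper names, not zfs functions.

```python
import hashlib
from typing import List

def split_block(block: bytes, n: int) -> List[bytes]:
    """Cut a block into n consecutive chunks (the last may be shorter)."""
    chunk_len = -(-len(block) // n)
    return [block[i * chunk_len:(i + 1) * chunk_len] for i in range(n)]

def verify(chunks: List[bytes], expected: bytes) -> bool:
    """Reassemble the chunks and check the single whole-block checksum."""
    return hashlib.sha256(b"".join(chunks)).digest() == expected

block = b"zfs" * 1000
stored_sum = hashlib.sha256(block).digest()   # one checksum for the whole block
chunks = split_block(block, 3)                # no checksum is kept per chunk

damaged = list(chunks)
damaged[1] = b"X" + damaged[1][1:]            # flip one byte on one "disk"

print("healthy block verifies:", verify(chunks, stored_sum))
print("damaged block verifies:", verify(damaged, stored_sum))
```

In this model a bad chunk is detected only when the reassembled block fails its checksum; the checksum alone does not say which chunk was at fault.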
Zfs metadata sizes are not related to the zfs block size.
Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/