bhyve: zvols for guest disk - yes or no?
Jan Bramkamp
crest at rlwinm.de
Fri Nov 18 10:49:08 UTC 2016
On 18/11/2016 09:47, Miroslav Lachman wrote:
> Jan Bramkamp wrote on 2016/11/17 11:16:
>> On 16/11/2016 19:10, Patrick M. Hausen wrote:
>>>> Without ZFS you would require a reliable hardware RAID controller (if
>>>> such a magical creature exists) instead (or build a software RAID1+0
>>>> from gmirror and gstripe). IMO money is better invested in more RAM,
>>>> keeping ZFS and the admin happy.
>>>
>>> And we always use geom_mirror with UFS ...
>>
>> That would work, but I wouldn't recommend it for new setups. ZFS offers
>> a lot of operational flexibility which in my opinion alone is worth the
>> overhead. Without ZFS you would have to use either large raw image files
>> on UFS or fight with an old-fashioned volume manager.
>
> One thing to note - ZFS isn't a holy grail and has its own problems too.
Of course ZFS isn't perfect. Nothing as complex as ZFS could be.
> For example there is no fsck_zfs and there are some cases where you can
> end up with a broken pool, and because of its complexity the only thing
> you can do is restore from backup.
That's because ZFS takes a different approach to data and metadata
integrity. By design ZFS should be able to recover automatically, without
data loss, from all the cases a fsck_zfs could handle without user
interaction. This is possible because ZFS is a Merkle-DAG (edges are
stored inside nodes and contain the checksums of the referenced nodes)
and stores multiple copies of important metadata (in addition to
mirroring and RAID-Z). Fsck on UFS includes a good amount of guesswork
which usually works because the UFS on-disk data structures are a lot
simpler. That way you end up with some state the kernel can mount
without panic()ing, but it doesn't imply that it's always exactly the
state the users and applications expected the system to be in.
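
To make the edge-checksum idea concrete, here is a toy sketch in Python
(nothing like the real ZFS code; BlockPointer, checksum() and the
two-copy setup are made up for illustration): the checksum lives in the
referencing pointer, so a silently corrupted copy is detected and the
read is healed from a surviving ditto copy.

    import hashlib

    def checksum(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    class BlockPointer:
        # Toy stand-in for a ZFS block pointer: it stores the checksum
        # of the data it references plus the locations of all copies.
        def __init__(self, copies):
            self.copies = list(copies)            # ditto copies of one block
            self.cksum = checksum(self.copies[0])

        def read(self) -> bytes:
            # The parent-stored checksum tells us which copies are
            # intact, so a single corrupted copy is survivable.
            for data in self.copies:
                if checksum(data) == self.cksum:
                    return data
            raise IOError("all copies failed the checksum")

    ptr = BlockPointer([b"metadata", b"metadata"])  # two ditto copies
    ptr.copies[0] = b"garbage?"                     # silent corruption
    assert ptr.read() == b"metadata"                # healed from copy two
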
> This can occur on ZFS with higher
> probability than on simple UFS2.
Only if you pick your metrics with a strong bias in favor of UFS. The
ZFS data structures are more complicated and you can't repair a ZFS pool
with a hex editor and a pocket calculator. At the same time ZFS protects
its data (including metadata) a lot better from corruption.
* ZFS uses a copy-on-write tree with path copying instead of modifying
live data in place (see the sketch after this list).
* Because the ZFS graph is directed and cycle-free, its edges can (and
do) contain the checksum of the pointed-to node.
* By default ZFS stores three copies of vital pool-level metadata and
two copies of dataset-level metadata in addition to VDEV-level redundancy.
* Both the UFS and the ZFS code have been battle-tested in production
for many years.
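
To illustrate the first point, a toy path-copying sketch in Python
(again made up for illustration, not ZFS internals): writing a leaf
allocates new nodes along the path up to a new root, the old tree stays
valid, and untouched subtrees are shared instead of copied.

    class Node:
        def __init__(self, left=None, right=None, data=None):
            self.left, self.right, self.data = left, right, data

    def cow_write(root, path, data):
        """Return a NEW root; path is a string of 'l'/'r' steps."""
        if not path:
            return Node(data=data)       # fresh leaf, old leaf untouched
        if path[0] == 'l':
            return Node(left=cow_write(root.left, path[1:], data),
                        right=root.right)    # right subtree is shared
        return Node(left=root.left,          # left subtree is shared
                    right=cow_write(root.right, path[1:], data))

    old = Node(left=Node(data="a"), right=Node(data="b"))
    new = cow_write(old, "l", "a2")
    assert old.left.data == "a"      # old tree is still fully consistent
    assert new.right is old.right    # unchanged subtree shared, not copied

This is why a crash in the middle of a write leaves you with the old
consistent tree instead of a half-updated one.
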
UFS is not suitable for today's large file systems. It trusts its
backing storage too much. UFS can't protect your data from undetected
read errors because it doesn't store any checksums along with the data.
It can't help you detect phantom writes because there are no checksums
in the edges. You could swap two blocks of file content with each other
and UFS wouldn't notice.
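
A contrived Python sketch of that last point (hypothetical, just to show
why the placement of the checksum matters): a checksum stored with the
block itself, roughly the way a disk's internal ECC works, survives a
block swap, while a checksum stored in the referencing edge catches it.

    import hashlib

    def cksum(b: bytes) -> bytes:
        return hashlib.sha256(b).digest()

    blocks = [b"block A", b"block B"]
    self_sums = [cksum(b) for b in blocks]  # stored WITH each block
    edge_sums = [cksum(b) for b in blocks]  # stored in the parent pointers

    # A phantom write lands block B where block A belongs (and vice
    # versa); the per-block checksums travel with the data.
    blocks[0], blocks[1] = blocks[1], blocks[0]
    self_sums[0], self_sums[1] = self_sums[1], self_sums[0]

    # Every block still matches its own checksum: the swap is invisible.
    assert all(cksum(b) == s for b, s in zip(blocks, self_sums))
    # The edge checksums live in the untouched parent and catch it.
    assert cksum(blocks[0]) != edge_sums[0]
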
The ratio of disk capacity to throughput has reached a point where it is
no longer acceptable to run fsck at boot. UFS2 on FreeBSD offers
soft updates and snapshots which allow fsck to run in the background,
but this requires a lot of RAM and steals a lot of IOPS from the other
applications running on the system. Running with journaled soft updates
instead requires even more trust in notoriously lying disks, disk
controllers and their caches. Additionally UFS snapshots and journaled
soft updates are incompatible, and without snapshots you can't create
consistent backups of your file systems.
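
To put rough numbers on that (hypothetical but plausible figures): just
reading a 10 TB disk sequentially at 200 MB/s takes

    10 TB / 200 MB/s = 10,000,000 MB / 200 MB/s = 50,000 s ≈ 14 hours

and fsck's access pattern is far from sequential, so a foreground check
at boot could keep a machine down for the better part of a day.
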
UFS is a great file system for the hardware it was designed for, but
hardware evolved and now we have to deal with orders of magnitude more
storage on disks which haven't gotten a lot more reliable. There are
still use-cases for UFS and it is a good fit for small systems, even if
most of these systems could use a NAND-flash-optimized file system as
well.
-- Jan Bramkamp