Gmirror/graid or hardware raid?

Paul Kraus paul at kraus-haus.org
Thu Jul 9 14:32:48 UTC 2015


On Jul 8, 2015, at 17:21, Charles Swiger <cswiger at mac.com> wrote:

> On Jul 8, 2015, at 12:49 PM, Mario Lobo <lobo at bsd.com.br> wrote:

<snip>

> Most of the PROD databases I know of working from local storage have heaps of
> RAID-1 mirrors, and sometimes larger volumes created as RAID-10 or RAID-50.
> Higher volume shops use dedicated SAN filers via redundant Fibre Channel mesh
> or similar for their database storage needs.

Many years ago I had a client buy a couple of racks FULL of trays of 36 GB SCSI drives (yes, it was that long ago) and partition them so that they only used the first 1 GB of each. This was purely for performance. They were running a relatively large Oracle database with lots of OLTP transactions.

> 
>> I thought about zfs but I won't have lots of RAM avaliable
> 
> ZFS wants to be run against bare metal.  I've never seen anyone setup ZFS within
> a VM; it consumes far too much memory and it really wants to talk directly to the
> hardware for accurate error detection.

ZFS runs fine in a VM, and the notion that it _needs_ lots of RAM is mostly false. I have run a FreeBSD Guest with ZFS and only 1 GB of RAM.
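If RAM in a Guest is really tight, you can also cap the ARC with a loader tunable. This is just an illustrative value, not something I necessarily had to set on that 1 GB Guest:

# in /boot/loader.conf
vfs.zfs.arc_max="512M"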

But… ZFS is designed first and foremost for data reliability, not performance. It gets its performance from striping across many vdevs (the ZFS term for the top-level devices you assemble zpools out of), the ARC (Adaptive Replacement Cache), and log devices. Striping requires many drives. The ARC uses any available RAM as a very aggressive FS cache. A log device improves sync writes by committing them to dedicated fast storage (usually a mirror of fast SSDs).
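As a rough sketch of what that looks like on the command line (the disk names ada0 through ada5 and the pool name tank are placeholders, not taken from any real box of mine):

zpool create tank mirror ada0 ada1 mirror ada2 ada3
zpool add tank log mirror ada4 ada5

The first command stripes writes across two mirror vdevs; the second adds a mirrored log device for sync writes.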

I generally use ZFS for the Host (and because of my familiarity with ZFS, I tend to use ZFS for all of the Host filesystems). Then I use UFS for the Guests _unless_ I might need to migrate data in or out of a VM or I need flexibility in partitioning (once you build a zpool, all zfs datasets in it can grab as much or as little space as they need). I can use zfs send / recv (even incrementally) to move data around quickly and easily. I generally turn on compression for VM datasets (I set up one zfs dataset per VM) as the CPU cost is noise and it saves a bunch of space (and reduces physical disk I/O, which also improves performance). I do NOT turn on compression in any ZFS inside a Guest, as I am already compressing at the Host layer.
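Roughly, that workflow looks like this (the dataset, snapshot, and host names below are made up for illustration):

zfs create -o compression=lz4 vm-001/new-vm-01
zfs snapshot vm-001/new-vm-01@2015-07-08
zfs send vm-001/new-vm-01@2015-07-08 | ssh otherhost zfs recv backuppool/new-vm-01
zfs snapshot vm-001/new-vm-01@2015-07-09
zfs send -i @2015-07-08 vm-001/new-vm-01@2015-07-09 | ssh otherhost zfs recv backuppool/new-vm-01

The first send is a full copy; the second only sends the blocks that changed between the two snapshots.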

I also have a script that grabs a snapshot of every ZFS dataset every hour and replicates them over to my backup server. Since ZFS snapshots have no performance penalty, the only cost to keep them around is the space used. This has proven to be a lifesaver: when a Guest is corrupted, I can easily and quickly roll it back to the most recent clean version.
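The script itself is nothing fancy. A minimal sketch (the state file, pool name, and backup host are placeholders, not my actual setup):

#!/bin/sh
# take an atomic, recursive snapshot of the whole pool
NOW=$(date "+%Y-%m-%d-%H")
PREV=$(cat /var/db/last-sent-snap)    # hypothetical state file
zfs snapshot -r "vm-001@${NOW}"
# replicate everything changed since the previous snapshot
zfs send -R -i "@${PREV}" "vm-001@${NOW}" | ssh backuphost zfs recv -du backuppool
echo "${NOW}" > /var/db/last-sent-snap

Rolling a corrupted Guest back is then just a zfs rollback to the last good snapshot.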

> 
>> Should I use the controller raid? Gmirror/Graid? What raid level?
> 
> Level is easy: a 4-disk machine is suited for either a pair of RAID-1s, a 4-disk RAID-10 volume,
> or a 4-disk RAID-5 volume.

For ZFS, the number of vdevs and their type determine performance. For a vdev of each of the following types, you can expect roughly the listed performance, relative to a single disk.

N-way mirror: write 1x, read Nx
RaidZ: write 1x, read 1x minimum, but variable

Note that the performance of a RaidZ vdev does NOT scale with the number of drives in the RAID set, nor does it change with the RaidZ level (Z1, Z2, Z3).

So, for example, a zpool consisting of 4 vdevs, each a 2-way mirror, will have 4x the write performance of a single drive and 8x the read performance. A zpool consisting of 2 vdevs, each a RaidZ2 of 4 drives, will have 2x the write performance of a single drive, and the read performance will be a minimum of 2x that of a single drive. The read performance of RaidZ is variable because RaidZ does not always write full stripes across all the drives in the vdev. In other words, RaidZ is a variable stripe-width RAID system. This has advantages and disadvantages :-) Here is a good blog post that describes RaidZ stripe width: http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/
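For reference, here is roughly what those two example layouts look like when you create them (the da0 through da7 disk names and pool names are placeholders):

zpool create fastpool mirror da0 da1 mirror da2 da3 mirror da4 da5 mirror da6 da7
zpool create bulkpool raidz2 da0 da1 da2 da3 raidz2 da4 da5 da6 da7

The first gives you 4 mirror vdevs (about 4x write / 8x read of one drive); the second gives you 2 RaidZ2 vdevs (about 2x write, read at least 2x).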

I do NOT use RaidZ for anything except bulk backup data where capacity is all that matters and performance is limited by lots of other factors.

I also create a “do-not-remove” dataset in every zpool with a 1 GB reservation and quota. ZFS behaves very, very badly when FULL. This gives me a cushion when things go badly so I can delete whatever used up all the space … Yes, because ZFS is copy-on-write, it cannot delete files if the FS is completely FULL. I leave the “do-not-remove” dataset unmounted so that it cannot be used.
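Creating it is a one-liner. This is my best guess at the exact options based on the zfs list output below, so treat it as a sketch:

zfs create -o reservation=1G -o quota=1G -o mountpoint=none rootpool/do-not-remove

If the pool ever fills up, zfs set reservation=none rootpool/do-not-remove gives back the cushion so deletes can proceed.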

Here is the config of my latest server (names changed to protect the guilty):

root at host1:~ # zpool status
  pool: rootpool
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	rootpool    ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    ada2p3  ONLINE       0     0     0
	    ada3p3  ONLINE       0     0     0

errors: No known data errors

  pool: vm-001
 state: ONLINE
  scan: none requested
config:

	NAME                             STATE     READ WRITE CKSUM
	vm-001                           ONLINE       0     0     0
	  mirror-0                       ONLINE       0     0     0
	    diskid/DISK-WD-WMAYP2681136  ONLINE       0     0     0
	    diskid/DISK-WD-WMAYP3653359  ONLINE       0     0     0

errors: No known data errors
root at host1:~ # zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
rootpool                  35.0G   383G    19K  none
rootpool/ROOT             3.79G   383G    19K  none
rootpool/ROOT/2015-06-10     1K   383G  3.01G  /
rootpool/ROOT/default     3.79G   383G  3.08G  /
rootpool/do-not-remove      19K  1024M    19K  none
rootpool/software         18.6G   383G  18.6G  /software
rootpool/tmp              4.29G   383G  4.29G  /tmp
rootpool/usr              3.98G   383G    19K  /usr
rootpool/usr/home           19K   383G    19K  /usr/home
rootpool/usr/ports        3.63G   383G  3.63G  /usr/ports
rootpool/usr/src           361M   383G   359M  /usr/src
rootpool/var              3.20G   383G    19K  /var
rootpool/var/crash          19K   383G    19K  /var/crash
rootpool/var/log          38.5M   383G  1.19M  /var/log
rootpool/var/mail         42.5K   383G  30.5K  /var/mail
rootpool/var/tmp            19K   383G    19K  /var/tmp
rootpool/var/vbox         3.17G   383G  2.44G  /var/vbox
vm-001                     166G   283G    21K  /vm/local
vm-001/aaa-01             61.1G   283G  17.0G  /vm/local/aaa-01
vm-001/bbb-dev-01         20.8G   283G  13.1G  /vm/local/bbb-dev-01
vm-001/ccc-01             21.5K   283G  20.5K  /vm/local/ccc-01
vm-001/dev-01             4.10G   283G  3.19G  /vm/local/dev-01
vm-001/do-not-remove        19K  1024M    19K  none
vm-001/ddd-01             4.62G   283G  2.26G  /vm/local/ddd-01
vm-001/eee-dev-01         16.6G   283G  15.7G  /vm/local/eee-dev-01
vm-001/fff-01             7.44G   283G  3.79G  /vm/local/fff-01
vm-001/ggg-02             2.33G   283G  1.77G  /vm/local/ggg-02
vm-001/hhh-02             8.99G   283G  6.80G  /vm/local/hhh-02
vm-001/iii-repos          36.2G   283G  36.2G  /vm/local/iii-repos
vm-001/test-01            2.63G   283G  2.63G  /vm/local/test-01
vm-001/jjj-dev-01           19K   283G    19K  /vm/local/jjj-dev-01
root at host1:~ # 


--
Paul Kraus
paul at kraus-haus.org
