[Fwd: Re: Large ZFS arrays?]
Kevin Day
toasty at dragondata.com
Sun Jun 15 16:00:26 UTC 2014
On Jun 15, 2014, at 10:43 AM, Dennis Glatting <dg at pki2.com> wrote:
>
> Total. I am looking at three pieces in total:
>
> * Two 1PB storage "blocks" providing load sharing and
> mirroring for failover.
>
> * One 5PB storage block for on-line archives (3-5 years).
>
> The 1PB nodes will be divided into something that makes sense, such as
> multiple SuperMicro 847 chassis with 3TB disks providing some number of
> volumes. Division is a function of application, such as 100TB RAIDz2
> volumes for bulk storage and smaller 8TB volumes for active data,
> such as iSCSI, databases, and home directories.
>
> Thanks.
We’re currently using multiples of the SuperMicro 847 chassis with 3TB and 4TB drives, and LSI 9207 controllers. Each 45-drive array is configured as four 11-drive raidz2 groups, plus one hot spare.
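As a rough sketch of that layout (the da device names are placeholders and will differ per system):

  # one 45-drive chassis: four 11-drive raidz2 vdevs plus one hot spare
  zpool create tank1 \
    raidz2 da0  da1  da2  da3  da4  da5  da6  da7  da8  da9  da10 \
    raidz2 da11 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 \
    raidz2 da22 da23 da24 da25 da26 da27 da28 da29 da30 da31 da32 \
    raidz2 da33 da34 da35 da36 da37 da38 da39 da40 da41 da42 da43 \
    spare  da44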
A few notes:
1) I’d highly recommend against grouping them all together into one giant zpool unless you really, really have to. We just spent a lot of time redoing everything so that each 45-drive array is its own zpool/filesystem. Otherwise you’re putting all your eggs into one very big basket, and if something goes wrong you lose everything rather than just a subset of your data. If you don’t do this, you’ll almost certainly have to run with sync=disabled, or the number of sync requests hitting every drive will kill write performance (see the example below).
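A minimal sketch of that sync trade-off (the pool/dataset names are just examples):

  # check the current sync policy on a dataset
  zfs get sync tank1/bulk
  # disabling sync trades crash safety of recent writes for latency -- use with care
  zfs set sync=disabled tank1/bulk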
2) You definitely want a JBOD controller instead of a smart RAID controller. The LSI 9207 works pretty well, but once you exceed 192 drives it complains at boot about running out of heap space and makes you press a key to continue, after which it works fine. A very recently released firmware update for the card seems to fix this, but we haven’t completed testing yet. You’ll also want to increase hw.mps.max_chains. The driver warns you when you need to, but the change requires a reboot, and you’ll probably only discover the need under heavy load.
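Since it’s a loader tunable for the mps(4) driver, it goes in /boot/loader.conf (the value below is only an example; pick one based on what the driver reports under your load):

  # /boot/loader.conf
  hw.mps.max_chains="4096"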
3) We’ve played with L2ARC SSD devices, and aren’t seeing much gain. It appears our active data set is so large that we’d need a huge SSD to cover even a small percentage of our frequently used files. Setting “secondarycache=metadata” does seem to help a bit, but it’s probably not worth the hassle for us. This will depend entirely on your workload, though.
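If you want to experiment with it anyway, the commands look roughly like this (device and pool names are examples):

  # attach an SSD to the pool as an L2ARC cache device
  zpool add tank1 cache ada1
  # keep only metadata (not file data) in the L2ARC for this pool
  zfs set secondarycache=metadata tank1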
4) “zfs destroy” can be excruciatingly expensive on large datasets (see http://blog.delphix.com/matt/2012/07/11/performance-of-zfs-destroy/). It’s a bit better now, but don’t assume you can “zfs destroy” without killing performance for everything else.
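On pools with the async_destroy feature the space is reclaimed in the background, and a rough way to watch that progress (dataset name is an example) is:

  # kick off the destroy, then watch how much space is still being freed
  zfs destroy tank1/old-data
  zpool get freeing tank1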
If you have specific questions, I’m happy to help, but I think most of the advice I can offer is going to be workload specific. If I had to do it all over again, I’d probably break things down into many smaller servers rather than trying to put as much as possible onto one.