Large ZFS arrays?

Graham Allan allan at physics.umn.edu
Fri Jun 20 15:25:54 UTC 2014


On 6/15/2014 10:28 AM, Dennis Glatting wrote:
> Anyone built a large ZFS infrastructure (PB size) and care to share
> words of wisdom?

This is a bit of a late response but I wanted to put in our "me too" 
before I forget...

We have about 500TB of storage on ZFS at present, and plan to add 600TB 
more later this summer, mostly in arrangements similar to what I've seen 
discussed already - Supermicro 847 JBOD chassis and a mixture of 
Dell R710/R720 head nodes with LSI 9200-8e HBAs. One R720 has four 847 
chassis attached; a couple of R710s just have a single chassis each. We 
originally installed one HBA in the R720 for each chassis, but hit some 
deadlock problems at one point, which were resolved by daisy-chaining the 
chassis from a single HBA. I had a feeling it was maybe related to 
kern/177536, but I'm not really sure.

We've been running FreeBSD 9.1 on all the production nodes, though I've 
long wanted to (and am now beginning to) set up a reasonable long-term 
testing box where we can check out some of the kernel patches or 
tuning suggestions that come up. I'm also beginning to test the 9.3 
release for the next set of servers.

We built all of these conservatively, with each chassis as a separate pool, 
each having four 10-drive raidz2 vdevs, a couple of spares, a cheapish 
L2ARC SSD, and a mirrored pair of ZIL SSDs (maybe unnecessary to mirror 
these days?). I was using the Intel 24GB SLC drive for the ZIL; I'll 
need to choose something new for future pools.
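
In case it helps picture the layout, here's a rough sketch of how one of 
these chassis pools gets created. The pool name, drive labels, slot split, 
and SSD device nodes are hypothetical stand-ins - our real labels come from 
the slot-mapping script I mention further down:

  # One 847 chassis: four 10-drive raidz2 vdevs, two spares, mirrored
  # SLOG, and a single L2ARC device (names made up for illustration).
  zpool create pool1 \
      raidz2 label/p1-f01 label/p1-f02 label/p1-f03 label/p1-f04 label/p1-f05 \
             label/p1-f06 label/p1-f07 label/p1-f08 label/p1-f09 label/p1-f10 \
      raidz2 label/p1-f11 label/p1-f12 label/p1-f13 label/p1-f14 label/p1-f15 \
             label/p1-f16 label/p1-f17 label/p1-f18 label/p1-f19 label/p1-f20 \
      raidz2 label/p1-f21 label/p1-f22 label/p1-f23 label/p1-f24 label/p1-b01 \
             label/p1-b02 label/p1-b03 label/p1-b04 label/p1-b05 label/p1-b06 \
      raidz2 label/p1-b07 label/p1-b08 label/p1-b09 label/p1-b10 label/p1-b11 \
             label/p1-b12 label/p1-b13 label/p1-b14 label/p1-b15 label/p1-b16 \
      spare  label/p1-b17 label/p1-b18 \
      log    mirror gpt/p1-slog0 gpt/p1-slog1 \
      cache  gpt/p1-l2arc0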

It would be interesting to hear a little about experiences with the drives 
used... For our first "experimental" chassis we used 3TB Seagate desktop 
drives - cheap, but not the best choice: 18 months later they are 
dropping like flies (luckily we can risk some cheapness here, as most of 
our data can be re-transferred from other sites if needed). Another 
chassis has 2TB WD RE4 enterprise drives (no problems), and four others 
have 3TB and 4TB WD "Red" NAS drives... another "slightly risky" 
selection, but so far they have been very solid (in some casual 
discussion, a WD field engineer also seemed to feel these would be 
fine for both ZFS and Hadoop use).

Tracking drives for failures and replacements was a big issue for us. 
One of my co-workers wrote a nice Perl script which periodically 
harvests all the data from the chassis (via sg3_utils) and stores the 
mappings of chassis slots, da devices, drive labels, etc. in a 
database. It also understands the layout of the 847 chassis and labels 
the drives for us according to some rules we made up: a prefix for the 
pool name, then "f" or "b" for the front/back of the chassis, then the 
slot number. Finally, it has some controls to turn the chassis drive 
identify lights on or off. There might be other ways to do all this, but 
we didn't find any, so it's been incredibly useful for us.
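
To give a flavor of the sg3_utils plumbing the script wraps - the enclosure 
device node, element index, disk, and label below are made up for 
illustration, and glabel is just one way of attaching the labels:

  # Dump the enclosure's Additional Element Status page, which maps each
  # slot to a SAS address that can be matched against the da devices
  # (on FreeBSD the enclosure shows up as a ses/pass device; adjust the
  # node to suit):
  sg_ses --page=aes /dev/ses0

  # Toggle the identify/locate LED for, say, element index 7:
  sg_ses --index=7 --set=ident /dev/ses0
  sg_ses --index=7 --clear=ident /dev/ses0

  # Attach a human-readable label to the matching disk:
  glabel label p1-f07 /dev/da7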

As far as performance goes, we've been pretty happy. Some of these get 
hammered fairly hard by NFS I/O from cluster compute jobs (maybe ~1200 
processes on 100 nodes) and they have held up much better than our RHEL 
NFS servers using Fibre Channel RAID storage. We've also performed a few 
bulk transfers between Hadoop and ZFS (using distcp with an NFS 
destination) and saw sustained 5Gbps write speeds (which really 
surprised me).
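
For anyone curious, those transfers were nothing fancier than pointing 
distcp at a file:// destination on an NFS mount of the ZFS server. The 
names and paths below are made up, and the mount does need to be present 
at the same path on every node running map tasks:

  # Copy a directory tree out of HDFS onto the ZFS server through its
  # NFS export, mounted at /mnt/zfs-scratch on all the compute nodes:
  hadoop distcp hdfs://namenode:8020/data/experiment1 \
      file:///mnt/zfs-scratch/experiment1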

I think that's all I've got for now.

Graham
-- 
-------------------------------------------------------------------------
Graham Allan
School of Physics and Astronomy - University of Minnesota
-------------------------------------------------------------------------

