Large ZFS arrays?
Graham Allan
allan at physics.umn.edu
Fri Jun 20 15:25:54 UTC 2014
On 6/15/2014 10:28 AM, Dennis Glatting wrote:
> Anyone built a large ZFS infrastructures (PB size) and care to share
> words of wisdom?
This is a bit of a late response but I wanted to put in our "me too"
before I forget...
We have about 500TB of storage on ZFS at present, and plan to add 600TB
more later this summer, mostly in similar arrangements to what I've seen
discussed already - using Supermicro 847 JBOD chassis and a mixture of
Dell R710/R720 head nodes, with LSI 9200-8e HBAs. One R720 has four 847
chassis attached, while a couple of R710s have just a single chassis
each. We originally installed one HBA in the R720 for each chassis,
but hit some deadlock problems at one point, which were resolved by
daisy-chaining the chassis from a single HBA. I had a feeling it might
be related to kern/177536, but I'm not really sure.
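In case it's useful to anyone chasing similar cabling issues, this is
roughly how we sanity-check what the HBA sees after recabling (device
names here are just examples, not our actual layout):

  # list all disks and SES enclosure devices visible through the HBAs
  camcontrol devlist
  # dump the SES enclosure status page for the first 847 chassis
  sg_ses --page=es /dev/ses0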
We've been running FreeBSD 9.1 on all the production nodes, though I've
long wanted to (and am now beginning to) set up a reasonable long-term
testing box where we can try out some of the kernel patches or tuning
suggestions that come up. We're also beginning to test the 9.3 release
for the next set of servers.
We built all these conservatively, with each chassis as a separate
pool, each having four 10-drive raidz2 vdevs, a couple of spares, a
cheapish L2ARC SSD, and a mirrored pair of ZIL SSDs (maybe it's
unnecessary to mirror these days?). I was using the Intel 24GB SLC
drive for the ZIL; I'll need to choose something new for future pools.
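For reference, each pool ends up looking roughly like this - a minimal
sketch with made-up pool and da device names; in practice we use the
per-slot labels described further down:

  # hypothetical names, shown only to illustrate the vdev layout
  zpool create jbod1 \
    raidz2 da0  da1  da2  da3  da4  da5  da6  da7  da8  da9 \
    raidz2 da10 da11 da12 da13 da14 da15 da16 da17 da18 da19 \
    raidz2 da20 da21 da22 da23 da24 da25 da26 da27 da28 da29 \
    raidz2 da30 da31 da32 da33 da34 da35 da36 da37 da38 da39 \
    spare  da40 da41 \
    log    mirror da42 da43 \
    cache  da44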
It would be interesting to hear a little about experiences with the
drives used... For our first "experimental" chassis we used 3TB Seagate
desktop drives - cheap, but not the best choice: 18 months later they
are dropping like flies (luckily we can risk some cheapness here, as
most of our data can be re-transferred from other sites if needed).
Another chassis has 2TB WD RE4 enterprise drives (no problems), and
four others have 3TB and 4TB WD "Red" NAS drives... another "slightly
risky" selection, but so far they have been very solid (in some casual
discussion, a WD field engineer also seemed to feel these would be fine
for both ZFS and Hadoop use).
Tracking drives for failures and replacements was a big issue for us.
One of my co-workers wrote a nice Perl script which periodically
harvests all the data from the chassis (via sg3_utils) and stores the
mappings of chassis slots, da devices, drive labels, etc. in a
database. It also understands the layout of the 847 chassis and labels
the drives for us according to some rules we made up: a prefix for the
pool name, then "f" or "b" for the front/back of the chassis, then the
slot number. Finally, it has some controls to turn the chassis drive
identify lights on or off. There might be other ways to do all this,
but we didn't find any, so it's been incredibly useful for us.
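For anyone wondering, the commands the script wraps are roughly along
these lines (the slot numbers, pool name, and device names here are
just examples):

  # read the element descriptor page to map enclosure bays to slots
  sg_ses --page=ed /dev/ses0
  # label the drive found in (say) front slot 5 of the "jbod1" chassis
  glabel label jbod1f05 /dev/da5
  # blink, then clear, the identify LED on that enclosure slot
  sg_ses --index=5 --set=ident /dev/ses0
  sg_ses --index=5 --clear=ident /dev/ses0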
As far as performance goes, we've been pretty happy. Some of these
servers get hammered fairly hard by NFS I/O from cluster compute jobs
(maybe ~1200 processes on 100 nodes) and have held up much better than
our RHEL NFS servers using Fibre Channel RAID storage. We've also
performed a few bulk transfers between Hadoop and ZFS (using distcp
with an NFS destination) and saw sustained 5Gbps write speeds, which
really surprised me.
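The distcp runs were nothing fancy - roughly the following, run from
the Hadoop cluster with the ZFS server's export NFS-mounted on every
compute node (hostnames and paths here are made up):

  hadoop distcp hdfs://namenode:8020/data/experiment1 \
      file:///mnt/zfs01/experiment1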
I think that's all I've got for now.
Graham
--
-------------------------------------------------------------------------
Graham Allan
School of Physics and Astronomy - University of Minnesota
-------------------------------------------------------------------------