Re: Suggestion for hardware for ZFS fileserver
Peter Eriksson
peter at ifm.liu.se
Sun Dec 23 01:52:01 UTC 2018
Can’t really give you generic recommendations, but on our Dell R730xd and R740xd servers we use the Dell HBA330 SAS HBA card, also known as the “Dell Storage Controller 12Gb SAS HBA”, which uses the “mpr” device driver. This is an LSI 3008-based controller and it works really well. We only use it for internal drives on those Dell servers (the 730xd and 740xd) though. Beware that it is not the same as the “H330” RAID controller that Dell normally sells you.

We had to do a “special” in order to get the 10TB drives with 4K sectors together with the HBA330 controller, since at the time we bought them Dell would only sell us the 10TB drives together with the H330 controller. So we bought the HBA330s separately and swapped them in ourselves... And then we had to do a low-level reformat of all the disks, since Dell by default delivers them formatted with a nonstandard sector size (4160 bytes I think, or perhaps 4112) and “Protection Information” enabled (used and understood by the H330 controller, but not by FreeBSD when using HBAs). That’s easy to fix, though (it just takes an hour or so per drive):
# sg_format --size=4096 --fmtpinfo=0 /dev/da0
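(If you want to verify what a drive currently reports, before and after the reformat, something like this should show the logical block length and whether Protection Information is enabled — assuming you have sg3_utils installed from ports; the exact output will vary a bit:)

# sg_readcap --long /dev/da0
# camcontrol readcap da0 -h

The first prints the protection fields (prot_en etc.) along with the block length, the second is just a quick check of the sector size as CAM sees it.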
On our HP servers we use the HP Smart HBA H241 controller in HBA mode (set via the BIOS configuration page), connected to external HP D6030 SAS shelves (70 disks per shelf). This is an HP-specific controller that uses the “ciss” driver. It also works fine.
- Peter
> On 22 Dec 2018, at 15:49, Sami Halabi <sodynet1 at gmail.com> wrote:
>
> Hi,
>
> What SAS HBA card do you recommend for 16/24 internal ports and 2 external ones that is recognized and works well with FreeBSD ZFS?
> Sami
>
> On Sat, 22 Dec 2018 at 2:48, Peter Eriksson <peter at ifm.liu.se> wrote:
>
>
> > > On 22 Dec 2018, at 00:49, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> >
> > Peter Eriksson wrote:
> > [good stuff snipped]
> >> This has caused some interesting problems…
> >>
> > >> First thing we noticed was that booting would take forever… Mounting the 20-100k filesystems _and_ enabling them to be shared via NFS is not done efficiently at all (for each filesystem it re-reads /etc/zfs/exports (a couple of times) before appending one line to the end). Repeat 20-100,000 times… Not to mention the big kernel lock for NFS (“hold all NFS activity while we flush and reinstall all sharing information per filesystem”) being done by mountd…
> > Yes, /etc/exports and mountd were implemented in the 1980s, when a dozen
> > file systems would have been a large server. Scaling to 10,000 or more file
> > systems wasn't even conceivable back then.
>
> Yeah, for a normal user with non-silly amounts of filesystems this is a non-issue. Anyway, it’s the kind of issue I like thinking about how to solve. It’s fun :-)
>
>
> > >> Wish list item #1: A BerkeleyDB-based ’sharetab’ that replaces the horribly slow /etc/zfs/exports text file.
> > >> Wish list item #2: A reimplementation of mountd and the kernel interface to allow a “diff” of the contents of the DB-based sharetab above to be fed into the kernel, instead of the brute-force way it’s done now.
> > The parser in mountd for /etc/exports is already an ugly beast and I think
> > implementing a "diff" version will be difficult, especially figuring out what needs
> > to be deleted.
>
> Yeah, I tried to decode it (this summer) and I think I sort of got the hang of it eventually.
>
>
> > I do have a couple of questions related to this:
> > 1 - Would your case work if there was an "add these lines to /etc/exports"?
> > (Basically adding entries for file systems, but not trying to delete anything
> > previously exported. I am not a ZFS guy, but I think ZFS just generates another
> > exports file and then gets mountd to export everything again.)
>
> Yeah, the ZFS library that the zfs commands use just reads and updates the separate /etc/zfs/exports text file (and has mountd read both /etc/exports and /etc/zfs/exports). The problem is that what it basically does when you tell it to “zfs mount -a” (mount all filesystems in all zpools) is a big (pseudocode):
>
> For P in ZPOOLS; do
>     For Z in ZFILESYSTEMS-AND-SNAPSHOTS in $P; do
>         Mount $Z
>         If $Z has the “sharenfs” option; Then
>             Open /etc/zfs/exports
>             Read until a matching line is found and replace its options;
>                 if none is found, append the options at the end
>             Close /etc/zfs/exports
>             Signal mountd
>             (which then opens /etc/exports and /etc/zfs/exports and does its magic)
>         End
>     End
> End
>
> All of this is wrapped up in a Solaris compatibility layer in libzfs. Actually, I think it even reads the /etc/zfs/exports file twice for each loop iteration due to some abstractions. Btw, things got really “fun” when the hourly snapshots we were taking (adding 10-20k new snapshots every hour, and we didn’t expire them fast enough in the beginning) triggered the code above and it took longer than 1 hour to execute (mountd was 100% busy getting signalled and re-reading, flushing and reinstalling exports into the kernel all the time) and basically never finished. Luckily we didn’t have any NFS clients accessing the servers at that time :-)
>
> This summer I wrote some code that uses a Btree BerkeleyDB file instead, and modified the libzfs code and the mountd daemon to use that database for much faster lookups (no need to read the whole /etc/zfs/exports file all the time) and additions. It worked pretty well actually, and wasn’t that hard to add. I also wanted to add the possibility of specifying “exports” arguments Solaris-style, so one could say things like:
>
> /export/staff   vers=4,sec=krb5:krb5i:krb5p,rw=130.236.0.0/16,sec=sys,ro=130.236.160.0/24:10.1.2.3
>
> But I never finished that (solaris-style exports options) part….
>
> We’ve lately been toying with putting the NFS sharing stuff into a separate “private” ZFS attribute (separate from the official “sharenfs” one) and having another tool read that instead and generate a separate “exports” file, so that the file can be generated in one go and mountd only needs to be signalled once, after all filesystems have been mounted. Unfortunately that would mean that nothing is shared until all of them have been mounted, but we think it would take less time all in all.
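> A rough sketch of what that tool could look like (the property name “liu:sharenfs” and the pool name are just examples; any user property with a colon in the name works):
>
>     #!/bin/sh
>     # Collect the private share property from every filesystem in one pass,
>     # build a single exports file, and poke mountd once.
>     zfs list -r -H -t filesystem -o mountpoint,liu:sharenfs tank |
>         awk -F'\t' '$2 != "-" && $1 !~ /^(none|legacy)$/ { print $1 " " $2 }' \
>             > /etc/exports.zfs.new
>     mv /etc/exports.zfs.new /etc/exports.zfs
>     kill -HUP $(cat /var/run/mountd.pid)
>
> (mountd would then be started with that file listed as an extra exports file.)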
>
> We also modified the FreeBSD boot scripts to make sure we first mount the most important ZFS filesystems that are needed on the boot disks (not just /), and then mount (and share via NFS) the rest in the background, so we can log in to the machine as root early (no need for everything to have been mounted before we get a login prompt).
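> Conceptually it amounts to something like this in the rc script (the dataset names are made up for the example):
>
>     # mount the handful of filesystems the system itself needs, synchronously
>     zfs mount zroot/var/log
>     zfs mount zroot/usr/local
>     # then mount and NFS-share everything else in the background
>     ( zfs mount -a && zfs share -a ) &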
>
> (Right now a reboot of the bigger servers takes an hour or two before all filesystems are mounted and exported.)
>
>
> > 2 - Are all (or maybe most) of these ZFS file systems exported with the same
> > arguments?
> > - Here I am thinking that a "default-for-all-ZFS-filesystems" line could be
> > put in /etc/exports that would apply to all ZFS file systems not exported
> > by explicit lines in the exports file(s).
> > This would be fairly easy to implement and would avoid trying to handle
> > 1000s of entries.
>
> For us most filesystems have exactly the same exports arguments. We set the options on the top-level filesystems (/export/staff, /export/students etc.) and then all the home dirs inherit those.
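> I.e. roughly this (dataset names made up, and I’m going from memory on the exports-style flags):
>
>     # zfs set sharenfs='-sec=krb5:krb5i:krb5p -network 130.236.0.0/16' tank/export/staff
>     # zfs get -r -o name,value,source sharenfs tank/export/staff
>
> and all the child datasets (the individual home directories) then show the property with source “inherited from tank/export/staff”.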
>
> > In particular, #2 above could be easily implemented on top of what is already
> > there, using a new type of line in /etc/exports and handling that as a special
> > case by the NFS server code, when no specific export for the file system to the
> > client is found.
> >
> > >> (I’ve written some code that implements item #1 above and it helps quite a bit. Nothing near production quality yet though. I have looked at item #2 a bit too but not done anything about it.)
> > [more good stuff snipped]
> > Btw, although I put the questions here, I think a separate thread discussing
> > how to scale to 10000+ file systems might be useful. (On freebsd-fs@ or
> > freebsd-current@. The latter sometimes gets the attention of more developers.)
>
> Yeah, probably a good idea!
>
> - Peter
>
> > rick
> >
> >
>
> _______________________________________________
> freebsd-fs at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"