HAST + ZFS + NFS + CARP
Julien Cigar
julien at perdition.city
Thu Aug 11 09:10:22 UTC 2016
On Thu, Aug 11, 2016 at 10:11:15AM +0200, Borja Marcos wrote:
>
> > On 04 Jul 2016, at 21:31, Julien Cigar <julien at perdition.city> wrote:
> >
> >> To get specific again, I am not sure I would do what you are contemplating given your circumstances since it’s not the cheapest / simplest solution. The cheapest / simplest solution would be to create 2 small ZFS servers and simply do zfs snapshot replication between them at periodic intervals, so you have a backup copy of the data for maximum safety as well as a physically separate server in case one goes down hard. Disk storage is the cheap part now, particularly if you have data redundancy and can therefore use inexpensive disks, and ZFS replication is certainly “good enough” for disaster recovery. As others have said, adding additional layers will only increase the overall fragility of the solution, and “fragile” is kind of the last thing you need when you’re frantically trying to deal with a server that has gone down for what could be any number of reasons.
> >>
> >> I, for example, use a pair of FreeNAS Minis at home to store all my media and they work fine at minimal cost. I use one as the primary server that talks to all of the VMWare / Plex / iTunes server applications (and serves as a backup device for all my iDevices) and it replicates the entire pool to another secondary server that can be pushed into service as the primary if the first one loses a power supply / catches fire / loses more than 1 drive at a time / etc. Since I have a backup, I can also just use RAIDZ1 for the 4x4Tb drive configuration on the primary and get a good storage / redundancy ratio (I can lose a single drive without data loss but am also not wasting a lot of storage on parity).
> >
> > You're right, I'll definitely reconsider the zfs send / zfs receive
> > approach.
>
> Sorry to be so late to the party.
>
> Unless you have a *hard* requirement for synchronous replication, I would avoid it like the plague. Synchronous replication sounds sexy, but it
> has several disadvantages: complexity, and if you wish to keep an off-site replica it will definitely impact performance, since distance
> increases delay.
>
> Asynchronous replication with ZFS has several advantages, however.
>
> First and foremost: the snapshot-replicate approach is a terrific short-term “backup” solution that will allow you to recover quickly from some
> all-too-frequent incidents, like your own software corrupting data. A ZFS snapshot is trivial to roll back and it won’t involve a costly “backup
> recovery” procedure. You can do both replication *and* keep some snapshot retention policy à la Apple’s Time Machine.
>
> Second: I mentioned distance when keeping off-site replicas, as distance necessarily increases delay. Asynchronous replication doesn’t have that problem.
>
> Third: With some care you can do one-to-N replication, even with different replication frequencies.
>
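Indeed. Just to illustrate, a stripped-down version of that snapshot +
incremental send approach could look like this (dataset / host names are
placeholders; there is no locking, error handling or retention logic
here, which is what tools like zrep add on top):

#!/bin/sh
# naive incremental ZFS replication, run from cron on the master
# (the very first run needs an initial full send instead)
FS=tank/data
REMOTE=backup-host
NEW=repl-$(date +%Y%m%d%H%M%S)
# most recent snapshot of $FS, assumed to exist on the remote side too
LAST=$(zfs list -H -d 1 -t snapshot -o name -s creation ${FS} | tail -1 | cut -d@ -f2)
zfs snapshot ${FS}@${NEW}
zfs send -i ${FS}@${LAST} ${FS}@${NEW} | ssh ${REMOTE} zfs receive -F ${FS}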
> Several years ago, in 2009 I think, I set up a system that worked quite well. It was based on NFS and ZFS. The requirements were a bit particular,
> which in this case greatly simplified it for me.
>
> I had a farm of front-end web servers (running Apache) that took all of the content from an NFS server. The NFS server used ZFS as the file system. This might not be useful for everyone, but in this case the web servers were CPU bound due to plenty of PHP crap. As the front-ends weren’t supposed to write to the file server (and indeed it was undesirable for security reasons) I could afford to export the NFS file systems in read-only mode.
>
> The server was replicated to a sibling at 1 or 2 minute intervals, I don’t remember which. And the interesting part was this: I used Heartbeat to decide which of the servers was the master. When Heartbeat decided which one was the master, a specific IP address was assigned to it and the NFS service was started, so the front-ends would happily mount it.
>
> What happened in case of a server failure?
>
> Heartbeat would detect it in a minute more or less. Assuming a master failure, the former slave would become master, assigning itself the NFS
> server IP address and starting up NFS. Meanwhile, the front-ends had a silly script running at 1 minute intervals that simply read a file from the
> NFS-mounted filesystem. In case there was a read error it would force an unmount of the NFS share and enter a loop trying to mount it again until it succeeded.
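(That "silly script" approach is easy enough to reproduce; a sketch of
such a watchdog, with an invented mount point and test file, run every
minute from cron on each front-end:)

#!/bin/sh
# remount the NFS share if it stops answering
# assumes an fstab entry for $MNT
MNT=/nfs/webroot
CANARY=${MNT}/.alive
# timeout(1) keeps the check from hanging forever on a dead hard mount
if ! timeout 10 cat ${CANARY} > /dev/null 2>&1; then
    umount -f ${MNT}
    # loop until the (possibly new) NFS server answers again
    until mount ${MNT}; do
        sleep 5
    done
fi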
>
> It looks kludgy, but that means that in case of a server loss (ZFS on FreeBSD wasn’t that stable at the time and we suffered a couple of them) the website was titsup for maybe two minutes, recovering automatically. It worked.
>
> Both NFS servers were in the same datacenter, but I could have added geographical dispersion by using BGP to announce the NFS IP address to our routers.
>
> There are better solutions, but this one involved no fancy software licenses, no expensive hardware, and it was quite reliable. The only problem we had (maybe I was just too daring) was that we were bitten by a ZFS deadlock bug several times. But it worked anyway.
>
>
As I said in a previous post I tested the zfs send/receive approach (with
zrep) and it works (more or less) perfectly, so I concur with everything
you said, especially about off-site replication and synchronous
replication.
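For the record, the zrep side of it boils down to very little (dataset /
host names below are placeholders):

# one-time setup on the master, does the initial full send
zrep init tank/data nfs-backup tank/data

# then from cron, every minute or so
zrep sync tank/data

# planned role switch, run on the current master
zrep failover tank/data

# unplanned: the master is dead, promote the backup
zrep takeover tank/data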
Out of curiosity I'm also testing a ZFS + iSCSI + CARP setup at the
moment. I'm still in the early tests and haven't done any heavy writes
yet, but ATM it works as expected: I haven't managed to corrupt the
zpool.
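For the sake of illustration (details simplified, device names, IQN and
addresses invented), such a setup could look like this: each box exports
a local disk with ctld, the other box attaches it with iscsictl, and the
zpool is a mirror of the local disk and the iSCSI one, imported only on
the current MASTER.

# /etc/ctl.conf on each box
portal-group pg0 {
        discovery-auth-group no-authentication
        listen 192.168.10.11
}
target iqn.2016-08.city.perdition:disk1 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/da1
        }
}

# on the peer: attach the exported disk and build the mirrored pool
iscsictl -A -p 192.168.10.11 -t iqn.2016-08.city.perdition:disk1
zpool create tank mirror da1 da2        # local disk + iSCSI disk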
I think that with the following assumptions the failover from MASTER
(old master) -> BACKUP (new master) can be done quite safely (the
opposite *MUST* always be done manually IMHO):
1) Don't mount the zpool at boot
2) Ensure that the failover script is not executed at boot
3) Once the failover script has been executed and the BACKUP is the new
MASTER, assume that it will remain so unless changed manually
This is to avoid the case of a catastrophic power loss in the DC and a
possible split-brain scenario when both machines go off and come back on
simultaneously.
2) is especially important with a CARPed interface, where the state can
sometimes flip from BACKUP -> MASTER -> BACKUP at boot.
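One way to implement 2) (a sketch; vhid, interface and paths are
placeholders) is to hook the script on the CARP state change through
devd and have the script bail out during the first minutes of uptime:

# /etc/devd.conf
notify 30 {
        match "system"          "CARP";
        match "subsystem"       "54@bge0";
        match "type"            "MASTER";
        action "/usr/local/sbin/failover.sh";
};

#!/bin/sh
# top of /usr/local/sbin/failover.sh: ignore CARP transitions during
# the first 10 minutes after boot, to avoid acting on the
# BACKUP -> MASTER -> BACKUP flapping at boot time
boot=$(sysctl -n kern.boottime | sed 's/.*sec = \([0-9]*\),.*/\1/')
if [ $(( $(date +%s) - boot )) -lt 600 ]; then
        exit 0
fi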
For 3) you must adapt the advskew of the CARPed interface, so that even
if the BACKUP (now master) has an unplanned shutdown/reboot the old
MASTER (now backup) doesn't take over unless done manually. So you should
do something like:
sysrc ifconfig_bge0_alias0="vhid 54 advskew 10 pass xxx alias 192.168.10.15/32"
ifconfig bge0 vhid 54 advskew 10
in the failover script (where the "new" advskew (10) is smaller than
the advskew of the old MASTER, now backup).
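Put together (with the boot guard from above), a very simplified
failover script could look like this (pool name, vhid, addresses and
the NFS part are placeholders, no error handling):

#!/bin/sh
# /usr/local/sbin/failover.sh: promote this box (CARP BACKUP) to MASTER
POOL=tank
IF=bge0
VHID=54

# make the lower advskew persistent and apply it, so that the old
# MASTER won't take over again when it comes back (point 3 above)
sysrc ifconfig_${IF}_alias0="vhid ${VHID} advskew 10 pass xxx alias 192.168.10.15/32"
ifconfig ${IF} vhid ${VHID} advskew 10

# import the pool; -f because the old MASTER had no chance to export it
zpool import -f ${POOL}

# (re)start the NFS services now that the data is available
service mountd onerestart
service nfsd onerestart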
The failover should mostly be needed for unplanned events, but if you
reboot the MASTER for some reason (freebsd-update, etc.) the failover
script on the BACKUP should handle that case as well.
(more soon...)
Julien
>
>
> Borja.
>
>
>
--
Julien Cigar
Belgian Biodiversity Platform (http://www.biodiversity.be)
PGP fingerprint: EEF9 F697 4B68 D275 7B11 6A25 B2BB 3710 A204 23C0
No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.