Re: zfs replication tool

From: Paul Mather <paul_at_gromit.dlib.vt.edu>
Date: Fri, 23 Sep 2022 13:48:31 UTC
On Sep 20, 2022, at 8:20 AM, Julien Cigar <julien@perdition.city> wrote:

> On Tue, Sep 20, 2022 at 11:29:05AM +0200, Julien Cigar wrote:
>> On Fri, Sep 16, 2022 at 04:02:36PM +0200, Julien Cigar wrote:
>>> On Fri, Sep 16, 2022 at 09:56:36AM -0400, mike tancsa wrote:
>>>> On 9/16/2022 9:49 AM, Julien Cigar wrote:
>>>>> sysutils/zrepl works really well for me.
>>>>>> Check out the filter syntax to see if it meets your requirements
>>>>>> 
>>>>>> https://zrepl.github.io/configuration/filter_syntax.html
>>>>>> 
>>>>>>     ---Mike
>>>>> thanks, I used zrepl in the past and I experienced some deadlocks and
>>>>> crashes, which is why I switched to sanoid (which doesn't support
>>>>> recursion without zfs snapshot -r)
>>>> 
>>>> Those deadlocks / crashes (if they are the ones I was thinking about) were
>>>> FreeBSD bugs in the end
>>>> 
>>>> https://github.com/freebsd/freebsd-src/commit/1820ca2154611d6f27ce5a5fdd561a16ac54fdd8
>>>> 
>>>> https://github.com/zrepl/zrepl/issues/411#issuecomment-821878812
>>>> 
>>>> It's been rock solid for me since those commits / fixes
>>> 
>>> ok, I'll give zrepl another chance :) thanks for pointing this!
>> 
>> it looks like zrepl snapshots aren't atomic across datasets either. I'm
>> testing on a local "test" machine and it gives me https://gist.github.com/silenius/b8aaf68dae5c941397df44184cd33d7b
> 
> also the thing I don't like with zrepl is that snapshot management and
> replication are tightly coupled. It looks like replicating a host "A" to
> "B" and "C" (classical local and off-site backup) is not possible
> without dirty hacks and race conditions ...


I like zrepl on the whole and use it for daily replication, but it currently has some annoying quirks and limitations that I wish could be addressed:

1) Although you can specify a snapshot prefix for pruning purposes, zrepl selects what to replicate by dataset, not by snapshot. I discovered that all snapshots on the selected datasets are replicated, not just the ones you want managed by zrepl. In my case, I also use Tivoli TSM (now Spectrum Protect) to back up a system, and make a snapshot (for consistency) that is then backed up. (The snapshot is deleted after the backup finishes.) I found that zrepl runs were picking up this ephemeral snapshot during the pull job and then running into trouble (with PLANNING-ERRORs) when the snapshot disappeared. My "solution" for now is to run my pull job hourly via cron instead of zrepl's built-in timer, and to have cron skip the job during the backup window so it won't pick up the TSM snapshot. My retention is long enough that zrepl can "catch up" on the period it misses, replicating those snapshots before they would be pruned.

This problem is related to zrepl issue #403 (https://github.com/zrepl/zrepl/issues/403), opened in late 2020 and still unresolved.

2) Related to 1) above, replicated boot environments cause problems when I delete them (which is usually after I've successfully upgraded). Deleting one leaves a dangling snapshot hold on the receiver side, which I need to clean up manually.
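
As a rough illustration of the manual cleanup in 2), the sketch below parses the tab-separated output of `zfs holds -H` and builds the matching `zfs release` commands. The dataset names, the hold tags, and the `zrepl_` tag prefix are assumptions made up for the example; check what `zfs holds` actually reports on your receiver before releasing anything.

```python
import shlex


def parse_holds(holds_output):
    """Parse `zfs holds -H` output (tab-separated: snapshot, tag, timestamp)."""
    holds = []
    for line in holds_output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a holds line; skip rather than guess
        holds.append((fields[0], fields[1]))
    return holds


def release_commands(holds_output, tag_prefix="zrepl_"):
    """Build one `zfs release` command per hold whose tag starts with tag_prefix.

    The default prefix is an assumption, not something stated in this thread.
    """
    return [
        ["zfs", "release", tag, snapshot]
        for snapshot, tag in parse_holds(holds_output)
        if tag.startswith(tag_prefix)
    ]


if __name__ == "__main__":
    # Hypothetical sample; real input would come from `zfs holds -H <snapshot>`.
    sample = (
        "backup/hostA/ROOT/old-be@snap1\tzrepl_example_hold\tTue Sep 20 12:00 2022\n"
        "backup/hostA/ROOT/old-be@snap2\tkeep\tMon Sep 19 09:00 2022\n"
    )
    # Print the commands for review instead of running them.
    for cmd in release_commands(sample):
        print(shlex.join(cmd))
```

The script only prints the commands, so nothing is released until you have looked them over and run them yourself.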

Maybe I'm not understanding or configuring zrepl correctly, but it does seem from Issue #403 that zrepl's promiscuous replication of all snapshots is indeed a thing and can lead to problems.
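
For anyone wanting to try the cron workaround from 1), it could look roughly like the sketch below: a small script that cron runs hourly, which does nothing during the backup window and otherwise triggers the pull job with `zrepl signal wakeup`. The window (01:00 to 03:00) and the job name `pull_hostA` are made-up values, and the sketch assumes the job is configured so that zrepl's own timer does not also start it.

```python
import subprocess
import sys
from datetime import datetime, time

# Hypothetical values; adjust to your own backup window and zrepl job name.
BACKUP_START = time(1, 0)
BACKUP_END = time(3, 0)
JOB_NAME = "pull_hostA"


def in_window(now, start, end):
    """Return True if `now` is in [start, end); handles windows crossing midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def main():
    # Skip this run while the TSM backup (and its ephemeral snapshot) may exist.
    if in_window(datetime.now().time(), BACKUP_START, BACKUP_END):
        return 0
    # Ask the running zrepl daemon to start the pull job now.
    return subprocess.run(["zrepl", "signal", "wakeup", JOB_NAME]).returncode


if __name__ == "__main__":
    sys.exit(main())
```

Keeping the window check in the script rather than in the crontab schedule means the crontab entry stays a plain hourly job, and a window that crosses midnight needs no special handling there.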

Cheers,

Paul.