Re: Does a failed separate ZIL disk mean the entire zpool is lost?

From: andy thomas <andy_at_time-domain.co.uk>
Date: Tue, 10 Sep 2024 10:35:25 UTC
Thank you, but I'm afraid I didn't use two mirrored ZIL devices: I didn't
know this was possible when I set this server up (in late 2017, before I
was even aware of the 'FreeBSD Mastery: ZFS' book!). There were also no
spare disk bays in the server's chassis for another device, and
PCIe-to-NVMe adapters were not available at the time. For data resilience
I relied on an identical mirror server in the same rack, linked via a
2 x 10 Gbit/s bonded point-to-point network link, but this server also
failed in the data centre melt-down...

It looks like the data is now lost, so I won't waste any more time trying
to recover it. Hopefully this incident will persuade my employer to heed
the advice given years ago about locating mirror servers in a different
data centre, linked by a fast multi-gigabit connection.

Andy

PS: the ZFS and Advanced ZFS books are truly excellent, by the way!

On Mon, 9 Sep 2024, Allan Jude wrote:

> As the last person mentioned, you should be able to import with the -m flag, 
> and only lose about 5 seconds worth of writes.
>
> The pool is already partially imported at boot by other mechanisms; you
> might need to disable that to prevent the partial import at boot, so that
> you can do the manual import.
>
> On 2024-09-09 12:20 p.m., infoomatic wrote:
>> did you use two mirrored ZIL devices?
>> 
>> You can "zpool import -m", but you will probably be confronted with some
>> errors - you will probably lose the data the ZIL has not committed, but
>> most of your data in your pool should be there
>> 
>> 
>> On 09.09.24 17:51, andy thomas wrote:
>>> A server I look after had a 65TB ZFS RAIDz1 pool with 8 x 8TB hard disks
>>> plus one hot spare and separate ZFS intent log (ZIL) and L2ARC cache
>>> disks that used a pair of 256GB SSDs. This ran really well for 6 years
>>> until 2 weeks ago, when the main cooling system in the data centre where
>>> it was installed failed and the backup cooling system failed to start up.
>>> 
>>> The upshot was the ZIL SSD went short-circuit across its power
>>> connector, shorting out the server's PSUs and shutting down the server.
>>> After replacing the failed SSD and verifying all the spinning hard disks
>>> and the cache SSD are undamaged, attempts to import the pool fail with
>>> the following message:
>>> 
>>> NAME      SIZE  ALLOC  FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH   ALTROOT
>>> clustor2     -      -     -        -         -     -    -      -  UNAVAIL  -
>>> 
>>> Does this mean the pool's contents are now lost and unrecoverable?
>>> 
>>> Andy
>>> 
>> 
>
>


----------------------------
Andy Thomas,
Time Domain Systems

Tel: +44 (0)7866 556626
http://www.time-domain.co.uk