Reading a corrupted file on ZFS
Karl Denninger
karl at denninger.net
Fri Feb 12 19:49:05 UTC 2021
On 2/12/2021 13:26, Artem Kuchin wrote:
> 12.02.2021 19:37, Karl Denninger wrote:
>> On 2/12/2021 11:22, Artem Kuchin wrote:
>>>
>>> This is frustrating. why..why..
>>
>> You created a synthetic situation that in the real world almost-never
>> exists (ONE byte modified in all copies in the same allocation block
>> but all other data in that block is intact and recoverable.)
>>
> It could be a 1 GB file on ZFS with a block size of 1 MB and with rotten
> bits within the same 1 MB block on different disks. How I did it is
> not important; life is unpredictable, and I'm not trying to avoid
> everything. The question is what to do when it happens. And currently
> the answer is: nothing.
>
The answer to a problem that, statistically speaking, does not present
itself in the real world is the last one you worry about. Worry about
all the other ones and cover them first.
I have had literally *hundreds* of drives fail in my time in IT. I used
to go change them in office machines (IBM PCs and clones -- the original
ones) 2-3 times a week at one of our larger customers. This was in the
days of 2,7 RLL encoding if you were lucky, for capacity reasons (MFM if
not)..... yeah, that far back and before. Of course when you have a
building full of these things, as they did, they break on a fairly regular
basis, and hopefully you've thought out the implications of that and how
to recover from it.
The *least* disruptive failures were sector-sized, and occasionally I'd
get the frantic phone call from some place that had no backups and was
utterly desperate because something like their master inventory file was
on that disk. There were times I was able to recover it with a lot of
work and luck, but when it was truly hosed and no amount of being willing
to wait would get you one good read, I was *never* able to narrow the
data loss down to less than one sector, irrespective of what the customer
was willing to pay. Ever. It sure wasn't cheap having actual recovery
done either (the clean-room style places can do it too, but the same issue
applies there; gone is gone, like it or not.)
Nowadays modern drives frequently claim they have 63 (or 255)
sectors/track but that's bollocks; their actual internal organization is
completely opaque beyond the firmware in the drive, in no small part
because a larger circumference can hold more data bits in a given
encoding than a smaller one, so the number of physical sectors on a given
track is not constant. Most modern "larger" disks have several *hundred*
sectors per physical track; in many cases they are physically 4k sectors,
and loss of servo information can result in the entire track being
unrecoverable.
As sizes go up, so does the amount of data at risk from a given event.
It's inherent in complex encoding schemes; the more you encode in a
symbol, the more you lose when the symbol is corrupted to the point it
cannot be recovered.
128 KB is all you lost? How quaint.
>> In almost-all actual cases of "bit rot" it's exactly that; random and
>> by statistics extraordinarily unlikely to hit all copies at once in
>> the same allocation block. Therefore, ZFS can and does fix it; UFS
>> or FAT silently returns the corrupted data, propagates it, and
>> eventually screws you down the road.
>
> In an active fs you are right. But if this is a storage disk with movies
> and photos, then I can just checksum all files with a little script
> and recheck once in a while. So, for storage purposes I have all the
> ZFS positives and also can read as much data as I can. Because for
> long-term storage it is more important to have the ability to read the
> data in any case.
>
I do that now with ZFS. In some cases (long-term online but
nearly read-only "cool" storage that is spun down when not being
accessed) I have backups that are only mounted and verified once a
year. Mount the backup volumes and scrub them, for this specific reason
-- to detect the very unlikely but possible chance that two bits will
get flipped in the identical checksummed domain (ZFS block size) on two
or more copies. If it happens you're hosed, and "cold" storage (disks in
a vault for a year) is where patrol scrubs are most likely to miss it,
because you're not doing them on a regular basis since the disks are
physically in a different place. Those backups are, for all intents and
purposes, a fire insurance policy.
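The once-a-year check itself is nothing exotic; something along these
lines does the job. A rough Python sketch -- the pool name "coldvault"
is made up and error handling is left out:

    #!/usr/bin/env python3
    # Import a cold backup pool, scrub it, report, export it again.
    import subprocess, sys, time

    POOL = "coldvault"   # hypothetical pool name

    def run(*cmd):
        print("+", " ".join(cmd))
        return subprocess.run(cmd, capture_output=True, text=True)

    run("zpool", "import", POOL)            # attach the backup pool
    run("zpool", "scrub", POOL)             # start the patrol read

    # Poll until the scrub finishes.
    while "scrub in progress" in run("zpool", "status", POOL).stdout:
        time.sleep(60)

    health = run("zpool", "status", "-x", POOL).stdout
    print(health)

    run("zpool", "export", POOL)            # detach it again
    sys.exit(0 if "is healthy" in health else 1)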
I've yet to need them but that doesn't mean I'll stop maintaining them.
I've also yet to have one fail verification but I'm sure it'll
eventually happen; that's why THOSE are redundant too.
>>
>> The nearly-every-case situation in the real world where a disk goes
>> physically bad (I've had this happen *dozens* of times over my IT
>> career) results in the drive being unable to
>
> *NEARLY* is not good enough for me.
>
Then make backups and make THOSE ZFS mirrors (which is what I do),
dismount them and put them in a second location. That's what backups
are for, and ZFS makes that pretty easy and efficient with snapshots and
send/receive. ZFS also makes it easy to segment data into "live" (updated
a lot and currently in use), "cool" (updated once in a while) and "online
but cold" (either R/O or close to it), and to move data from one
classification to another (which happens often over time); that in turn
allows you to *easily* build a backup strategy that works for all of it
and yet doesn't involve trying to cart a petabyte around in your car to
and from the bank vault. It also allows a snapshot capacity tailored to
each requirement so an "oops" (as opposed to a hardware failure) can be
trivially recovered from, which is not easy to do with UFS at all.
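The mechanics are one snapshot and one incremental send per dataset. A
bare-bones illustration (the dataset, pool and snapshot names here are
hypothetical):

    #!/usr/bin/env python3
    # Incremental ZFS backup: snapshot, then send only the delta since the
    # previous snapshot into the backup pool.
    import subprocess
    from datetime import date

    SRC  = "tank/photos"           # "cool" dataset on the live pool
    DST  = "backup/photos"         # its copy on the backup pool
    PREV = "backup-2021-01-01"     # last snapshot already on the backup
    SNAP = "backup-%s" % date.today()

    subprocess.run(["zfs", "snapshot", "%s@%s" % (SRC, SNAP)], check=True)

    # "zfs send -i <old> <new> | zfs receive" is the standard idiom.
    send = subprocess.Popen(
        ["zfs", "send", "-i", "%s@%s" % (SRC, PREV), "%s@%s" % (SRC, SNAP)],
        stdout=subprocess.PIPE)
    subprocess.run(["zfs", "receive", "-F", DST], stdin=send.stdout,
                   check=True)
    send.wait()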
If you have a 0.001 (1 in a thousand) risk over some period of time and
have a second copy off-site with the same risk, the odds of both failing
at the same time by anything other than a physical calamity that takes out
both locations are multiplicative. Increase your redundancy as required
until you're comfortable. If "Tsar Bomba" gets dropped on my location,
well..... :-)
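The arithmetic behind that is trivial but worth writing down (0.001 is
just the illustrative figure above):

    # Independent failures multiply.
    p = 0.001
    print(p * p)      # two copies fail together: 1e-06
    print(p ** 3)     # three copies: 1e-09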
I've had the ugly and wildly improbable nightmare scenario happen to me
in real life when I ran MCSNet -- a disk adapter that went insane
internally and scribbled on *every attached device* at once. It
destroyed the data on ALL of them. If I hadn't had a backup I would
have been *done* right there. This was before ZFS, and the pucker factor
on that was considerable.
>> return the block at all;
>
> You are mixing device blocks and ZFS blocks. As far as I remember, the
> default ZFS block for checksumming is 16K, and for big file storage it
> is better to have it around 128K.
No I don't. A block is a block is a block; they're different sizes
depending on the application and on where they sit in the stack of
"things" between the physical disk and the application doing the reading
or writing. In some cases I run small block sizes on ZFS, in others large
block sizes. Depends on the workload. But I have control of that; I have
no control over what the physical media does inside its firmware. In some
cases its idea of a "block" is small, in others it's large, and in still
others (SSDs in particular) how big it is depends on which block. Scramble
an allocation table in an SSD's controller and frequently everything on
the drive is gone *at once.* The scope of damage if you get an uncorrected
error off a given media depends on many things (e.g. the encoding in use,
the media's sector size, whether the error is in the data or in the servo
information on the platter if the device is spinning rust, whether the
error is in a data block or in the allocation/mapping table storage if
it's an SSD, etc.)
>> In short there are very, very few actual "in the wild" failures where
>> one byte is damaged and the rest surrounding that one byte is intact
>> and retrievable. In most cases where an actual failure occurs the
>> unreadable data constitutes *at least* a physical sector.
>>
> "very very few" is enough for me to think about.
Then keep backups and make sure THEY are redundant too. Do patrol scans
(e.g. scrubs) on those on some reasonable basis so you are comfortable
that the backups are good.
>
> One more thing. If you have one bad byte in a block of 16K and you
> have the checksum and recalculate it, then it is quite possible to just
> brute force every byte to match the checksum, thus restoring the data.
True. But as I said, across an unbelievable number of failures that I've
seen in my close to 40 years doing this stuff, I've yet to have *one*
instance in which a media failure took out less than one physical sector
of the device -- and even if it does, there still has to be one more
error of the *same sort* in the same logical block on the other vdev of
the mirror, otherwise it gets detected and fixed with no harm and no
foul. I'm sure it's possible for me to get screwed, but then again it's
also possible for me to get hit by an asteroid while getting my mail --
it's just statistically improbable to the point that I don't worry about
it very much.
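To be fair, for a single bad byte the brute force he describes is
mechanically easy enough. A rough sketch, assuming the checksum is a
separately stored SHA-256 rather than ZFS's internal one:

    import hashlib

    def repair_single_byte(block, want_sha256):
        # Try every value at every offset until the checksum matches again.
        # Worst case 256 * len(block) hashes -- tractable for one bad byte
        # in a 16K block, hopeless once two or more bytes (or the stored
        # checksum itself) are bad.
        buf = bytearray(block)
        for off in range(len(buf)):
            saved = buf[off]
            for val in range(256):
                buf[off] = val
                if hashlib.sha256(buf).hexdigest() == want_sha256:
                    return bytes(buf)
            buf[off] = saved
        return None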
I'm a LOT more worried about a failure on enough *other* vdevs during a
resilver to lose redundancy AND data, which can really hose you. That is
a hell of a lot more likely statistically, and if it happens you had
better have backups, because that second failure, being unrelated to the
first, hoses you *hard*.
>
> If you have a mirror with two different bad bytes then brute forcing is
> even easier,
>
> Somehow, ZFS slaps my hands and does not allow me to be sure that I can
> restore data when I need it and decide for myself if it is okay or not.
>
> For long-term storage of big files it now seems better to store them on
> a UFS mirror, checksum each 512-byte block of the files, store the
> checksums separately, and run a monthly/weekly "scrub". This way I would
> sleep better.
>
For the love of God, no. For openers, how do you detect a corrupted
checksum file and know it's *the checksum file* that's bad, as opposed to
the data?
There are good reasons to run UFS instead of ZFS in many cases, but this
is definitely not one of them.
--
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/