Reading a corrupted file on ZFS
Karl Denninger
karl at denninger.net
Fri Feb 12 19:49:05 UTC 2021
On 2/12/2021 13:26, Artem Kuchin wrote:
> 12.02.2021 19:37, Karl Denninger wrote:
>> On 2/12/2021 11:22, Artem Kuchin wrote:
>>>
>>> This is frustrating. why..why..
>>
>> You created a synthetic situation that in the real world almost-never
>> exists (ONE byte modified in all copies in the same allocation block
>> but all other data in that block is intact and recoverable.)
>>
> It could be a 1 GB file on ZFS with a block size of 1 MB and with rotten
> bits within the same 1 MB block on different disks. How I did it is
> not important; life is unpredictable, and I'm not trying to avoid
> everything. The question is what to do when it happens. And currently
> the answer is: nothing.
>
The answer to a problem that, statistically speaking, does not present
itself in the real world is the last one you worry about. Worry about
all the other ones and cover them first.
I have had literally *hundreds* of drives fail in my time in IT. I used
to go change them in office machines (IBM PCs and clones -- the original
ones) 2-3 times a week at one of our larger customers. This was in the
days of 2,7 RLL encoding if you were lucky, for capacity reasons (MFM if
not)..... yeah, that far back and before. Of course when you have a
building full of these things, as they did, they break on a fairly regular
basis, and hopefully you've thought out the implications of that and how
to recover from it.
The *least* disruptive failures were sector-sized, and occasionally I'd
get the frantic phone call from some place that had no backups and was
utterly desperate because something like their master inventory file was
on that disk. There were times I was able to recover it with a lot of
work and luck, but when it was truly hosed and no amount of being willing
to wait would get you one good read, I was *never* able to narrow the
data loss down to less than one sector, irrespective of what the customer
was willing to pay. Ever. It sure wasn't cheap having actual recovery
done either (the clean-room style places can do it too, but the same issue
applies there; gone is gone, like it or not.)
Nowadays modern drives frequently claim they have 63 (or 255)
sectors/track but that's bollocks; their actual internal organization is
completely opaque beyond the firmware in the drive, in no small part
because a larger circumference can hold more data bits in a given
encoding than a smaller one, so the number of physical sectors on a given
track is not constant. Most modern "larger" disks have several *hundred*
sectors per physical track; in many cases they are physically 4k sectors,
and loss of servo information can result in the entire track being
unrecoverable.
As sizes go up, so does the amount of data at risk from a given event.
It's inherent in complex encoding schemes; the more you encode in a
symbol, the more you lose when the symbol is corrupted to the point it
cannot be recovered.
128 KB is all you lost? How quaint.
>> In almost-all actual cases of "bit rot" it's exactly that; random and
>> by statistics extraordinarily unlikely to hit all copies at once in
>> the same allocation block. Therefore, ZFS can and does fix it; UFS
>> or FAT silently returns the corrupted data, propagates it, and
>> eventually screws you down the road.
>
> In an active fs you are right. But if this is a storage disk with movies
> and photos, then I can just checksum all files with a little script
> and recheck once in a while. So, for storage purposes I have all the
> ZFS positives and also can read as much data as I can. Because for
> long-term storage it is more important to have the ability to read the
> data in any case.
>
I do that now with ZFS. In some cases (long-term online but
nearly read-only "cool" storage that is spun down when not being
accessed) I have backups that are only mounted and verified once a
year. Mount the backup volumes and scrub them, for this specific reason
-- to detect the very unlikely but possible chance that two bits will
get flipped in the identical checksummed domain (ZFS block size) on two
or more copies. If it happens you're hosed, and "cold" storage (disks in
a vault for a year) is where patrol scrubs are most likely to miss it,
because you're not doing them on a regular basis since the disks are
physically in a different place. Those backups are, for all intents and
purposes, a fire insurance policy.
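The once-a-year check itself is nothing exotic; something along these
lines does the job. A rough Python sketch -- the pool name "coldvault"
is made up and error handling is left out:

    #!/usr/bin/env python3
    # Import a cold backup pool, scrub it, report, export it again.
    import subprocess, sys, time

    POOL = "coldvault"   # hypothetical pool name

    def run(*cmd):
        print("+", " ".join(cmd))
        return subprocess.run(cmd, capture_output=True, text=True)

    run("zpool", "import", POOL)            # attach the backup pool
    run("zpool", "scrub", POOL)             # start the patrol read

    # Poll until the scrub finishes.
    while "scrub in progress" in run("zpool", "status", POOL).stdout:
        time.sleep(60)

    health = run("zpool", "status", "-x", POOL).stdout
    print(health)

    run("zpool", "export", POOL)            # detach it again
    sys.exit(0 if "is healthy" in health else 1)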
I've yet to need them but that doesn't mean I'll stop maintaining them.
I've also yet to have one fail verification but I'm sure it'll
eventually happen; that's why THOSE are redundant too.
>>
>> The nearly-every-case situation in the real world where a disk goes
>> physically bad (I've had this happen *dozens* of times over my IT
>> career) results in the drive being unable to
>
> *NEARLY* is not good enough for me.
>
Then make backups and make THOSE ZFS mirrors (which is what I do),
dismount them and put them in a second location. That's what backups
are for, and ZFS makes that pretty easy and efficient with snapshots and
send/receive. ZFS also makes it easy to segment data into "live" (updated
a lot and currently in use), "cool" (updated once in a while) and "online
but cold" (either R/O or close to it), and to move data from one
classification to another (which happens often over time); that in turn
allows you to *easily* build a backup strategy that works for all of it
and yet doesn't involve trying to cart a petabyte around in your car to
and from the bank vault. It also allows a snapshot capacity tailored to
each requirement so an "oops" (as opposed to a hardware failure) can be
trivially recovered from, which is not easy to do with UFS at all.
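The mechanics are one snapshot and one incremental send per dataset. A
bare-bones illustration (the dataset, pool and snapshot names here are
hypothetical):

    #!/usr/bin/env python3
    # Incremental ZFS backup: snapshot, then send only the delta since the
    # previous snapshot into the backup pool.
    import subprocess
    from datetime import date

    SRC  = "tank/photos"           # "cool" dataset on the live pool
    DST  = "backup/photos"         # its copy on the backup pool
    PREV = "backup-2021-01-01"     # last snapshot already on the backup
    SNAP = "backup-%s" % date.today()

    subprocess.run(["zfs", "snapshot", "%s@%s" % (SRC, SNAP)], check=True)

    # "zfs send -i <old> <new> | zfs receive" is the standard idiom.
    send = subprocess.Popen(
        ["zfs", "send", "-i", "%s@%s" % (SRC, PREV), "%s@%s" % (SRC, SNAP)],
        stdout=subprocess.PIPE)
    subprocess.run(["zfs", "receive", "-F", DST], stdin=send.stdout,
                   check=True)
    send.wait()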
If you have a 0.001 (1 in a thousand) risk over some period of time and
have a second copy off-site with the same risk, the odds of both failing
at the same time by anything other than a physical calamity that takes out
both locations are multiplicative. Increase your redundancy as required
until you're comfortable. If "Tsar Bomba" gets dropped on my location,
well..... :-)
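The arithmetic behind that is trivial but worth writing down (0.001 is
just the illustrative figure above):

    # Independent failures multiply.
    p = 0.001
    print(p * p)      # two copies fail together: 1e-06
    print(p ** 3)     # three copies: 1e-09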
I've had the ugly and wildly improbable nightmare scenario happen to me
in real life when I ran MCSNet -- a disk adapter that went insane
internally and scribbled on *every attached device* at once. It
destroyed the data on ALL of them. If I hadn't had a backup I would
have been *done* right there. This was before ZFS, and the pucker factor
on that was considerable.
>> return the block at all;
>
> You are mixing device blocks and ZFS blocks. As far as I remember, the
> default ZFS block for checksumming is 16K, and for big file storage it
> is better to have it around 128K.
No I don't. A block is a block is a block; they're different sizes
depending on the application and on where they sit in the stack of
"things" between the physical disk and the application doing the reading
or writing. In some cases I run small block sizes on ZFS, in others large
block sizes. Depends on the workload. But I have control of that; I have
no control over what the physical media does inside its firmware. In some
cases its idea of a "block" is small, in others it's large, and in still
others (SSDs in particular) how big it is depends on which block. Scramble
an allocation table in an SSD's controller and frequently everything on
the drive is gone *at once.* The scope of damage if you get an uncorrected
error off a given media depends on many things (e.g. the encoding in use,
the media's sector size, whether the error is in the data or in the servo
information on the platter if the device is spinning rust, whether the
error is in a data block or in the allocation/mapping table storage if
it's an SSD, etc.)
>> In short there are very, very few actual "in the wild" failures where
>> one byte is damaged and the rest surrounding that one byte is intact
>> and retrievable. In most cases where an actual failure occurs the
>> unreadable data constitutes *at least* a physical sector.
>>
> "very very few" is enough for me to think about.
Then keep backups and make sure THEY are redundant too. Do patrol scans
(e.g. scrubs) on those on some reasonable basis so you are comfortable
that the backups are good.
>
> One more thing. If you have one bad byte in a block of 16K and you
> have the checksum and recalculate it, then it is quite possible to just
> brute force every byte to match the checksum, thus restoring the data.
True. But as I said, across an unbelievable number of failures that I've
seen in my close to 40 years doing this stuff, I've yet to have *one*
instance in which a media failure took out less than one physical sector
of the device -- and even if it does, there still has to be one more
error of the *same sort* in the same logical block on the other vdev of
the mirror, otherwise it gets detected and fixed with no harm and no
foul. I'm sure it's possible for me to get screwed, but then again it's
also possible for me to get hit by an asteroid while getting my mail --
it's just statistically improbable to the point that I don't worry about
it very much.
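To be fair, for a single bad byte the brute force he describes is
mechanically easy enough. A rough sketch, assuming the checksum is a
separately stored SHA-256 rather than ZFS's internal one:

    import hashlib

    def repair_single_byte(block, want_sha256):
        # Try every value at every offset until the checksum matches again.
        # Worst case 256 * len(block) hashes -- tractable for one bad byte
        # in a 16K block, hopeless once two or more bytes (or the stored
        # checksum itself) are bad.
        buf = bytearray(block)
        for off in range(len(buf)):
            saved = buf[off]
            for val in range(256):
                buf[off] = val
                if hashlib.sha256(buf).hexdigest() == want_sha256:
                    return bytes(buf)
            buf[off] = saved
        return None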
I'm a LOT more worried about a failure on enough *other* vdevs during a
resilver to lose redundancy AND data, which can really hose you. That is
a hell of a lot more likely statistically, and if it happens you had
better have backups, because that second failure, being unrelated to the
first, hoses you *hard*.
>
> If you have a mirror with two different bad bytes then brute forcing is
> even easier,
>
> Somehow, ZFS slaps my hands and does not allow me to be sure that I can
> restore data when I need it and decide for myself if it is okay or not.
>
> For long-term storage of big files it now seems better to store them on
> a UFS mirror, checksum each 512-byte block of the files, store the
> checksums separately, and run a monthly/weekly "scrub". This way I would
> sleep better.
>
For the love of God, no. For openers, how do you detect a corrupted
checksum file and know it's *the checksum file* that's bad, as opposed to
the data?
There are good reasons to run UFS instead of ZFS in many cases, but this
is definitely not one of them.
--
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/