Re: ASC/ASCQ Review

From: Warner Losh <imp@bsdimp.com>
Date: Wed, 19 Jul 2023 15:41:37 UTC
Btw, it also occurs to me that if I do add a 'secondary' table, then you
could use it to generate a unique errno and experiment with that without
affecting the main code until that approach matures.
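
Roughly the sort of thing I have in mind (just a sketch; the names and
layout are illustrative, not the existing scsi_all.c tables):

#include <sys/param.h>          /* nitems() */
#include <errno.h>
#include <stdint.h>

/* Secondary table, consulted after the normal sense-action lookup. */
struct asc_errno_entry {
        uint8_t asc;
        uint8_t ascq;
        int     error;          /* errno handed to the upper layers */
};

static const struct asc_errno_entry asc_errno_table[] = {
        { 0x11, 0x00, EINTEGRITY },     /* UNRECOVERED READ ERROR */
        { 0x11, 0x01, EINTEGRITY },     /* READ RETRIES EXHAUSTED */
        /* ... */
};

static int
asc_to_errno(uint8_t asc, uint8_t ascq, int fallback)
{
        for (size_t i = 0; i < nitems(asc_errno_table); i++) {
                if (asc_errno_table[i].asc == asc &&
                    asc_errno_table[i].ascq == ascq)
                        return (asc_errno_table[i].error);
        }
        return (fallback);      /* no override: keep today's errno */
}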

I'm not sure I'll do that now, since I've found maybe 10 ASC/ASCQ pairs that
I'd like to tag as 'retry if trying harder, otherwise fail'. Retry needs have
changed a lot since CAM was written in the late 90s, and at least some of the
ASC/ASCQ pairs I'm looking at haven't changed since the initial import, but
that's based on a tiny sampling of the data I have and is preliminary at
best. I may just change the defaults to reflect modern usage.

Warner

On Fri, Jul 14, 2023 at 5:34 PM Warner Losh <imp@bsdimp.com> wrote:

>
>
> On Fri, Jul 14, 2023 at 12:31 PM Alan Somers <asomers@freebsd.org> wrote:
>
>> On Fri, Jul 14, 2023 at 11:05 AM Warner Losh <imp@bsdimp.com> wrote:
>> >
>> >
>> >
>> > On Fri, Jul 14, 2023, 11:12 AM Alan Somers <asomers@freebsd.org> wrote:
>> >>
>> >> On Thu, Jul 13, 2023 at 12:14 PM Warner Losh <imp@bsdimp.com> wrote:
>> >> >
>> >> > Greetings,
>> >> >
>> >> > I've been looking closely at failed drives for $WORK lately. I've
>> noticed that a lot of errors that kinda sound like fatal errors have
>> SS_RDEF set on them.
>> >> >
>> >> > What's the process for evaluating whether those error codes are
>> worth retrying? There are several errors that we seem to be seeing
>> (preliminary read of the data) before the drive gives up the ghost
>> altogether. For those cases, I'd like to post more specific lists. Should I
>> do that here?
>> >> >
>> >> > Independent of that, I may want to have a more aggressive 'fail
>> fast' policy than is appropriate for my work load (we have a lot of data
>> that's a copy of a copy of a copy, so if we lose it, we don't care: we'll
>> just delete any files we can't read and get on with life, though I know
>> others will have a more conservative attitude towards data that might be
>> precious and unique). I can set the number of retries lower, and I can do
>> other per-disk hacks that tell the disk to fail faster, but I think part of
>> the solution is going to have to be failing fast for some sense-key/ASC/ASCQ
>> tuples that we wouldn't want to fail in upstream or the general case. I was
>> thinking of identifying those and creating a 'global quirk table' that gets
>> applied after the drive-specific quirk table that would let $WORK override
>> the defaults, while letting others keep the current behavior. IMHO, it
>> would be better to have these separate rather than in the global data for
>> tracking upstream...
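>>
>> Roughly this shape (illustrative only; these aren't existing CAM
>> structures, though SS_FATAL and SSD_KEY_MEDIUM_ERROR are the usual
>> scsi_all.h names):
>>
>> #include <sys/errno.h>
>> #include <stdint.h>
>> #include <cam/scsi/scsi_all.h>  /* SS_FATAL, SSD_KEY_MEDIUM_ERROR */
>>
>> struct sense_override {
>>         const char      *vendor;        /* glob, like the existing quirks */
>>         const char      *product;
>>         uint8_t          sense_key;
>>         uint8_t          asc;
>>         uint8_t          ascq;
>>         uint32_t         action;        /* replaces the default action */
>> };
>>
>> static const struct sense_override site_sense_overrides[] = {
>>         /* Fail fast on unrecovered read errors, for any drive. */
>>         { "*", "*", SSD_KEY_MEDIUM_ERROR, 0x11, 0x00, SS_FATAL | EIO },
>> };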
>> >> >
>> >> > Is that clear, or should I give concrete examples?
>> >> >
>> >> > Comments?
>> >> >
>> >> > Warner
>> >>
>> >> Basically, you want to change the retry counts for certain ASC/ASCQ
>> >> codes only, on a site-by-site basis?  That sounds reasonable.  Would
>> >> it be configurable at runtime or only at build time?
>> >
>> >
>> > I'd like to change the default actions. But maybe we just do that for
>> everyone and assume modern drives...
>> >
>> >> Also, I've been thinking lately that it would be real nice if READ
>> >> UNRECOVERABLE could be translated to EINTEGRITY instead of EIO.  That
>> >> would let consumers know that retries are pointless, but that the data
>> >> is probably healable.
>> >
>> >
>> > Unlikely, unless you've tuned things to not try for long at recovery...
>> >
>> > But regardless... do you have a concrete example of a use case? There are
>> a number of places that map any error to EIO. And I'd like a use case
>> before we expand the errors the lower layers return...
>> >
>> > Warner
>>
>> My first use-case is a user-space FUSE file system.  It only has
>> access to errnos, not ASC/ASCQ codes.  If we do as I suggest, then it
>> could heal a READ UNRECOVERABLE by rewriting the sector, whereas other
>> EIO errors aren't likely to be healed that way.
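>>
>> For example (just a sketch; read_replica() is a stand-in for wherever the
>> redundant copy would come from, not an existing API):
>>
>> #include <errno.h>
>> #include <sys/types.h>
>> #include <unistd.h>
>>
>> /* Hypothetical source of a known-good copy of the block. */
>> extern int read_replica(void *buf, size_t len, off_t off);
>>
>> static int
>> read_with_heal(int fd, void *buf, size_t len, off_t off)
>> {
>>         ssize_t n = pread(fd, buf, len, off);
>>
>>         if (n == (ssize_t)len)
>>                 return (0);
>>         if (n >= 0 || errno != EINTEGRITY)
>>                 return (-EIO);          /* short read or plain EIO: give up */
>>
>>         /* The media lost the data but the LBA itself should still be
>>          * writable: fetch a good copy and rewrite the sector to heal it. */
>>         if (read_replica(buf, len, off) != 0)
>>                 return (-EIO);
>>         (void)pwrite(fd, buf, len, off);        /* best-effort heal */
>>         return (0);
>> }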
>>
>
> Yea... but READ UNRECOVERABLE is kinda hit or miss...
>
>
>> My second use-case is ZFS.  zfsd treats checksum errors differently
>> from I/O errors.  A checksum error normally means that a read returned
>> wrong data.  But I think that READ UNRECOVERABLE should also count.
>> After all, that means that the disk's media returned wrong data which
>> was detected by the disk's own EDC/ECC.  I've noticed that zfsd seems
>> to fault disks too eagerly when their only problem is READ
>> UNRECOVERABLE errors.  Mapping it to EINTEGRITY, or even a new error
>> code, would let zfsd be tuned better.
>>
>
> EINTEGRITY would then mean two different things. UFS returns it when
> checksums fail on critical filesystem metadata. I'm not saying no, per se,
> just that it conflates two different errors.
>
> I think both of these use cases would be better served by CAM's publishing
> of the errors to devctl today. Here's some example data from a system I'm
> looking at:
>
> system=CAM subsystem=periph type=timeout device=da36 serial="12345"
> cam_status="0x44b" timeout=30000 CDB="28 00 4e b7 cb a3 00 04 cc 00 "
>  timestamp=1634739729.312068
> system=CAM subsystem=periph type=timeout device=da36 serial="12345"
> cam_status="0x44b" timeout=30000 CDB="28 00 20 6b d5 56 00 00 c0 00 "
>  timestamp=1634739729.585541
> system=CAM subsystem=periph type=error device=da36 serial="12345"
> cam_status="0x4cc" scsi_status=2 scsi_sense="72 03 11 00" CDB="28 00 ad 1a
> 35 96 00 00 56 00 " timestamp=1641979267.469064
> system=CAM subsystem=periph type=error device=da36 serial="12345"
> cam_status="0x4cc" scsi_status=2 scsi_sense="72 03 11 00" CDB="28 00 ad 1a
> 35 96 00 01 5e 00 "  timestamp=1642252539.693699
> system=CAM subsystem=periph type=error device=da39 serial="12346"
> cam_status="0x4cc" scsi_status=2 scsi_sense="72 04 02 00" CDB="2a 00 01 2b
> c8 f6 00 07 81 00 "  timestamp=1669603144.090835
>
> Here we get the sense key, the asc and the ascq in the scsi_sense data
> (I'm currently looking at expanding this to the entire sense buffer, since
> it includes how hard the drive tried to read the data on media and hardware
> errors).  It doesn't include nvme data, but does include ata data (I'll
> have to add the nvme data, now that I've noticed it is missing).  With the
> sense data and the CDB you know what kind of error you got, plus what block
> didn't read/write correctly. With the extended sense data, you can find out
> even more details that are sense-key dependent...
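>
> For instance, decoding the third event above by hand (just an illustration;
> it assumes descriptor-format sense, which the leading 0x72 indicates, and a
> READ(10) CDB):
>
> #include <stdio.h>
>
> int
> main(void)
> {
>         const char *sense = "72 03 11 00";      /* scsi_sense= from devctl */
>         const char *cdb = "28 00 ad 1a 35 96 00 00 56 00";
>         unsigned s[4], c[10];
>
>         sscanf(sense, "%x %x %x %x", &s[0], &s[1], &s[2], &s[3]);
>         sscanf(cdb, "%x %x %x %x %x %x %x %x %x %x", &c[0], &c[1], &c[2],
>             &c[3], &c[4], &c[5], &c[6], &c[7], &c[8], &c[9]);
>
>         /* 0x3/0x11/0x00: MEDIUM ERROR, unrecovered read error */
>         printf("sense key 0x%x asc 0x%x ascq 0x%x\n", s[1] & 0x0f, s[2], s[3]);
>         /* READ(10): bytes 2-5 are the LBA, bytes 7-8 the transfer length */
>         printf("opcode 0x%x lba %u blocks %u\n", c[0],
>             (c[2] << 24) | (c[3] << 16) | (c[4] << 8) | c[5],
>             (c[7] << 8) | c[8]);
>         return (0);
> }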
>
> So I'm unsure that shoehorning our imperfect knowledge of what's retriable,
> what's fixable, and what should be rewritten with zeros into the kernel, and
> converting that to a separate errno, would give good results; letting the
> daemons that want to make more nuanced calls about disks tap into this
> stream might be the better way to go. One of the things I'm planning for
> $WORK is to enable the recovery time limit in one of the mode pages so that
> we fail faster and can just delete the file with the 'bad' block. We'd get
> there eventually if we allowed the full, default error processing to run,
> but that 'slow path' processing kills performance for all other users of the
> drive...  I'm unsure how well that will work out (and I know I'm lucky that
> I can always recover any data for my application since it's just a cache).
>
> I'd be interested to hear what others have to say here though, since my
> focus on this data is through the lens of my rather specialized
> application...
>
> Warner
>
> P.S. That was generated with this rule if you wanted to play with it...
> You'd have to translate absolute disk blocks to a partition and an offset
> into the filesystem, then give the filesystem a chance to tell you which of
> its data/metadata that block is used for...
>
> # Disk errors
> notify 10 {
>         match "system"          "CAM";
>         match "subsystem"       "periph";
>         match "device"          "[an]?da[0-9]+";
>         action "logger -t diskerr -p daemon.info $_ timestamp=$timestamp";
> };
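>
> As a rough sketch of the first step of that translation (assuming 512-byte
> logical sectors and a partition start LBA read off of 'gpart show'; the
> numbers here are made up):
>
> #include <stdint.h>
> #include <stdio.h>
>
> int
> main(void)
> {
>         uint64_t lba = 0xad1a3596;      /* absolute LBA from the CDB */
>         uint64_t part_start = 40;       /* partition start, from gpart show */
>         uint32_t sector = 512;          /* logical sector size */
>         uint32_t fs_bsize = 32768;      /* filesystem block size */
>         uint64_t rel = lba - part_start;
>
>         printf("partition-relative sector %ju, fs block %ju\n",
>             (uintmax_t)rel, (uintmax_t)(rel * sector / fs_bsize));
>         return (0);
> }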
>
>