Re: ASC/ASCQ Review
Date: Fri, 21 Jul 2023 03:18:44 UTC
On 2023-07-19 11:41, Warner Losh wrote:
> btw, it also occurs to me that if I do add a 'secondary' table, then you
> could use it to generate a unique errno and experiment with that w/o
> affecting the main code until that stuff was mature.
>
> I'm not sure I'll do that now, since I've found maybe 10 asc/ascq pairs
> that I'd like to tag as 'if trying harder, retry, otherwise fail' since
> re-retry needs have changed a lot since cam was written in the late 90s
> and at least some of the asc/ascq pairs I'm looking at haven't changed
> since the initial import, but that's based on a tiny sampling of the data
> I have and is preliminary at best. I may just change it to reflect modern
> usage.

Hi,
If you are looking for up-to-date [20230325] asc/ascq tables in C you
could borrow mine at https://github.com/doug-gilbert/sg3_utils in
lib/sg_lib_data.c starting at line 745. In testing/sg_chk_asc.c is a
small test program for checking that the table in sg_lib_data.c agrees
with the file that T10 supplies:
    https://www.t10.org/lists/asc-num.txt

Doug Gilbert

> On Fri, Jul 14, 2023 at 5:34 PM Warner Losh <imp@bsdimp.com> wrote:
>
>     On Fri, Jul 14, 2023 at 12:31 PM Alan Somers <asomers@freebsd.org> wrote:
>
>         On Fri, Jul 14, 2023 at 11:05 AM Warner Losh <imp@bsdimp.com> wrote:
>          >
>          > On Fri, Jul 14, 2023, 11:12 AM Alan Somers <asomers@freebsd.org> wrote:
>          >>
>          >> On Thu, Jul 13, 2023 at 12:14 PM Warner Losh <imp@bsdimp.com> wrote:
>          >> >
>          >> > Greetings,
>          >> >
>          >> > I've been looking closely at failed drives for $WORK lately.
>          >> > I've noticed that a lot of errors that kinda sound like fatal
>          >> > errors have SS_RDEF set on them.
>          >> >
>          >> > What's the process for evaluating whether those error codes
>          >> > are worth retrying?
>          >> > There are several errors that we seem to be seeing
>          >> > (preliminary read of the data) before the drive gives up the
>          >> > ghost altogether. For those cases, I'd like to post more
>          >> > specific lists. Should I do that here?
>          >> >
>          >> > Independent of that, I may want to have a more aggressive
>          >> > 'fail fast' policy than is appropriate for my work load (we
>          >> > have a lot of data that's a copy of a copy of a copy, so if
>          >> > we lose it, we don't care: we'll just delete any files we
>          >> > can't read and get on with life, though I know others will
>          >> > have a more conservative attitude towards data that might be
>          >> > precious and unique). I can set the number of retries lower,
>          >> > I can do some other hacks for disks that tell the disk to
>          >> > fail faster, but I think part of the solution is going to
>          >> > have to be failing for some sense-code/ASC/ASCQ tuples that
>          >> > we don't want to fail in upstream or the general case. I was
>          >> > thinking of identifying those and creating a 'global quirk
>          >> > table' that gets applied after the drive-specific quirk table
>          >> > that would let $WORK override the defaults, while letting
>          >> > others keep the current behavior. IMHO, it would be better to
>          >> > have these separate rather than in the global data for
>          >> > tracking upstream...
>          >> >
>          >> > Is that clear, or should I give concrete examples?
>          >> >
>          >> > Comments?
>          >> >
>          >> > Warner
>          >>
>          >> Basically, you want to change the retry counts for certain ASC/ASCQ
>          >> codes only, on a site-by-site basis? That sounds reasonable. Would
>          >> it be configurable at runtime or only at build time?
>          >
>          > I'd like to change the default actions. But maybe we just do that
>          > for everyone and assume modern drives...
>          >
>          >> Also, I've been thinking lately that it would be real nice if READ
>          >> UNRECOVERABLE could be translated to EINTEGRITY instead of EIO. That
>          >> would let consumers know that retries are pointless, but that the
>          >> data is probably healable.
>          >
>          > Unlikely, unless you've tuned things to not try for long at
>          > recovery...
>          >
>          > But regardless... do you have a concrete example of a use case?
>          > There's a number of places that map any error to EIO. And I'd like
>          > a use case before we expand the errors the lower layers return...
>          >
>          > Warner
>
>         My first use-case is a user-space FUSE file system. It only has
>         access to errnos, not ASC/ASCQ codes. If we do as I suggest, then it
>         could heal a READ UNRECOVERABLE by rewriting the sector, whereas other
>         EIO errors aren't likely to be healed that way.
>
>     Yea... but READ UNRECOVERABLE is kinda hit or miss...
>
>         My second use-case is ZFS. zfsd treats checksum errors differently
>         from I/O errors. A checksum error normally means that a read returned
>         wrong data. But I think that READ UNRECOVERABLE should also count.
>         After all, that means that the disk's media returned wrong data which
>         was detected by the disk's own EDC/ECC. I've noticed that zfsd seems
>         to fault disks too eagerly when their only problem is READ
>         UNRECOVERABLE errors. Mapping it to EINTEGRITY, or even a new error
>         code, would let zfsd be tuned better.
>
>     EINTEGRITY would then mean two different things. UFS returns it when
>     checksums fail for critical filesystem errors. I'm not saying no, per se,
>     just that it conflates two different errors.
>
>     I think both of these use cases would be better served by CAM's
>     publishing of the errors to devctl today.
>     Here's some example data from a system I'm looking at:
>
>     system=CAM subsystem=periph type=timeout device=da36 serial="12345"
>     cam_status="0x44b" timeout=30000
>     CDB="28 00 4e b7 cb a3 00 04 cc 00 " timestamp=1634739729.312068
>     system=CAM subsystem=periph type=timeout device=da36 serial="12345"
>     cam_status="0x44b" timeout=30000
>     CDB="28 00 20 6b d5 56 00 00 c0 00 " timestamp=1634739729.585541
>     system=CAM subsystem=periph type=error device=da36 serial="12345"
>     cam_status="0x4cc" scsi_status=2 scsi_sense="72 03 11 00"
>     CDB="28 00 ad 1a 35 96 00 00 56 00 " timestamp=1641979267.469064
>     system=CAM subsystem=periph type=error device=da36 serial="12345"
>     cam_status="0x4cc" scsi_status=2 scsi_sense="72 03 11 00"
>     CDB="28 00 ad 1a 35 96 00 01 5e 00 " timestamp=1642252539.693699
>     system=CAM subsystem=periph type=error device=da39 serial="12346"
>     cam_status="0x4cc" scsi_status=2 scsi_sense="72 04 02 00"
>     CDB="2a 00 01 2b c8 f6 00 07 81 00 " timestamp=1669603144.090835
>
>     Here we get the sense key, the asc and the ascq in the scsi_sense data
>     (I'm currently looking at expanding this to the entire sense buffer,
>     since it includes how hard the drive tried to read the data on media and
>     hardware errors). It doesn't include nvme data, but does include ata
>     data (I'll have to add that data, now that I've noticed it is missing).
>     With the sense data and the CDB you know what kind of error you got,
>     plus what block didn't read/write correctly. With the extended sense
>     data, you can find out even more details that are sense-key dependent...
>
>     So I'm unsure that trying to shoehorn our imperfect knowledge of what's
>     retriable, fixable, should be written with zeros into the kernel and
>     converting that to a separate errno would give good results, and tapping
>     into this stream daemons that want to make more nuanced calls about
>     disks might be the better way to go.
>     One of the things I'm planning for $WORK is to enable the retry time
>     limit of one of the mode pages so that we fail faster and can just
>     delete the file with the 'bad' block that we'd get eventually if we
>     allowed the full, default error processing to run, but that 'slow path'
>     processing kills performance for all other users of the drive... I'm
>     unsure how well that will work out (and I know I'm lucky that I can
>     always recover any data for my application since it's just a cache).
>
>     I'd be interested to hear what others have to say here though, since my
>     focus on this data is through the lens of my rather specialized
>     application...
>
>     Warner
>
>     P.S. That was generated with this rule if you wanted to play with it...
>     You'd have to translate absolute disk blocks to a partition and an
>     offset into the filesystem, then give the filesystem a chance to tell
>     you what of its data/metadata that block is used for...
>
>     # Disk errors
>     notify 10 {
>             match "system" "CAM";
>             match "subsystem" "periph";
>             match "device" "[an]?da[0-9]+";
>             action "logger -t diskerr -p daemon.info $_ timestamp=$timestamp";
>     };
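
The quoted devctl records carry the sense data and the CDB as hex strings,
and Warner notes that together they tell you what kind of error occurred and
which block was involved. A minimal sketch of that decoding follows. It
assumes the four scsi_sense bytes are, in order, the sense response code, the
sense key, the ASC and the ASCQ (the quoted text only says the field holds
the sense key, asc and ascq), and it handles only the 10-byte READ(10) and
WRITE(10) CDBs that appear in the sample. The function names are
hypothetical, not part of any FreeBSD tool.

```python
import shlex

# Sense key names as defined by the SCSI Primary Commands standard.
SENSE_KEYS = {
    0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
    0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION", 0x7: "DATA PROTECT", 0xB: "ABORTED COMMAND",
}

# Opcodes of the two 10-byte commands seen in the sample records.
RW10_OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def parse_event(line):
    """Split a 'key=value key="quoted value"' record into a dict."""
    tokens = shlex.split(line)  # keeps quoted values together
    return dict(tok.split("=", 1) for tok in tokens if "=" in tok)


def decode_sense(field):
    """Decode the scsi_sense field.

    ASSUMPTION: byte order is response code, sense key, ASC, ASCQ.
    """
    code, key, asc, ascq = (int(b, 16) for b in field.split())
    return {
        "response_code": code,
        "sense_key": key,
        "sense_key_name": SENSE_KEYS.get(key, "UNKNOWN"),
        "asc": asc,
        "ascq": ascq,
    }


def decode_rw10(field):
    """Decode a READ(10)/WRITE(10) CDB into (operation, lba, block_count)."""
    cdb = bytes(int(b, 16) for b in field.split())
    if len(cdb) != 10 or cdb[0] not in RW10_OPCODES:
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return RW10_OPCODES[cdb[0]], lba, blocks


# One of the records quoted above, rejoined onto a single line.
EXAMPLE = (
    'system=CAM subsystem=periph type=error device=da36 serial="12345" '
    'cam_status="0x4cc" scsi_status=2 scsi_sense="72 03 11 00" '
    'CDB="28 00 ad 1a 35 96 00 00 56 00 " timestamp=1641979267.469064'
)

event = parse_event(EXAMPLE)
sense = decode_sense(event["scsi_sense"])
operation, lba, blocks = decode_rw10(event["CDB"])
```

For the da36 record this yields sense key 3 (MEDIUM ERROR) with ASC/ASCQ
11h/00h, which the T10 list names UNRECOVERED READ ERROR, on a READ(10)
starting at LBA 0xad1a3596. Mapping that absolute LBA to a partition and a
file remains a separate step, as the P.S. points out.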