propose: change some sense codes handling

Tue Apr 5 11:46:27 UTC 2011

On Apr 5, 2011, at 1:15 PM, Andriy Gapon wrote:

> 
> I propose the following changes:
> 
> -	{ SST(0x28, 0x00, SS_FATAL | ENXIO,
> +	{ SST(0x28, 0x00, SS_TUR | SSQ_MANY | SSQ_DECREMENT_COUNT | EBUSY,
> 	    "Not ready to ready change, medium may have changed") },
> In my opinion this condition doesn't really mean a fatal error, but implies that
> we should retry while new medium "settles down".

As far as I know, this shouldn't be reported by a non-removable media device. It should be used by removable media such as tape units, magneto-optical drives, CDROM drives, WORMs...

Many years ago I used to write to SCSI tapes. If the operator changed a tape, for example, while the tape was idle, the next read or write command returned this code, indicating that there was a media change. And it was important indeed, as our application sometimes wrote to tape in relatively small chunks and it only rewound the tape when necessary.

So, if the system was expecting a given tape to be in the unit and it tried to write, that attempt failed, reporting a tape change. The software then issued a rewind command and read the tape label to check that it was the right tape (in which case it issued a seek to the end of the recorded data), or else it created a new tape label, labelled the tape, and so on.

Assuming that manufacturers are using it as expected, if this is reported by a removable-media random-access device (say, a magneto-optical disk), it should result in the disappearance of the "changed" disk and the creation of a new one. I mean: reread the partition table and so on, and invalidate any mount points related to the "disappeared" device.
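
To make the idea concrete, here is a rough sketch in C of the decision described above: after a "medium may have changed" report, read the label back and either resume or invalidate. This is only an illustration. The names media_state, media_action and on_medium_changed are hypothetical and do not exist in CAM or anywhere in the FreeBSD tree, and a 32-bit label stands in for whatever a real driver would compare.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: none of these names come from CAM or FreeBSD. */
enum media_action {
	MEDIA_RESUME,		/* same medium: reposition and carry on */
	MEDIA_INVALIDATE	/* different medium: drop mounts, re-read tables */
};

struct media_state {
	uint32_t expected_label;	/* label of the medium we believe is loaded */
	bool	 mounted;		/* true while a filesystem is using it */
};

/*
 * Called after the device reported "not ready to ready change".
 * label_read_back is the label read from the medium now in the unit.
 */
enum media_action
on_medium_changed(struct media_state *st, uint32_t label_read_back)
{
	if (label_read_back == st->expected_label)
		return (MEDIA_RESUME);

	/* A different medium is loaded: the old "disk" is gone. */
	st->mounted = false;
	st->expected_label = label_read_back;
	return (MEDIA_INVALIDATE);
}
```

The point of the sketch is that the code is neither blindly retried nor treated as a dead device; the outcome depends on what is actually in the unit.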

> In my testing this change actually helps with some USB flashdrives and
> cardreaders with slow access to media.

If a card reader reports this, I assume that either the reader has crappy firmware _or_ it has an electrical contact problem with the media. But simply ignoring this error could lead to data loss. If a user replaces a memory card that holds a mounted filesystem, it would certainly mean data loss (blocks intended for one card written to a different card?).

> Perhaps some real SCSI devices use this sense code to signal a really "fatal"
> condition?  Please let me know.
> 
> --- a/sys/cam/scsi/scsi_all.c
> +++ b/sys/cam/scsi/scsi_all.c
> @@ -1448,7 +1448,7 @@ static struct asc_table_entry asc_table[] = {
> 	 * the networking errnos?  ECONNRESET anyone?
> 	 */
> 	/* DTLPWROMAEBKVF */
> -	{ SST(0x29, 0x00, SS_FATAL | ENXIO,
> +	{ SST(0x29, 0x00, SS_RDEF,
> 	    "Power on, reset, or bus device reset occurred") },
> 	/* DTLPWROMAEBKVF */
> 	{ SST(0x29, 0x01, SS_RDEF,
> 
> Align handling of this condition with the rest of the conditions in the same
> family: "Power on occurred", "SCSI bus reset occurred", "Bus device reset
> function occurred", etc.
> I don't see why this particular condition should be special.
> Any insights and/or historical reasons?

I would be cautious with this. Of course, if it happens with no outstanding operations and all data committed to media, it should be harmless. But if you power cycle a hard disk with a dirty cache, some of the data won't be committed to disk. If you just retry the operation and otherwise ignore the message (which is equivalent to just logging and retrying), you keep writing data to a possibly corrupted medium. That can certainly lead to further corruption and make the problem worse.
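
As a rough sketch of the distinction I have in mind, in C: retry only when nothing could have been lost across the reset, and fail otherwise. Everything here is hypothetical (dev_state, reset_action and on_power_on_reset are names made up for this mail, not CAM interfaces), and it assumes the host can track whether writes were acknowledged since the last cache flush.

```c
#include <stdbool.h>

/* Hypothetical model: none of these names come from CAM or FreeBSD. */
enum reset_action {
	RESET_RETRY,	/* nothing at risk: retrying is harmless */
	RESET_FAIL	/* data may have been lost: report the error upward */
};

struct dev_state {
	unsigned outstanding_writes;	/* writes issued but not yet completed */
	bool	 write_cache_enabled;	/* device acks writes before committing */
	bool	 unflushed_writes;	/* writes acked since the last cache flush */
};

/* Decide what to do after "power on, reset, or bus device reset occurred". */
enum reset_action
on_power_on_reset(const struct dev_state *s)
{
	/* In-flight writes may have been dropped by the reset. */
	if (s->outstanding_writes > 0)
		return (RESET_FAIL);

	/* Acked writes still in a volatile cache may never reach the medium. */
	if (s->write_cache_enabled && s->unflushed_writes)
		return (RESET_FAIL);

	return (RESET_RETRY);
}
```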

My opinion, of course ;)

Borja.