Overlapped Commands error
Scott Long
scottl at samsco.org
Wed Jun 16 23:32:23 UTC 2010
On Jun 16, 2010, at 10:17 AM, Andrew Boyer wrote:
> Hello SCSI experts,
> We recently saw this SCSI command error:
>
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): READ(10). CDB: 28 0 2 c8 7f a0 0 0 20 0
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): CAM Status: SCSI Status Error
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): SCSI Status: Check Condition
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): ABORTED COMMAND asc:4e,0
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): Overlapped commands attempted field replaceable unit: 1
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): Retrying Command (per Sense Data)
>> Jun 15 15:08:37 eval12 kernel: mpt0: request 0xffffffff815d5c20:40101 timed out for ccb 0xffffff000d54d800 (req->ccb 0xffffff000d54d800)
>> Jun 15 15:08:37 eval12 kernel: mpt0: attempting to abort req 0xffffffff815d5c20:40101 function 0
>> Jun 15 15:08:38 eval12 kernel: mpt0: mpt_wait_req(1) timed out
>> Jun 15 15:08:38 eval12 kernel: mpt0: mpt_recover_commands: abort timed-out. Resetting controller
>> Jun 15 15:08:38 eval12 kernel: mpt0: mpt_cam_event: 0x0
>> Jun 15 15:08:38 eval12 kernel: mpt0: mpt_cam_event: 0x0
>> Jun 15 15:08:38 eval12 kernel: mpt0: completing timedout/aborted req 0xffffffff815d5c20:40101
>> Jun 15 15:09:00 eval12 kernel: mpt0: mpt_cam_event: 0x16
>> Jun 15 15:09:00 eval12 kernel: mpt0: mpt_cam_event: 0x12
>> Jun 15 15:09:00 eval12 kernel: mpt0: mpt_cam_event: 0x16
>
> No one here has ever seen this before. We're using a CAM and MPT stack from August 2009 with an LSI1068e HBA connected to Seagate SAS HDDs.
>
> This is what the SCSI Architecture Manual (SAM-5 draft) has to say about overlapped commands:
>> [...]
>
> Can anyone point me to where in the stack the command identifier is assigned? I see where MPT assigns tags in target mode, but it's the initiator in this case. Any advice?
Don't want to step on Matt, but wanted to expand on what he's said so far.
CAM doesn't assign tag identifiers for initiator I/O; it leaves that up to the driver and hardware. The tag_id field that you see in CCBs is for target I/O only. In the case of MPT, the firmware assigns tags, while on simpler controllers like ESP the driver does it. CAM does provide the tag action message (SIMPLE, ORDERED, or HEAD_OF_Q), and it's up to the driver to relay that to the hardware, which MPT does in mpt_start().
The MPT architecture abstracts a lot of the transport protocol away, so it's generally assumed that it's going to do the right thing in a case like this. I don't know if the firmware is wrong, or if FreeBSD is wrong. CAM almost always attaches a SIMPLE action flag to I/O commands, and the MPT driver looks like it will faithfully translate that into the corresponding MPT flag. By looking at the inquiry data, it's roughly possible to determine whether the device supports tagged queuing, so maybe CAM needs to be smarter about this. Instead of the TQ flag just affecting command scheduling, maybe it also needs to suppress attaching the SIMPLE action flag, and the MPT driver should then set an UNTAGGED flag to match.
I would expect the MPT firmware to look at the inquiry data and behave appropriately regardless of what is sent in the MPT I/O request, but again, maybe that's asking too much. If you're adventurous, try modifying the MPT driver to always set the MPI_SCSIIO_CONTROL_UNTAGGED flag in mpt_start(), and see if that makes your problem go away.
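To make the idea concrete, here is a rough, standalone sketch of the selection logic, not the actual mpt_start() code. Every ex_/EX_ name is hypothetical, and the control-bit values are placeholders for illustration only; a real change would use the MPI_SCSIIO_CONTROL_* definitions from the MPI headers. The inquiry check assumes the CmdQue bit lives in byte 7, bit 1 of standard INQUIRY data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the CAM tag actions. */
enum ex_tag_action { EX_TAG_NONE, EX_TAG_SIMPLE, EX_TAG_HEAD_OF_Q, EX_TAG_ORDERED };

/* Placeholder control bits; NOT taken from the MPI headers. */
#define EX_CTRL_SIMPLEQ   0x0000u
#define EX_CTRL_HEADOFQ   0x0100u
#define EX_CTRL_ORDEREDQ  0x0200u
#define EX_CTRL_UNTAGGED  0x0500u
#define EX_CTRL_TAG_MASK  0x0700u

/* Assumption: CmdQue is byte 7, bit 1 of standard INQUIRY data. */
static bool
ex_inquiry_supports_tq(const uint8_t *inq, size_t len)
{
	assert(inq != NULL);
	return (len > 7 && (inq[7] & 0x02) != 0);
}

/*
 * Pick the queue-tag bits for one request.  If the device does not
 * advertise tagged queuing, force UNTAGGED no matter what the upper
 * layer asked for; otherwise translate the requested tag action.
 */
static uint32_t
ex_tag_control(enum ex_tag_action action, bool dev_supports_tq,
    uint32_t control)
{
	control &= ~EX_CTRL_TAG_MASK;	/* clear any previous tag bits */

	if (!dev_supports_tq)
		return (control | EX_CTRL_UNTAGGED);

	switch (action) {
	case EX_TAG_HEAD_OF_Q:
		return (control | EX_CTRL_HEADOFQ);
	case EX_TAG_ORDERED:
		return (control | EX_CTRL_ORDEREDQ);
	case EX_TAG_SIMPLE:
		return (control | EX_CTRL_SIMPLEQ);
	case EX_TAG_NONE:
	default:
		return (control | EX_CTRL_UNTAGGED);
	}
}
```

The "always untagged" experiment suggested above amounts to calling ex_tag_control() with dev_supports_tq forced to false for every request.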
>
> Also, is CAM doing the right thing by retrying? scsi_error_action() in cam/scsi/scsi_all.c always sets the retry bit on aborted commands, even though the spec quoted above makes it sound like this should be a fatal error ("This is considered a catastrophic failure on the part of the SCSI initiator device"). Should scsi_error_action() be looking at the Additional Sense Code?
>
The error recovery code in CAM already cross-references the ASC/ASCQ against an action table, but that table is often incomplete for uncommon edge cases. Try the following:
RCS file: /usr1/ncvs/src/sys/cam/scsi/scsi_all.c,v
retrieving revision 1.55.2.3
diff -u -r1.55.2.3 scsi_all.c
--- scsi_all.c 14 Feb 2010 19:38:27 -0000 1.55.2.3
+++ scsi_all.c 16 Jun 2010 23:31:47 -0000
@@ -1962,7 +1962,7 @@
{ SST(0x4D, 0xFF, SS_RDEF | SSQ_RANGE,
NULL) }, /* Range 0x00->0xFF */
/* DTLPWROMAEBKVF */
- { SST(0x4E, 0x00, SS_RDEF,
+ { SST(0x4E, 0x00, SS_FATAL | ENXIO,
"Overlapped commands attempted") },
/* T */
{ SST(0x50, 0x00, SS_RDEF,
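
For anyone unfamiliar with how such a table drives recovery, the following is a simplified, standalone sketch of an ASC/ASCQ lookup. It is not the scsi_all.c implementation: the ex_ names and the struct layout are hypothetical, and the only entry is the one changed by the patch above. Unlisted codes fall back to a retry, which stands in for the sense key's default action.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical recovery actions, loosely modelled on retry vs. fatal. */
enum ex_action { EX_RETRY, EX_FAIL };

struct ex_sense_entry {
	uint8_t		 asc;
	uint8_t		 ascq;
	enum ex_action	 action;
	int		 error;		/* errno handed back when action is EX_FAIL */
	const char	*desc;
};

/* Illustrative table holding only the entry touched by the patch. */
static const struct ex_sense_entry ex_table[] = {
	{ 0x4E, 0x00, EX_FAIL, ENXIO, "Overlapped commands attempted" },
};

/* Linear search for an exact ASC/ASCQ match; NULL if not listed. */
static const struct ex_sense_entry *
ex_lookup(const struct ex_sense_entry *table, size_t n,
    uint8_t asc, uint8_t ascq)
{
	assert(table != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (table[i].asc == asc && table[i].ascq == ascq)
			return (&table[i]);
	}
	return (NULL);
}

/* Decide what to do with a failed command given its ASC/ASCQ. */
static enum ex_action
ex_error_action(uint8_t asc, uint8_t ascq)
{
	const struct ex_sense_entry *e;

	e = ex_lookup(ex_table, sizeof(ex_table) / sizeof(ex_table[0]),
	    asc, ascq);
	/* Unlisted codes keep the default behaviour: retry. */
	return (e != NULL ? e->action : EX_RETRY);
}
```

With the entry marked fatal, ex_error_action(0x4E, 0x00) returns EX_FAIL instead of scheduling another retry, which is the behaviour change the patch is after.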
Scott