terminated ioc 804b scsi 0 state c xfer 0

Scott Long scott4long at yahoo.com
Mon Apr 25 17:32:57 UTC 2016


> On Apr 25, 2016, at 10:38 AM, Stephen McConnell via freebsd-scsi <freebsd-scsi at freebsd.org> wrote:
> 
> 
> 
>> -----Original Message-----
>> From: owner-freebsd-scsi at freebsd.org [mailto:owner-freebsd-
>> scsi at freebsd.org] On Behalf Of Dan Langille
>> Sent: Monday, April 25, 2016 9:40 AM
>> To: freebsd-scsi at freebsd.org
>> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
>> 
>>> On Apr 25, 2016, at 8:17 AM, Dan Langille <dan at langille.org> wrote:
>>> 
>>>> 
>>>> On Apr 24, 2016, at 9:35 AM, Dan Langille <dan at langille.org> wrote:
>>>> 
>>>> More of the pasted output is also at
>> https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae
>> <https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae>
>> and added smartctl output.
>>>> 
>>>> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI SAS2008
>> card.  It's running a zfs root system.
>>>> 
>>>> This morning the system was unresponsive via ssh. Attempts to log in at
>> the console did not yield a password prompt.
>>>> 
>>>> A power cycle brought the system online.  Inspecting /var/log/messages,
> I
>> found about 63,000 entries similar to those which appear below.
>>>> 
>>>> zpool status of all are OK. A scrub is in progress for one pool (since
> before
>> this issue arose). da7 is in that pool.
>>>> 
>>>> 
>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
>>>> 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b scsi
>>>> 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0):
>>>> READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614
>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
>>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 20
>>>> 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0
>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
>>>> 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc 804b
>>>> scsi 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0):
>>>> READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674
>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
>>>> kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00 18
>>>> 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer
>>>> 12288 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB:
>>>> 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc
>>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:56 knew kernel:
>>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00
>>>> length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0 Apr
>>>> 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b
>>>> d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b scsi 0
>>>> state c xfer 0
>>>> 
>>>> Is this a cabling issue?  The drive is a SATA device (smartctl output
> in the
>> URL above).  Anyone familiar with these errors?
>>> 
>>> This morning:
>>> 
>>> 13410079654596185797  REMOVED      0     0     0  was /dev/da7p3
>>> 
>>> At least I know i'm looking for Serial Number: 13Q8PNBYS
>>> 
>>> From the logs:
>>> 
>>> Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun 0
>>> Apr 25 05:34:50 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0> s/n
>> 13Q8PNBYS detached
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
>>> d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b scsi 0
>>> state c xfer 0 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0):
>>> READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204
>>> terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): READ(10). CDB:
> 28
>> 00 d8 33 53 e0 00 00 08 00 Apr 25 05:34:51 knew kernel: 0 Apr 25 05:34:51
>> knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue
>> Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28
>> 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc 804b scsi
> 0
>> state c xfer(da7: 0
>>> Apr 25 05:34:51 knew kernel: mps1:0:    (da7:mps1:0:17:0): READ(10).
> CDB:
>> 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc 804b
> scsi 0
>> state c xfer 17:0
>>> Apr 25 05:34:51 knew kernel: 0):        (da7:mps1:0:17:0): WRITE(10).
> CDB: 2a
>> 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc 804b scsi
> 0
>> state c xfeError 5, Periph was invalidated
>>> Apr 25 05:34:51 knew kernel: r 0
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
>>> d8 33 26 f8 00 00 20 00 Apr 25 05:34:51 knew kernel:
>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 length
>>> 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer
>>> (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request Apr 25
>>> 05:34:51 knew kernel: 0 Apr 25 05:34:51 knew kernel: (da7:mps1:
>>> mps1:0:IOCStatus = 0x4b while resetting device 0xa Apr 25 05:34:51
>>> knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17 Apr 25
>>> 05:34:51 knew kernel: Error 5, Periph was invalidated Apr 25 05:34:51
>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 00
>>> 20 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 00
>>> 10 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
>>> knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00
>>> 00 90 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 knew
>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00
> Apr
>> 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally
> Re-
>> queue Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5,
>> Periph was invalidated Apr 25 05:34:51 knew kernel: GEOM_MIRROR: Device
>> swap: provider da7p2 disconnected.
>>> Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS 'vdev
> is
>> removed, pool_guid=15378250086669402288
>> vdev_guid=13410079654596185797''
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed Apr
>>> 25 05:34:51 knew ZFS: vdev is removed, pool_guid=15378250086669402288
>>> vdev_guid=13410079654596185797
>> 
>> Current status: after powering off the box, reseating the cables for that
> drive, I
>> powered up the system and a resilver commenced which completed in 30
>> minutes.
>> 
>> Seems OK now.  I am not sure if the two events are related.
> 
> Recently, a bug was uncovered where a device is gets 'lost'.
> 
> Here's what happens:
> The first message in your "failure on Monday"  log is for
> 'mpssas_prepare_remove'.  This message is likely logged because the FW sends
> an event to the driver that the device is no longer responsive (pulled,
> cable issue, or something else).  When the driver gets this event, it sends
> a reset to the device to clear out any pending I/O.  This is where the
> 'terminated ioc' messages come from.  When the reset completes, the driver
> is supposed to send a SAS_IO_UNIT message to FW so that the DevHandle for
> that disk is removed from FW's list.  Then, when the device comes back
> on-line, everything is fine. But, with this bug, before that SAS_IO_UNIT
> message is sent to FW, the driver exits the function where that happens
> (mpssas_remove_device).  This happens where you see the log message,
> "IOCStatus - 0x4b while resetting device 0x0a".  The driver logs that
> message and then exits.  What the driver should do is log that message and
> continue on to send the SAS_IO_UNIT message to FW.  The fix is to remove the
> two lines in the driver shown here with '>>':
> 
> 	if (le16toh(reply->IOCStatus) != MPI2_IOCSTATUS_SUCCESS) {
> 		mps_printf(sc, "IOCStatus = 0x%x while resetting device
> 0x%x\n",
> 		   le16toh(reply->IOCStatus), handle);
>>> 		mpssas_free_tm(sc, tm);
>>> 		return;
> 	}
> 
> A reboot will solve the problem, as you saw, but the real fix is to remove
> the DevHandle as described above.  This fix will go into the driver before
> the next scheduled release and then MFC'd to 10.x.
> 

Thanks for the diagnosis, Steve.  I forgot about that case.  We should also make
this chain of events more evident in the syslog, it’s very confusing when it happens.
I’m not exactly sure yet what it should look like.

Scott




More information about the freebsd-scsi mailing list