terminated ioc 804b scsi 0 state c xfer 0

Dan Langille dan at langille.org
Tue May 31 16:20:16 UTC 2016


> On May 31, 2016, at 12:17 PM, Stephen McConnell <stephen.mcconnell at broadcom.com> wrote:
> 
> 
> 
>> -----Original Message-----
>> From: owner-freebsd-scsi at freebsd.org [mailto:owner-freebsd-
>> scsi at freebsd.org] On Behalf Of Dan Langille
>> Sent: Monday, May 30, 2016 12:28 PM
>> To: freebsd-scsi at freebsd.org
>> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
>> 
>>> On Apr 25, 2016, at 12:38 PM, Stephen McConnell
>> <stephen.mcconnell at broadcom.com> wrote:
>>> 
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: owner-freebsd-scsi at freebsd.org [mailto:owner-freebsd-
>>>> scsi at freebsd.org] On Behalf Of Dan Langille
>>>> Sent: Monday, April 25, 2016 9:40 AM
>>>> To: freebsd-scsi at freebsd.org
>>>> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
>>>> 
>>>>> On Apr 25, 2016, at 8:17 AM, Dan Langille <dan at langille.org> wrote:
>>>>> 
>>>>>> 
>>>>>> On Apr 24, 2016, at 9:35 AM, Dan Langille <dan at langille.org> wrote:
>>>>>> 
>>>>>> More of the pasted output is also at
>>>> https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae
>>>> <https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae>
>>>> and added smartctl output.
>>>>>> 
>>>>>> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI
>>>>>> SAS2008
>>>> card.  It's running a zfs root system.
>>>>>> 
>>>>>> This morning the system was unresponsive via ssh. Attempts to log
>>>>>> in at
>>>> the console did not yield a password prompt.
>>>>>> 
>>>>>> A power cycle brought the system online.  Inspecting
>>>>>> /var/log/messages,
>>> I
>>>> found about 63,000 entries similar to those which appear below.
>>>>>> 
>>>>>> zpool status of all are OK. A scrub is in progress for one pool
>>>>>> (since
>>> before
>>>> this issue arose). da7 is in that pool.
>>>>>> 
>>>>>> 
>>>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28
>>>>>> 00 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b
>>>>>> scsi
>>>>>> 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0):
>>>>>> READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614
>>>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
>>>>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00
>>>>>> 20
>>>>>> 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0
>>>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28
>>>>>> 00 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc
>>>>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew kernel:
> (da7:mps1:0:17:0):
>>>>>> READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674
>>>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
>>>>>> kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00
>>>>>> 18
>>>>>> 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer
>>>>>> 12288 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB:
>>>>>> 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc
>>>>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:56 knew kernel:
>>>>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00
>>>>>> length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0 Apr
>>>>>> 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b
>>>>>> d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b scsi
>>>>>> 0 state c xfer 0
>>>>>> 
>>>>>> Is this a cabling issue?  The drive is a SATA device (smartctl
>>>>>> output
>>> in the
>>>> URL above).  Anyone familiar with these errors?
>>>>> 
>>>>> This morning:
>>>>> 
>>>>> 13410079654596185797  REMOVED      0     0     0  was /dev/da7p3
>>>>> 
>>>>> At least I know i'm looking for Serial Number: 13Q8PNBYS
>>>>> 
>>>>> From the logs:
>>>>> 
>>>>> Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun
>>>>> 0 Apr 25 05:34:50 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0> s/n
>>>> 13Q8PNBYS detached
>> 
>> Just for the record, this happened again this morning. Fixed by power
> cycle.
>> 
>> May 30 03:22:08 knew kernel: mps1: mpssas_prepare_remove: Sending reset
>> for target ID 17 May 30 03:22:10 knew kernel: da7 at mps1 bus 0 scbus1
> target
>> 17 lun 0
>> May 30 03:22:10 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0> s/n
>> 13Q8PNBYS detached
>> May 30 03:22:10 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8c 5c
>> 91 c0 00 00 08 00 length 4096 SMID 179 terminated ioc 804b scsi 0 state c
> xfer
>> 0 May 30 03:22:10 knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 6b
>> bf db a0 00 00 f0 00 length 122880 SMID 938 terminated ioc 804b scsi 0
> state c
>> xf(da7:mps1:0:17:0): READ(10). CDB: 28 00 8c 5c 91 c0 00 00 08 00 May 30
>> 03:22:10 knew kernel: er 122880
>> 
> I just realized that you're using mps, not mpr.  The fix went into the mpr
> driver, but not mps yet.  It'll have to be ported over to mps.

This hit me again last night.  Same drive again.  Power cycle cleared it.

Now I'm wondering if it's heat or dud drive related.

> 
> Steve
> 
>>>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
>>>>> d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b scsi 0
>>>>> state c xfer 0 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0):
>>>>> READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204
>>>>> terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): READ(10).
> CDB:
>>> 28
>>>> 00 d8 33 53 e0 00 00 08 00 Apr 25 05:34:51 knew kernel: 0 Apr 25
> 05:34:51
>>>> knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue
>>>> Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB:
>> 28
>>>> 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc 804b
> scsi
>>> 0
>>>> state c xfer(da7: 0
>>>>> Apr 25 05:34:51 knew kernel: mps1:0:    (da7:mps1:0:17:0): READ(10).
>>> CDB:
>>>> 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc 804b
>>> scsi 0
>>>> state c xfer 17:0
>>>>> Apr 25 05:34:51 knew kernel: 0):        (da7:mps1:0:17:0): WRITE(10).
>>> CDB: 2a
>>>> 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc 804b
> scsi
>>> 0
>>>> state c xfeError 5, Periph was invalidated
>>>>> Apr 25 05:34:51 knew kernel: r 0
>>>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
>>>>> d8 33 26 f8 00 00 20 00 Apr 25 05:34:51 knew kernel:
>>>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 length
>>>>> 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer
>>>>> (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request Apr 25
>>>>> 05:34:51 knew kernel: 0 Apr 25 05:34:51 knew kernel: (da7:mps1:
>>>>> mps1:0:IOCStatus = 0x4b while resetting device 0xa Apr 25 05:34:51
>>>>> knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17 Apr 25
>>>>> 05:34:51 knew kernel: Error 5, Periph was invalidated Apr 25 05:34:51
>>>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 00
>>>>> 20 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
>>>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>>>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
>>>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 00
>>>>> 10 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
>>>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>>>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
>>>>> knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00
>>>>> 00 90 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
>>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 knew
>>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00
>>> Apr
>>>> 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally
>>> Re-
>>>> queue Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5,
>>>> Periph was invalidated Apr 25 05:34:51 knew kernel: GEOM_MIRROR:
>> Device
>>>> swap: provider da7p2 disconnected.
>>>>> Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS
> 'vdev
>>> is
>>>> removed, pool_guid=15378250086669402288
>>>> vdev_guid=13410079654596185797''
>>>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed Apr
>>>>> 25 05:34:51 knew ZFS: vdev is removed,
>> pool_guid=15378250086669402288
>>>>> vdev_guid=13410079654596185797
>>>> 
>>>> Current status: after powering off the box, reseating the cables for
> that
>>> drive, I
>>>> powered up the system and a resilver commenced which completed in 30
>>>> minutes.
>>>> 
>>>> Seems OK now.  I am not sure if the two events are related.
>>> 
>>> Recently, a bug was uncovered where a device is gets 'lost'.
>>> 
>>> Here's what happens:
>>> The first message in your "failure on Monday"  log is for
>>> 'mpssas_prepare_remove'.  This message is likely logged because the FW
>> sends
>>> an event to the driver that the device is no longer responsive (pulled,
>>> cable issue, or something else).  When the driver gets this event, it
> sends
>>> a reset to the device to clear out any pending I/O.  This is where the
>>> 'terminated ioc' messages come from.  When the reset completes, the
> driver
>>> is supposed to send a SAS_IO_UNIT message to FW so that the DevHandle
>> for
>>> that disk is removed from FW's list.  Then, when the device comes back
>>> on-line, everything is fine. But, with this bug, before that SAS_IO_UNIT
>>> message is sent to FW, the driver exits the function where that happens
>>> (mpssas_remove_device).  This happens where you see the log message,
>>> "IOCStatus - 0x4b while resetting device 0x0a".  The driver logs that
>>> message and then exits.  What the driver should do is log that message
> and
>>> continue on to send the SAS_IO_UNIT message to FW.  The fix is to remove
>> the
>>> two lines in the driver shown here with '>>':
>>> 
>>> 	if (le16toh(reply->IOCStatus) != MPI2_IOCSTATUS_SUCCESS) {
>>> 		mps_printf(sc, "IOCStatus = 0x%x while resetting device
>>> 0x%x\n",
>>> 		   le16toh(reply->IOCStatus), handle);
>>>>> 		mpssas_free_tm(sc, tm);
>>>>> 		return;
>>> 	}
>>> 
>>> A reboot will solve the problem, as you saw, but the real fix is to
> remove
>>> the DevHandle as described above.  This fix will go into the driver
> before
>>> the next scheduled release and then MFC'd to 10.x.
>>> 
>>> Steve
>>> 
>>>> 
>>>> --
>>>> Dan Langille - BSDCan / PGCon
>>>> dan at langille.org
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> freebsd-scsi at freebsd.org mailing list
>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"
>>> 
>> 
>> _______________________________________________
>> freebsd-scsi at freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"
> 



More information about the freebsd-scsi mailing list