Device timeouts(?) with LSI SAS3008 on mpr(4)

Tue Jul 7 15:37:26 UTC 2015

Hi Yamagi,

I see two drives that are having problems.  Are there others?  Can you try
removing those drives and let me know what happens?  To me, it actually
looks like those drives could be faulty.
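
To answer the "are there others?" question across all three machines, the
log can be tallied per device.  What follows is only a minimal sketch, not
a tool from the driver: the function name tally() and both regular
expressions are my own, and they assume the message formats are exactly as
they appear in the quoted dmesg excerpt below.

```python
import re
from collections import Counter

# Patterns are assumptions modelled on the quoted dmesg excerpt.
TIMEOUT_RE = re.compile(
    r"\((da\d+):mpr\d+:\d+:\d+:\d+\):\s+CAM status:\s+Command timeout"
)
RESET_RE = re.compile(
    r"(mpr\d+): Sending reset from mprsas_send_abort for target ID (\d+)"
)


def tally(log_text):
    """Count command timeouts per da device and resets per (controller, target)."""
    # Collapse all whitespace first so messages that were wrapped across
    # lines (as in mail archives) still match.
    flat = " ".join(log_text.split())
    timeouts = Counter(m.group(1) for m in TIMEOUT_RE.finditer(flat))
    resets = Counter(
        (m.group(1), int(m.group(2))) for m in RESET_RE.finditer(flat)
    )
    return timeouts, resets
```

Feeding it the saved kernel log, for example
tally(open("/var/log/messages").read()), gives one counter keyed by device
name and one keyed by (controller, target ID).  If the counts are spread
evenly over all SSDs rather than concentrated on a few, that would argue
against individual faulty drives.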

Steve

> -----Original Message-----
> From: owner-freebsd-scsi at freebsd.org [mailto:owner-freebsd-scsi at freebsd.org] On Behalf Of Yamagi Burmeister
> Sent: Tuesday, July 07, 2015 5:24 AM
> To: freebsd-scsi at freebsd.org
> Subject: Device timeouts(?) with LSI SAS3008 on mpr(4)
>
> Hello,
> I've got 3 new Supermicro servers based upon the X10DRi-LN4+ platform.
> Each server is equipped with 2 LSI SAS9300-8i-SQL SAS adapters. Each
> adapter serves 8 Intel DC S3700 SSDs. The operating system is
> 10.1-STABLE as of r283938 on 2 servers and r285196 on the last one.
>
> The controllers identify themselves as:
>
> ----
>
> mpr0: <Avago Technologies (LSI) SAS3008> port 0x6000-0x60ff mem 0xc7240000-0xc724ffff,0xc7200000-0xc723ffff irq 32 at device 0.0 on pci2
> mpr0: IOCFacts  :
>         MsgVersion: 0x205
>         HeaderVersion: 0x2300
>         IOCNumber: 0
>         IOCExceptions: 0x0
>         MaxChainDepth: 128
>         NumberOfPorts: 1
>         RequestCredit: 10240
>         ProductID: 0x2221
>         IOCRequestFrameSize: 32
>         MaxInitiators: 32
>         MaxTargets: 1024
>         MaxSasExpanders: 42
>         MaxEnclosures: 43
>         HighPriorityCredit: 128
>         MaxReplyDescriptorPostQueueDepth: 65504
>         ReplyFrameSize: 32
>         MaxVolumes: 0
>         MaxDevHandle: 1106
>         MaxPersistentEntries: 128
> mpr0: Firmware: 08.00.00.00, Driver: 09.255.01.00-fbsd
> mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
>
> ----
>
> 08.00.00.00 is the latest available firmware.
>
>
> Since day one, 'dmesg' has been cluttered with CAM errors:
>
> ----
>
> mpr1: Sending reset from mprsas_send_abort for target ID 5
>         (da11:mpr1:0:5:0): WRITE(10). CDB: 2a 00 4c 15 1f 88 00 00 08 00 length 4096 SMID 554 terminated ioc 804b scsi 0 state c xfer 0
> (da11:mpr1:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 506 ter(da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
> minated ioc 804b scsi 0 state c xfer 0
> (da11:mpr1:0:5:0): CAM status: Command timeout
> mpr1: (da11:Unfreezing devq for target ID 5
> mpr1:0:5:0): Retrying command
> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
> (da11:mpr1:0:5:0): CAM status: SCSI Status Error
> (da11:mpr1:0:5:0): SCSI status: Check Condition
> (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da11:mpr1:0:5:0): Retrying command (per sense data)
> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 22 b5 b8 00 00 18 00
> (da11:mpr1:0:5:0): CAM status: SCSI Status Error
> (da11:mpr1:0:5:0): SCSI status: Check Condition
> (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da11:mpr1:0:5:0): Retrying command (per sense data)
> (noperiph:mpr1:0:4294967295:0): SMID 2 Aborting command 0xfffffe0001601a30
>
> mpr1: Sending reset from mprsas_send_abort for target ID 2
>         (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00 length 24576 SMID 898 terminated ioc 804b scsi 0 state c xfer 0
> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 77 cc e0 00 00 18 00 length 12288 SMID 604 terminated ioc 804b scsi 0 state c xfer 0
> mpr1: Unfreezing devq for target ID 2
> (da8:mpr1:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
> (da8:mpr1:0:2:0): CAM status: Command timeout
> (da8:mpr1:0:2:0): Retrying command
> (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00
> (da8:mpr1:0:2:0): CAM status: SCSI Status Error
> (da8:mpr1:0:2:0): SCSI status: Check Condition
> (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da8:mpr1:0:2:0): Retrying command (per sense data)
> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 41 3d 08 00 00 10 00
> (da8:mpr1:0:2:0): CAM status: SCSI Status Error
> (da8:mpr1:0:2:0): SCSI status: Check Condition
> (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da8:mpr1:0:2:0): Retrying command (per sense data)
> (noperiph:mpr1:0:4294967295:0): SMID 3 Aborting command 0xfffffe000160b660
>
> ----
>
> ZFS doesn't like this and sees read errors or even write errors. In
> extreme cases the device is marked as FAULTED:
>
> ----
>
>   pool: examplepool
>  state: DEGRADED
> status: One or more devices are faulted in response to persistent errors.
> 	Sufficient replicas exist for the pool to continue functioning in a
> 	degraded state.
> action: Replace the faulted device, or use 'zpool clear' to mark the device
> 	repaired.
>   scan: none requested
> config:
>
> 	NAME        STATE     READ WRITE CKSUM
> 	examplepool DEGRADED     0     0     0
> 	  raidz1-0  ONLINE       0     0     0
> 	    da3p1   ONLINE       0     0     0
> 	    da4p1   ONLINE       0     0     0
> 	    da5p1   ONLINE       0     0     0
> 	logs
> 	  da1p1     FAULTED      3     0     0  too many errors
> 	cache
> 	  da1p2     FAULTED      3     0     0  too many errors
> 	spares
> 	  da2p1     AVAIL
>
> errors: No known data errors
>
> ----
>
> The problems arise on all 3 machines and on all SSDs nearly daily, so I
> strongly suspect a software issue. Does anyone have an idea what's going
> on and what I can do to solve these problems? More information can be
> provided if necessary.
>
> Regards,
> Yamagi
>
> --
> Homepage:  www.yamagi.org
> XMPP:      yamagi at yamagi.org
> GnuPG/GPG: 0xEFBCCBCB