mps/LSI SAS2008 controller crashes when smartctl is run with upped disk tags

Peter Maloney peter.maloney at brockmann-consult.de
Tue Nov 1 19:18:07 UTC 2011


Dear Jason,

Am 01.11.2011 19:13, schrieb Jason Wolfe:
> Hello,
>
> I have an issue with the mps driver on 8.2 where running 'smartctl -a'
> rarely causes the controller to freak out when disk tags are > 2.  I've
> confirmed settings the tags to 1 resolves this crash, so that surely is a
> clue in the right direction..  I'm using Seagate 1TB SAS drives -
> ST91000640SS, and these are SuperMicro X8DTT-H chasis.  This happens across
> over a thousand servers, so it surely not flaky hardware.  It could
> obviously be some interoperability with these model drives and the mps
> controller, but unfortunately I don't have any other drives deployed on
> these cards to test that theory out :/
I get a simlar problem on a system with an LSI 9211-8i with 20 SATA
disks attached (2 SSDs and 18 spnning disks). My system doesn't hang,
panic, or reset though. I just lose access to one disk, which is then
considered FAULTED in my zpool status (with the ZFS file system). If I
physically remove the FAULTED disk and run "gpart recover da0", I get a
panic. Otherwise, the system keeps running in a degraded state.  When I
reboot and resilver, some data is found damaged and repaired, not just
refreshed with the latest state. The server has 1 HBA and 2 backplanes,
and I have the 2 mirrored root disks on different backplanes. Maybe that
is why mine runs degraded and yours hang.

This happened twice so far (in around a month or two), and both times it
was one of the mirrored root disks (SSDs) that faulted.

My tags are set to 255. I will try reproducing it as you said, and then
if it fails, rebooting and trying again setting tags to 2 as you suggested.

And *thank you very much for this information*. This is the last
outstanding issue with this server. I hope this workaround helps.

# camcontrol tags /dev/da0
(pass0:mps0:0:7:0): device openings: 255

>
> Luckily remote syslogging is enabled, so while nothing is kept locally, we
> see these messages similar to these transmitted before the server hangs,
> requiring a power cycle:
>
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID
> 510
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID
> 713
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID
> 942
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID
> 356
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID
> 492
> (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID
> 976
> (da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID
> 339
> (da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID
> 746
> (da5:mps0:0:6:0): SCSI command timeout on device handle 0x000f SMID 74
> (da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID
> 613
> (da2:mps0:0:3:0): SCSI command timeout on device handle 0x000c SMID 16
> (da10:mps0:0:11:0): SCSI command timeout on device handle 0x0014 SMID
> 305
> (da1:mps0:0:2:0): SCSI command timeout on device handle 0x000b SMID 74
> (da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID
> 594
>
> In some cases that would be followed by this, which would usually be the
> last transmission, though we don't see this in all cases.  It may just be
> the system isn't always alive long enough to transmit:
>
> kernel: mps0: IOC Fault 0x40006003, Resetting
>
>
> I'm able to reproduce fairly easily within a minute or two by heavily
> loading the disks up by whatever means, and running smartctl -a in a loop:
>
> #!/bin/sh -x
>
> disks=`sysctl -n kern.disks|xargs -n1|grep ^da`
>
> for disk in $disks; do
> camcontrol tags $disk -N 4
> done
>
> for z in `yes|head -100`; do
> for disk in $disks; do
> smartctl -s on -a /dev/$disk
> done
> done
>
> mps0: <LSI SAS2008> port 0xe000-0xe0ff mem
> 0xfbd3c000-0xfbd3ffff,0xfbd40000-0xfbd7ffff irq 26 at device 0.0 on pci4
> mps0: Firmware: 07.00.00.00
> mps0: IOCCapabilities:
> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
> mps0: [ITHREAD]
> da0 at mps0 bus 0 scbus0 target 1 lun 0
> da1 at mps0 bus 0 scbus0 target 2 lun 0
> da2 at mps0 bus 0 scbus0 target 3 lun 0
> da3 at mps0 bus 0 scbus0 target 4 lun 0
> da4 at mps0 bus 0 scbus0 target 5 lun 0
> da5 at mps0 bus 0 scbus0 target 6 lun 0
> da6 at mps0 bus 0 scbus0 target 7 lun 0
> da7 at mps0 bus 0 scbus0 target 8 lun 0
> da8 at mps0 bus 0 scbus0 target 9 lun 0
> da9 at mps0 bus 0 scbus0 target 10 lun 0
> da10 at mps0 bus 0 scbus0 target 11 lun 0
> da11 at mps0 bus 0 scbus0 target 12 lun 0
> ses0 at mps0 bus 0 scbus0 target 13 lun 0
>
> mps0 at pci0:4:0:0: class=0x010700 card=0x040015d9 chip=0x00721000 rev=0x02
> hdr=0x00
> vendor = 'LSI Logic (Was: Symbios Logic, NCR)'
> class = mass storage
> subclass = SAS
>
> <SEAGATE ST91000640SS 0001> at scbus0 target 1 lun 0 (pass0,da0)
> <SEAGATE ST91000640SS 0001> at scbus0 target 2 lun 0 (pass1,da1)
> <SEAGATE ST91000640SS 0001> at scbus0 target 3 lun 0 (pass2,da2)
> <SEAGATE ST91000640SS 0001> at scbus0 target 4 lun 0 (pass3,da3)
> <SEAGATE ST91000640SS 0001> at scbus0 target 5 lun 0 (pass4,da4)
> <SEAGATE ST91000640SS 0001> at scbus0 target 6 lun 0 (pass5,da5)
> <SEAGATE ST91000640SS 0001> at scbus0 target 7 lun 0 (pass6,da6)
> <SEAGATE ST91000640SS 0001> at scbus0 target 8 lun 0 (pass7,da7)
> <SEAGATE ST91000640SS 0001> at scbus0 target 9 lun 0 (pass8,da8)
> <SEAGATE ST91000640SS 0001> at scbus0 target 10 lun 0 (pass9,da9)
> <SEAGATE ST91000640SS 0001> at scbus0 target 11 lun 0 (pass10,da10)
> <SEAGATE ST91000640SS 0001> at scbus0 target 12 lun 0 (pass11,da11)
> <LSI CORP SAS2X28 0717> at scbus0 target 13 lun 0 (ses0,pass12)
>
> Thank you sirs,
>
> Jason Wolfe
> _______________________________________________
> freebsd-scsi at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"


And my logs to compare:

(note my root, swap, zfs cache, and zfs log is on the disk that fails).
)

root at bcnas1:/var/log# swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/gpt/swap0     524288     5840   518448     1%
/dev/gpt/swap1     524288     5640   518648     1%
Total             1048576    11480  1037096     1%

When it starts happening, it looks like this:
Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 220
Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 87
Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 795
Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 423
Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 338
Oct 29 02:19:12 bcnas1 kernel: :9:0): SCSI command timeout on device
handle 0x0016 SMID 170
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 637
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 335
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 798
Oct 29 02:19:12 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0
state c xfer 0
Oct 29 02:19:12 bcnas1 last message repeated 14 times
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 991 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 4
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 4 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 227
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 227 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 652
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 652 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 125
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 125 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 101
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 1017 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 100
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 1004 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 487
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 487 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 279
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 279 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 929
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 929 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 346
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 346 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 817
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 817 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 170
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 170 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 637
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 637 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 335
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 335 complete
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 798
Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort
request on handle 0x16 SMID 798 complete
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 757
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 833
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 804
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 464
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 144
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 912
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 753
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 422
Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout
on device handle 0x0016 SMID 241

And then just before I rebooted it, basically looked the same, with the
different messages mixed together:

ct 31 07:52:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 807
ct 31 07:52:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 807 complete
ct 31 07:53:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 1006
ct 31 07:53:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 111
ct 31 07:53:20 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0
state c xfer 0
ct 31 07:53:20 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 1006 complete
ct 31 07:53:20 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 111
ct 31 07:53:20 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 111 complete
ct 31 07:54:20 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 669
ct 31 07:54:20 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 912
ct 31 07:54:28 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0
state c xfer 0
ct 31 07:54:28 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 669 complete
ct 31 07:54:28 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 912
ct 31 07:54:28 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 912 complete
ct 31 07:55:29 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 804
ct 31 07:55:29 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 1001
ct 31 07:55:36 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0
state c xfer 0
ct 31 07:55:36 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 804 complete
ct 31 07:55:36 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 1001
ct 31 07:55:36 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 1001 complete
ct 31 07:56:36 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 389
ct 31 07:56:36 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 885
ct 31 07:56:44 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0
state c xfer 0
ct 31 07:56:44 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 389 complete
ct 31 07:56:44 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending
deferred task management request for handle 0x16 SMID 885
ct 31 07:56:44 bcnas1 kernel: swap_pager: I/O error - pageout failed;
blkno 131393,size 65536, error 5
ct 31 07:56:44 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 885 complete
ct 31 07:57:45 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 442
ct 31 07:57:48 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 442 complete
ct 31 07:58:49 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 413
ct 31 07:58:52 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 413 complete
ct 31 07:59:53 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 90
ct 31 07:59:56 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 90 complete
ct 31 08:00:56 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 504
ct 31 08:01:00 bcnas1 kernel: mps0: mpssas_abort_complete: abort request
on handle 0x16 SMID 504 complete
ct 31 08:02:01 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on
device handle 0x0016 SMID 861
ct 31 08:02:04 bcnas1 kernel: mps0: swap_pager: I/O error - pageout
failed; blkno 131409,size 49152, error 5mpssas_abort_complete: abort
request on handle 0x16
SMID 861 complete



More information about the freebsd-scsi mailing list