mps/LSI SAS2008 controller crashes when smartctl is run with upped disk tags
Jason Wolfe
nitroboost at gmail.com
Tue Nov 1 18:42:03 UTC 2011
Hello,
I have an issue with the mps driver on 8.2 where running 'smartctl -a'
rarely causes the controller to freak out when disk tags are > 2. I've
confirmed that setting the tags to 1 resolves this crash, so that surely is a
clue in the right direction. I'm using Seagate 1TB SAS drives
(ST91000640SS) in SuperMicro X8DTT-H chassis. This happens across
over a thousand servers, so it is surely not flaky hardware. It could
obviously be some interoperability issue between this drive model and the mps
controller, but unfortunately I don't have any other drives deployed on
these cards to test that theory out :/
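For reference, the workaround amounts to something like the sketch below. The helper names da_disks and set_tags are made up for illustration, and the disk list is passed in as an argument so the filtering can be checked on any machine. The only real commands involved are 'sysctl -n kern.disks' and 'camcontrol tags <dev> -N <n>', the same ones the reproduction script uses.

```sh
#!/bin/sh
# Sketch only: da_disks and set_tags are hypothetical helper names.

# Print the da(4) devices from a whitespace-separated disk list, one per line.
da_disks() {
    printf '%s\n' $1 | grep '^da'
}

# Set the number of tags for each listed disk.
# With DRY_RUN=1 the camcontrol commands are printed instead of executed.
set_tags() {
    tags=$1
    shift
    for disk in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "camcontrol tags $disk -N $tags"
        else
            camcontrol tags "$disk" -N "$tags"
        fi
    done
}

# Usage on an affected box (left commented out so the file is safe to source):
#   set_tags 1 $(da_disks "$(sysctl -n kern.disks)")
```

Running it with DRY_RUN=1 first shows which disks would be touched before anything is changed.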
Luckily remote syslogging is enabled, so while nothing is kept locally, we
see messages similar to these transmitted before the server hangs,
requiring a power cycle:
(da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 510
(da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 713
(da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 942
(da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 356
(da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 492
(da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 976
(da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID 339
(da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID 746
(da5:mps0:0:6:0): SCSI command timeout on device handle 0x000f SMID 74
(da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID 613
(da2:mps0:0:3:0): SCSI command timeout on device handle 0x000c SMID 16
(da10:mps0:0:11:0): SCSI command timeout on device handle 0x0014 SMID 305
(da1:mps0:0:2:0): SCSI command timeout on device handle 0x000b SMID 74
(da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID 594
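As an aside, to see how the timeouts are spread across disks, the messages can be tallied per device with something like the sketch below. count_timeouts is a made-up helper name, and the pattern assumes the message format shown in the log lines above.

```sh
#!/bin/sh
# Sketch only: count_timeouts is a hypothetical helper name.
# Reads syslog text on stdin and prints "<device> <count>" for every da
# device that logged "SCSI command timeout", sorted by device name.
count_timeouts() {
    sed -n 's/.*(\(da[0-9]*\):mps[0-9]*:.*SCSI command timeout.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage (the log path is an assumption; point it at wherever the remote
# syslog server stores these messages):
#   count_timeouts < /var/log/messages
```

Lines that are not timeout messages are ignored, so the whole log can be fed in unfiltered.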
In some cases those timeouts would be followed by this, which would usually
be the last transmission, though we don't see it in all cases. It may just be
that the system isn't always alive long enough to transmit it:
kernel: mps0: IOC Fault 0x40006003, Resetting
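If it helps anyone reading along, the fault value can be split into its fields with a sketch like the one below. This assumes the driver is printing the raw MPI2 doorbell register, which I have not verified against the driver source. Under that assumption, and going by the masks in mpi2.h, bits 31:28 hold the IOC state and bits 15:0 hold the fault code. decode_ioc_fault is a made-up helper name.

```sh
#!/bin/sh
# Sketch only: decode_ioc_fault is a hypothetical helper name.
# Assumption: the logged value is the raw MPI2 doorbell register, with the
# IOC state in bits 31:28 and the fault code in bits 15:0.
decode_ioc_fault() {
    reg=$(( $1 ))
    state=$(( reg & 0xF0000000 ))   # IOC state field
    code=$(( reg & 0x0000FFFF ))    # fault code field
    printf 'state=0x%08x code=0x%04x\n' "$state" "$code"
}

# Usage:
#   decode_ioc_fault 0x40006003
```

For the value above this gives state 0x40000000 (the fault state) and fault code 0x6003. What that code means is firmware-specific and not something I can interpret here.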
I'm able to reproduce fairly easily within a minute or two by heavily
loading the disks up by whatever means, and running smartctl -a in a loop:
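#!/bin/sh -x
# Collect the da(4) disks attached to the system.
disks=`sysctl -n kern.disks|xargs -n1|grep ^da`
# Raise the tag count on every disk to 4.
for disk in $disks; do
    camcontrol tags $disk -N 4
done
# Query SMART data on every disk, 100 passes in total.
for z in `yes|head -100`; do
    for disk in $disks; do
        smartctl -s on -a /dev/$disk
    done
done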
mps0: <LSI SAS2008> port 0xe000-0xe0ff mem 0xfbd3c000-0xfbd3ffff,0xfbd40000-0xfbd7ffff irq 26 at device 0.0 on pci4
mps0: Firmware: 07.00.00.00
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
mps0: [ITHREAD]
da0 at mps0 bus 0 scbus0 target 1 lun 0
da1 at mps0 bus 0 scbus0 target 2 lun 0
da2 at mps0 bus 0 scbus0 target 3 lun 0
da3 at mps0 bus 0 scbus0 target 4 lun 0
da4 at mps0 bus 0 scbus0 target 5 lun 0
da5 at mps0 bus 0 scbus0 target 6 lun 0
da6 at mps0 bus 0 scbus0 target 7 lun 0
da7 at mps0 bus 0 scbus0 target 8 lun 0
da8 at mps0 bus 0 scbus0 target 9 lun 0
da9 at mps0 bus 0 scbus0 target 10 lun 0
da10 at mps0 bus 0 scbus0 target 11 lun 0
da11 at mps0 bus 0 scbus0 target 12 lun 0
ses0 at mps0 bus 0 scbus0 target 13 lun 0
mps0@pci0:4:0:0: class=0x010700 card=0x040015d9 chip=0x00721000 rev=0x02 hdr=0x00
vendor = 'LSI Logic (Was: Symbios Logic, NCR)'
class = mass storage
subclass = SAS
<SEAGATE ST91000640SS 0001> at scbus0 target 1 lun 0 (pass0,da0)
<SEAGATE ST91000640SS 0001> at scbus0 target 2 lun 0 (pass1,da1)
<SEAGATE ST91000640SS 0001> at scbus0 target 3 lun 0 (pass2,da2)
<SEAGATE ST91000640SS 0001> at scbus0 target 4 lun 0 (pass3,da3)
<SEAGATE ST91000640SS 0001> at scbus0 target 5 lun 0 (pass4,da4)
<SEAGATE ST91000640SS 0001> at scbus0 target 6 lun 0 (pass5,da5)
<SEAGATE ST91000640SS 0001> at scbus0 target 7 lun 0 (pass6,da6)
<SEAGATE ST91000640SS 0001> at scbus0 target 8 lun 0 (pass7,da7)
<SEAGATE ST91000640SS 0001> at scbus0 target 9 lun 0 (pass8,da8)
<SEAGATE ST91000640SS 0001> at scbus0 target 10 lun 0 (pass9,da9)
<SEAGATE ST91000640SS 0001> at scbus0 target 11 lun 0 (pass10,da10)
<SEAGATE ST91000640SS 0001> at scbus0 target 12 lun 0 (pass11,da11)
<LSI CORP SAS2X28 0717> at scbus0 target 13 lun 0 (ses0,pass12)
Thank you sirs,
Jason Wolfe