Marginal disks prevent boot with mps(4)

Dustin Wenz dustinwenz at ebureau.com
Fri Jun 15 23:06:54 UTC 2012


I just received an SFF-8088 to SFF-8087 cable via FedEx this morning, which allowed me to continue isolating this problem.

What I discovered is that it makes no difference whether a bad disk is connected through an expander or directly to the HBA. So, if this is a hardware bug, it must be present in the LSI SAS2008-based HBA that I'm using. I also upgraded the firmware on the card from v11.00.00.00 to v13.00.57.00, which is the latest as far as I am aware. That did not seem to change the behavior.

I did notice that, earlier during startup, this message appears a page or so before the endless ioc messages start:
	mps0: polling failed
	mpssas_get_sata_identify: poll for page completed with error 60
	_mapping_get_dev_info: failed to compute the hashed SAS address for SATA device with handle 0x0009

It seems that the driver knows something is up even before it gets stuck later on.

So far, the only way I can get this configuration to boot is to change the status returned for MPI2_IOCSTATUS_SCSI_IOC_TERMINATED from CAM_REQUEUE_REQ to CAM_REQ_CMP_ERR, as Ken mentioned. With that change the machine still reports some "ioc terminated" messages, but the startup process no longer hangs indefinitely. However, I'm not sure what the implications of making that change on a production machine would be.
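
To illustrate why that status mapping decides whether the boot hangs, here is a minimal user-space sketch in C. It is not the mps(4) or CAM source: the enum and the function probe_attempts() are hypothetical names made up for this example, and it models only the behavior Ken described, assuming the firmware terminates every attempt. A requeue status retries the command without touching the retry count, while an error status uses up one retry each time.

```c
/* Hypothetical stand-ins for the CAM status codes named in the thread. */
enum sim_status { SIM_REQ_CMP, SIM_REQ_CMP_ERR, SIM_REQUEUE_REQ };

/*
 * Model a probe command that the HBA firmware terminates on every attempt.
 * 'mapped' is the status the driver reports for an IOC-terminated command.
 * Returns the number of attempts made before the probe finished, or -1 if
 * it was still retrying after 'cap' attempts (the cap exists only so the
 * simulation itself terminates).
 */
static int probe_attempts(enum sim_status mapped, int retry_count, int cap)
{
    int attempts = 0;

    while (attempts < cap) {
        attempts++;
        if (mapped == SIM_REQ_CMP)
            return attempts;      /* command completed normally */
        if (mapped == SIM_REQUEUE_REQ)
            continue;             /* retried; retry count is not decremented */
        if (retry_count == 0)
            return attempts;      /* retries exhausted: the probe fails */
        retry_count--;            /* error status consumes one retry */
    }
    return -1;                    /* never gave up within the cap */
}
```

In this model, probe_attempts(SIM_REQ_CMP_ERR, 4, 1000) gives up after 5 attempts (the initial try plus 4 retries), whereas probe_attempts(SIM_REQUEUE_REQ, 4, 1000) returns -1 because the retry count is never consumed. The retry count of 4 is an arbitrary value chosen for the example.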

If this is LSI's problem, I don't see why they would bother to fix it. As far as I know, they are the only 6Gb SAS/SATA HBA vendor that works on FreeBSD. We have no choice but to buy their stuff, even if it's not robust.

	- .Dustin

On Jun 8, 2012, at 4:53 PM, Kenneth D. Merry wrote:

> On Fri, Jun 08, 2012 at 16:25:31 -0500, Dustin Wenz wrote:
>> I just installed a build of 9.0-STABLE in order to test the changes since release. I was hoping that some of the error-handling in mps would alter the behavior I've seen with some SATA disks (particularly, Seagate ST3000DM001 disks) connected through an LSI SAS 9201-16e HBA.
>> 
> 
> Are you using an expander, or are the disks connected directly to the HBA?
> 
> What firmware version are you using on the HBA?  Make sure you have the
> latest firmware version on the card.
> 
>> It is apparently possible for these disks to get in a state where their presence prevents the machine from booting. This problem has existed for some time, according to some archive-searching I've done, but there isn't much consensus on how to fix it.
>> 
>> The disks are good enough that they can be probed at startup, but some part of initialization cannot complete. This is the message I see repeated forever upon boot (the probe number does change slightly):
>> 
>> 	(probe14:mps0:0:14:0): INQUIRY. CDB: 12 0 0 0 24 0 length 36 SMID 215 terminated ioc 804b scsi 0 state c xfer 0
>> 
>> There is a comment in mps_sas.c which suggests that this error is usually transient, but that seems not to be the case here. Can anyone suggest a modification that might permit booting in this state?
>> 
> 
> There is not a lot that the driver can do in this case.  The command is
> getting terminated by the firmware in the HBA, and we really don't have a
> lot of information to indicate why.
> 
> You could change the status returned for MPI2_IOCSTATUS_SCSI_IOC_TERMINATED
> to CAM_REQ_CMP_ERR, and that would just mean that the probe for that disk
> would eventually fail and the kernel would boot.  CAM_REQUEUE_REQ tells
> CAM to retry the command without decrementing the retry count.  That is
> why you aren't able to boot.
> 
> If upgrading the HBA firmware doesn't fix the problem, I would suggest
> contacting LSI support to see if they can get additional diagnostics off
> the board to figure out what the problem is.
> 
> Ken
> -- 
> Kenneth Merry
> ken at FreeBSD.ORG



More information about the freebsd-scsi mailing list