Avago LSI SAS 3008 & Intel SSD Timeouts

list-news list-news at mindpackstudios.com
Tue Jun 7 19:53:27 UTC 2016


I don't believe the mainboard has any SATA ports.  It does have a PCIe 
slot IIRC though, and I may be able to rig something up with another LSI 
adapter I have lying around, if I can get it to fit and find a way to 
power the drives.

Although it seems unlikely to make a difference, unless you are seeing 
something I'm not?

With that last test: If it's the SAS controller, 3 different ones 
running two different firmware versions are all causing the issue.  If 
it's the backplane, I have now tested 3 of them as well, two of which I 
can confirm have different revision numbers.

Errors never appear with tags set to 1 for each drive (effectively 
disabling NCQ, as I understand it).  My rough understanding is that a 
higher tag count allows the SAS adapter to send more commands to the 
drive in parallel, letting the drive make the decisions about 
command ordering.  If that is accurate, and the controller firmware were 
at fault, I assume this would be a far more common bug that would have 
been fixed already.
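
For reference, the per-drive tag change can be sketched as a small loop. 
This is only a dry-run sketch: the function name tag_cmds and the device 
names da0..da2 are placeholders I made up for illustration, not the real 
pool members.  It relies on the "tags" subcommand of camcontrol(8) with 
-N, and only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Print one "camcontrol tags" command per drive (dry run, nothing is changed).
# tag_cmds is a hypothetical helper; device names are placeholders.
tag_cmds() {
    depth=$1        # number of tagged openings to allow per device
    shift
    for dev in "$@"; do
        echo "camcontrol tags $dev -N $depth"
    done
}

# Limit each listed drive to a single outstanding command.
tag_cmds 1 da0 da1 da2
```

Piping the output to sh as root would apply it; calling it with 255 
instead of 1 gives the setting that brings the errors back.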

On the other hand, suppose it only happens under heavy parallel 
SYNCHRONIZE CACHE commands on certain Intel SSDs, and only on controllers 
(maybe 12Gbps?) that can outrun the drive firmware or trigger a race 
condition (my suspicion here).  In that case it seems far more likely 
that it would have gone unnoticed by Intel.

-Kyle


On 6/7/16 2:02 PM, Steven Hartland wrote:
> Have you tried direct attaching the drives?
>
> On 07/06/2016 18:09, list-news wrote:
>> The system is a Twin.  In the first post I mentioned this but I 
>> probably wasn't clear.
>>
>> The twin unit is this one:
>> https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm
>>
>> I've used all components from twin node A and B (CPU / memory / 
>> mainboard / controller).  I still get the errors.  The backplane was 
>> my original suspect, and it has been RMA'd and replaced, but the 
>> errors continue.  I've even swapped out power supplies with another 
>> identical unit I have here.
>>
>> In every case the errors continue, until I do this:
>> # camcontrol tags daX -N 1
>> (for each drive in the zpool)
>>
>> Then the errors stop.
>>
>> The system reports errors every few minutes while my application is 
>> running.  Set tags to -N 1, and everything goes quiet: with 16 cores 
>> at 100% CPU and drives 80% busy at ~15k IOPS for about 5 hours solid 
>> until the batch finishes, no errors are reported.  If I set tags 
>> with -N 255 for each device, errors start again within 5 minutes 
>> and continue every 2-5 minutes until the batch is finished.
>>
>> -Kyle
>>
>>> I would try, if possible, to swap the controller.
>>>
>>> Borja.
>>>
>>>
>>
>> _______________________________________________
>> freebsd-scsi at freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"
