Adaptec aac(4) FIB starvation issues on BUS scan
Alexander Sack
pisymbol at gmail.com
Fri Dec 4 23:05:30 UTC 2009
Hey Everyone:
I am running into several issues with the Adaptec aac(4) driver and FIB
starvation during the probe path in xpt_bus_scan(). I am currently on a
6.1-amd64 machine with two Adaptec 5085 (6-channel) controllers connected
to external JBODs, as well as an MFI controller for the OS itself. Though
this is legacy 6.1, I suspect (I have yet to test!) that this also applies
to HEAD, RELENG_7-8, etc.
Let me explain:
aac_cam_action() (aac_cam.c):
250 case XPT_PATH_INQ:
251 {
252 struct ccb_pathinq *cpi = &ccb->cpi;
253
254 cpi->version_num = 1;
255 cpi->hba_inquiry = PI_WIDE_16;
256 cpi->target_sprt = 0;
257
258 /* Resetting via the passthrough causes problems. */
259 cpi->hba_misc = PIM_NOBUSRESET;
260 cpi->hba_eng_cnt = 0;
261 cpi->max_target = camsc->inf->TargetsPerBus;
The number of FIBs allocated to this card is 512 (256 on older cards). The
max_target per bus is 287 (0x11F). On a six-channel controller with the bus
scan done in parallel, I see a lot of this:
...
(probe501:aacp1:0:214:0): Request Requeued
(probe501:aacp1:0:214:0): Retrying Command
(probe520:aacp1:0:233:0): Request Requeued
(probe520:aacp1:0:233:0): Retrying Command
(probe528:aacp1:0:241:0): Request Requeued
(probe528:aacp1:0:241:0): Retrying Command
(probe540:aacp1:0:253:0): Request Requeued
(probe540:aacp1:0:253:0): Retrying Command
(probe541:aacp1:0:254:0): Request Requeued
(probe541:aacp1:0:254:0): Retrying Command
....
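To put the numbers above side by side, here is a rough back-of-the-envelope sketch. It assumes each of the six channels is scanned as its own bus, that targets 0 through max_target are all probed, and that every outstanding probe holds one FIB. The helper name and the one-FIB-per-probe model are illustrative assumptions, not taken from the driver.

```c
#include <assert.h>

/*
 * Hypothetical helper: how many probe requests a fully parallel scan
 * could have outstanding beyond what the FIB pool can hold.
 * Assumes one FIB per outstanding probe (an assumption, not driver fact).
 */
static unsigned int
probes_over_fib_limit(unsigned int buses, unsigned int max_target,
    unsigned int total_fibs)
{
	unsigned int probes;

	assert(buses > 0);
	/* Targets are numbered 0..max_target, hence the +1. */
	probes = buses * (max_target + 1);
	return ((probes > total_fibs) ? probes - total_fibs : 0);
}
```

Under those assumptions, probes_over_fib_limit(6, 287, 512) compares 6 * 288 = 1728 potential probes against 512 FIBs, so well over half of the requests would have nowhere to go if they all arrived at once.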
Now, this occurs in the normal XPT_SCSI_IO request path while the probe
INQUIRYs are being issued (I didn't check, but it comes out of the
cam_periph_alloc()/probeXXX() stuff; I'm still debugging it). The driver
runs out of FIBs, as per:
311 /* Async ops that require communcation with the controller */
312
313 mtx_lock(&sc->aac_io_lock);
314 if (aac_alloc_command(sc, &cm)) {
<-- WE ARE HERE: aac_alloc_command() returned EBUSY
315 struct aac_event *event;
316
317 xpt_freeze_simq(sim, 1);
318 ccb->ccb_h.status = CAM_REQUEUE_REQ;
319 xpt_done(ccb);
320 event = malloc(sizeof(struct aac_event), M_AACCAM,
321 M_NOWAIT | M_ZERO);
322 if (event == NULL) {
323 device_printf(sc->aac_dev,
324 "Warning, out of memory for event\n");
325 /* XXX Yuck, what to do here? */
326 mtx_unlock(&sc->aac_io_lock);
327 return;
328 }
329 event->ev_callback = aac_cam_event;
330 event->ev_arg = camsc;
331 event->ev_type = AAC_EVENT_CMFREE;
332 aac_add_event(sc, event);
333 mtx_unlock(&sc->aac_io_lock);
334 return;
335 }
The aac_alloc_command() function tries to grab a free FIB that has been
released back to the driver, but none are available (aac_dequeue_free
returned NULL). So it attempts to wake up its internal worker kthread,
aac_command_thread(), to go allocate more FIBs. In the meantime, the code
above freezes the SIMQ, sets CAM_REQUEUE_REQ (unconditionally forcing CAM
to retry the request), and then allocates an internal event which will
release the SIMQ, provided the driver can allocate a fresh batch of FIBs
(iff it is under the maximum allowed by the controller).
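The freeze-and-requeue pattern described above can be sketched as a small userland toy model. This is only an illustration under simplifying assumptions: the struct and all toy_* names are hypothetical, there is no locking, and the queue is unfrozen as soon as any command becomes available again, which glosses over how the real driver grows its pool.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's command pool state. */
struct toy_pool {
	int free_cmds;		/* commands currently on the free list */
	int simq_frozen;	/* nonzero while the queue is frozen */
	int event_pending;	/* nonzero if a "command free" event is queued */
};

/* Returns 0 on success, EBUSY when the free list is empty. */
static int
toy_alloc_command(struct toy_pool *p)
{
	assert(p != NULL);
	if (p->free_cmds == 0)
		return (EBUSY);
	p->free_cmds--;
	return (0);
}

/*
 * Models the submit path: on EBUSY, freeze the queue and register an
 * event. Returns 1 if the request has to be requeued, 0 if it was sent.
 */
static int
toy_submit(struct toy_pool *p)
{
	if (toy_alloc_command(p) != 0) {
		p->simq_frozen = 1;
		p->event_pending = 1;
		return (1);
	}
	return (0);
}

/* Models a command becoming available: fire the event, unfreeze. */
static void
toy_release_command(struct toy_pool *p)
{
	assert(p != NULL);
	p->free_cmds++;
	if (p->event_pending) {
		p->event_pending = 0;
		p->simq_frozen = 0;
	}
}
```

In this model a requeued request only makes progress once toy_release_command() runs; if nothing ever returns a command to the pool, the queue stays frozen, which is the kind of stall the questions below are concerned with.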
Questions:
1) What is the downside of this change:
259 cpi->hba_misc = PIM_NOBUSRESET | PIM_SEQSCAN;
This makes the BUS scan an ORDER OF MAGNITUDE faster with no forced
retries. I mean it. Instead of waiting for many many seconds, I wait for
less than a second for targets to come back.
2) Why is CAM_REQUEUE_REQ appropriate? Isn't the driver banking on the
fact that more FIBs will be allocated, when at this point it has already
hit a resource starvation issue? Wouldn't something like CAM_RESRC_UNAVAIL
be better, to throttle jobs and give time for either FIBs to be released
back or the command thread to allocate more? I am going to test this
change anyway.
3) On a similar topic to question 2, FIBs are indeed preallocated:
579 while (sc->total_fibs < AAC_PREALLOCATE_FIBS) {
580 if (aac_alloc_commands(sc) != 0)
581 break;
582 }
AAC_PREALLOCATE_FIBS is set to 128. Is there any reason not to preallocate
all of them, or at least HALF of them (256 of 512)? :D FIBs are 2k in size,
so 256 of them, 512k, is not THAT much to ask of the kernel these days on
most modern systems. But this change does not solve problem one (there are
still too many requests).
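For reference, the memory cost of the preallocation options works out as follows. This is a trivial sketch that takes the 2k FIB size quoted above as exactly 2048 bytes and ignores any per-command bookkeeping or DMA alignment overhead; the macro and function names are made up for the example.

```c
#include <stddef.h>

/* "2k" per the text above; assumed to be exactly 2048 bytes. */
#define TOY_FIB_SIZE_BYTES	2048u

/* Hypothetical helper: KiB needed to preallocate nfibs FIBs. */
static size_t
toy_fib_prealloc_kib(unsigned int nfibs)
{
	return (((size_t)nfibs * TOY_FIB_SIZE_BYTES) / 1024u);
}
```

With these assumptions, 128 FIBs cost 256k, 256 FIBs cost 512k, and the full 512 cost 1024k.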
I want to point out that we've seen hangs during probe in our internal labs,
but they are very hard to reproduce (I'm pretty sure now that it's due to
commands eternally being retried by CAM because of FIB resource starvation,
but I am still trying to prove it).
Any feedback would be most welcome,
-aps