aac(4) handling of probe when no devices are there

Mon Dec 14 22:09:10 UTC 2009

On Mon, Dec 14, 2009 at 4:47 PM, Alexander Sack <pisymbol at gmail.com> wrote:
> Hello Again:
>
> I guess I have a technical question/concern on which I am looking for
> feedback.  During the probe sequence, aac(4) conditionally rewrites
> the result of INQUIRY commands depending on the target LUN:
>
> aac_cam.c/aac_cam_complete():
>         if (command == INQUIRY) {
>                 if (ccb->ccb_h.status == CAM_REQ_CMP) {
>                         device = ccb->csio.data_ptr[0] & 0x1f;
>                         /*
>                          * We want DASD and PROC devices to only be
>                          * visible through the pass device.
>                          */
>                         if ((device == T_DIRECT) ||
>                             (device == T_PROCESSOR) ||
>                             (sc->flags & AAC_FLAGS_CAM_PASSONLY))
>                                 ccb->csio.data_ptr[0] =
>                                     ((device & 0xe0) | T_NODEVICE);
>                 } else if (ccb->ccb_h.status == CAM_SEL_TIMEOUT &&
>                     ccb->ccb_h.target_lun != 0) {
>                         /* fix for INQUIRYs on Lun>0 */
>                         ccb->ccb_h.status = CAM_DEV_NOT_THERE;
>                 }
>         }
>
> Why is CAM_DEV_NOT_THERE skipped on LUN 0?  This is true on my target
> 6.1-amd64 machine as well as CURRENT.  The reason I ask is that, now
> that aac(4) is scanned sequentially, a lot of CAM interrupts come in
> on my 6.x machine, where the threshold is only 500, and I get the
> interrupt storm threshold warning for swi2 pretty quickly:
>
> Interrupt storm detected on "swi2:"; throttling interrupt source
>
> Obviously it's contingent on the number of adapters you have in your
> system.  On CURRENT I didn't see this because the threshold is double
> (I think it's 1000 by default).
>
> The issue is the number of xpt_async(AC_LOST_DEVICE, ..) calls during
> the scan.  The probe sequence in CURRENT as well as 6.1 handles
> CAM_SEL_TIMEOUT a little differently depending on context.
>
> scsi_xpt.c/probedone():
>         } else if (cam_periph_error(done_ccb, 0,
>                                     done_ccb->ccb_h.target_lun > 0
>                                     ? SF_RETRY_UA|SF_QUIET_IR
>                                     : SF_RETRY_UA,
>                                     &softc->saved_ccb) == ERESTART) {
>                 return;
>         } else if ((done_ccb->ccb_h.status & CAM_DEV_QFRZN) != 0) {
>                 /* Don't wedge the queue */
>                 xpt_release_devq(done_ccb->ccb_h.path, /*count*/1,
>                                  /*run_queue*/TRUE);
>         }
>         /*
>          * If we get to this point, we got an error status back
>          * from the inquiry and the error status doesn't require
>          * automatically retrying the command.  Therefore, the
>          * inquiry failed.  If we had inquiry information before
>          * for this device, but this latest inquiry command failed,
>          * the device has probably gone away.  If this device isn't
>          * already marked unconfigured, notify the peripheral
>          * drivers that this device is no more.
>          */
>         if ((path->device->flags & CAM_DEV_UNCONFIGURED) == 0)
>                 /* Send the async notification. */
>                 xpt_async(AC_LOST_DEVICE, path, NULL);
>
>         xpt_release_ccb(done_ccb);
>         break;
> }
>
> But cam_periph_error() will issue an xpt_async(AC_LOST_DEVICE,
> path, NULL) regardless of whether or not the device has been seen
> already (as per the comment above), i.e. on every initial bus scan,
> you will get into (on an aac(4) card with LUN > 0):
>
> cam_periph.c/cam_periph_error():
>         case CAM_SEL_TIMEOUT:
>         {
> .
> .
>                 /*
>                  * Let peripheral drivers know that this device has gone
>                  * away.
>                  */
>                 xpt_async(AC_LOST_DEVICE, newpath, NULL);
>                 xpt_free_path(newpath);
>                 break;
>
> Is this really right? This generates a lot of interrupt noise when no
> devices are attached during the initial scan, i.e. we are treating
> failed INQUIRY commands during the initial scan of the SCSI bus as if
> we had really lost a device to a selection timeout (we even generate a
> path to issue the async event).

I should have titled the thread a little better, but basically we
always generate a ton of software CAM interrupts during a LUN scan for
targets on aac(4) that do not really exist (i.e. nothing is truly
there).  We do this because we treat a failed initial INQUIRY as a
selection timeout rather than as a device that is not there.  There
seems to be a historical workaround for part of this issue, but I am
trying to delve deeper in order to do the *right thing* for our 6.1
deployments (as well as 7.x and CURRENT).
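
To make the LUN 0 question concrete, here is a small user-space sketch
(not a patch, and not kernel code).  It models the remap that
aac_cam_complete() does today next to a hypothetical variant that also
remaps LUN 0.  The enum values and the names remap_current(),
remap_all_luns() and count_sel_timeouts() are made up for illustration;
the real status codes live in the CAM headers.  The counting helper
assumes every target/LUN on an empty bus is probed and times out, which
may not match what the real scan does.  Whether remapping LUN 0 is
actually safe is exactly what I am asking.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-ins for the kernel's CAM status codes.  The real
 * definitions have different names and values.
 */
enum model_status {
	MODEL_REQ_CMP,
	MODEL_SEL_TIMEOUT,
	MODEL_DEV_NOT_THERE
};

typedef enum model_status (*remap_fn)(enum model_status, unsigned);

/* Behaviour quoted above: a selection timeout is remapped only on LUN > 0. */
static enum model_status
remap_current(enum model_status st, unsigned lun)
{
	if (st == MODEL_SEL_TIMEOUT && lun != 0)
		return MODEL_DEV_NOT_THERE;
	return st;
}

/* Hypothetical variant: remap a selection timeout on every LUN. */
static enum model_status
remap_all_luns(enum model_status st, unsigned lun)
{
	(void)lun;	/* LUN deliberately ignored in this variant */
	if (st == MODEL_SEL_TIMEOUT)
		return MODEL_DEV_NOT_THERE;
	return st;
}

/*
 * Count the INQUIRY completions that are still reported as a selection
 * timeout after the remap, assuming an empty bus where every probe of
 * every target/LUN times out.
 */
static unsigned
count_sel_timeouts(remap_fn remap, unsigned targets, unsigned luns)
{
	unsigned t, l, n = 0;

	assert(remap != NULL);
	for (t = 0; t < targets; t++)
		for (l = 0; l < luns; l++)
			if (remap(MODEL_SEL_TIMEOUT, l) == MODEL_SEL_TIMEOUT)
				n++;
	return n;
}
```

In this model, count_sel_timeouts(remap_current, targets, luns) leaves
one selection timeout per target (the LUN 0 probe), while
count_sel_timeouts(remap_all_luns, targets, luns) leaves none.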

-aps