aac(4) handling of probe when no devices are there
Alexander Sack
pisymbol at gmail.com
Mon Dec 14 22:09:10 UTC 2009
On Mon, Dec 14, 2009 at 4:47 PM, Alexander Sack <pisymbol at gmail.com> wrote:
> Hello Again:
>
> I have a technical question/concern that I am looking for feedback
> on. During the probe sequence, aac(4) conditionally rewrites the
> result of INQUIRY commands depending on the target LUN:
>
> aac_cam.c/aac_cam_complete():
> 	if (command == INQUIRY) {
> 		if (ccb->ccb_h.status == CAM_REQ_CMP) {
> 			device = ccb->csio.data_ptr[0] & 0x1f;
> 			/*
> 			 * We want DASD and PROC devices to only be
> 			 * visible through the pass device.
> 			 */
> 			if ((device == T_DIRECT) ||
> 			    (device == T_PROCESSOR) ||
> 			    (sc->flags & AAC_FLAGS_CAM_PASSONLY))
> 				ccb->csio.data_ptr[0] =
> 				    ((device & 0xe0) | T_NODEVICE);
> 		} else if (ccb->ccb_h.status == CAM_SEL_TIMEOUT &&
> 		    ccb->ccb_h.target_lun != 0) {
> 			/* fix for INQUIRYs on Lun>0 */
> 			ccb->ccb_h.status = CAM_DEV_NOT_THERE;
> 		}
> 	}
>
> Why is CAM_DEV_NOT_THERE skipped on LUN 0? This is true on my target
> 6.1-amd64 machine as well as CURRENT. The reason I ask is that now
> that aac(4) is scanned sequentially, a lot of CAM interrupts come in
> on my 6.x machine, where the threshold is only 500, and I get the
> interrupt storm threshold warning for swi2 pretty quickly:
>
> Interrupt storm detected on "swi2:"; throttling interrupt source
>
> Obviously it's contingent on the number of adapters you have in your
> system. On CURRENT I didn't see this because the threshold is double
> (I think it's 1000 by default).
>
> The issue is the number of xpt_async(AC_LOST_DEVICE, ..) calls during
> the scan. The probe sequence in CURRENT as well as 6.1 handles
> CAM_SEL_TIMEOUT a little differently depending on context.
>
> scsi_xpt.c/probedone():
> 	} else if (cam_periph_error(done_ccb, 0,
> 	    done_ccb->ccb_h.target_lun > 0
> 	    ? SF_RETRY_UA|SF_QUIET_IR
> 	    : SF_RETRY_UA,
> 	    &softc->saved_ccb) == ERESTART) {
> 		return;
> 	} else if ((done_ccb->ccb_h.status & CAM_DEV_QFRZN) != 0) {
> 		/* Don't wedge the queue */
> 		xpt_release_devq(done_ccb->ccb_h.path, /*count*/1,
> 		    /*run_queue*/TRUE);
> 	}
> 	/*
> 	 * If we get to this point, we got an error status back
> 	 * from the inquiry and the error status doesn't require
> 	 * automatically retrying the command. Therefore, the
> 	 * inquiry failed. If we had inquiry information before
> 	 * for this device, but this latest inquiry command failed,
> 	 * the device has probably gone away. If this device isn't
> 	 * already marked unconfigured, notify the peripheral
> 	 * drivers that this device is no more.
> 	 */
> 	if ((path->device->flags & CAM_DEV_UNCONFIGURED) == 0)
> 		/* Send the async notification. */
> 		xpt_async(AC_LOST_DEVICE, path, NULL);
>
> 	xpt_release_ccb(done_ccb);
> 	break;
> }
>
> But cam_periph_error() will issue an xpt_async(AC_LOST_DEVICE,
> path, NULL) regardless of whether or not the device has been seen
> already (the condition the comment above describes), i.e. on every
> initial bus scan you will get into (on an aac(4) card with LUN > 0):
>
> cam_periph.c/cam_periph_error():
> 	case CAM_SEL_TIMEOUT:
> 	{
> 		.
> 		.
> 		/*
> 		 * Let peripheral drivers know that this device has gone
> 		 * away.
> 		 */
> 		xpt_async(AC_LOST_DEVICE, newpath, NULL);
> 		xpt_free_path(newpath);
> 		break;
>
> Is this really right? This generates A LOT of interrupt noise when no
> devices are attached during the initial scan, i.e. we are treating
> failed INQUIRY commands during the initial scan of the SCSI bus as if
> we had really lost a device to a selection timeout (we even generate
> a path to issue the async event).
I should have titled the thread a little better, but basically we
always generate a ton of software CAM interrupts during a LUN scan for
targets on aac(4) that do not really exist (i.e. nothing is truly
there). We do this because we treat a failed initial INQUIRY as a
selection timeout rather than as a device that is not there. There
seems to be a historical workaround for part of this issue, but I am
trying to delve deeper in order to do the *right thing* for our 6.1
deployments (as well as 7.x and CURRENT).
-aps