Low-level trace-buffers in CAM

Tue Nov 4 20:20:20 UTC 2014

Re-posting something that I sent to freebsd-geom@ last night; I didn't see
a freebsd-cam@ list, but I was told this morning that CAM issues should go
to freebsd-scsi at . Also adding some known-interested parties.

Thanks,

Ravi
----------------

Hi folks,

At Panasas, we found it useful to have a trace buffer for each ATA drive
in the system, for gathering information about command sequences and
latency. I implemented that functionality for our old 7-ish ATA driver,
and it works quite well, with fairly low overhead. For example (sorry for
the really long lines...):

# Tabular format for use w/ other tools
# Request time       ATA request information   Response time        ATA
response information  sts/err  flags
1415043680586349     25 0000 0200 000001b8c200 1415043680587357     25
0000 0200 000001b8c200 50 00 0000007d
1415043680587389     25 0000 0200 000001b8c400 1415043680588396     25
0000 0200 000001b8c400 50 00 0000007d
1415043680588430     25 0000 0200 000001b8c600 1415043680589406     25
0000 0200 000001b8c600 50 00 0000007d
1415043681203051     ca 0000 0008 000000000480 1415043681203279     ca
0000 0008 000000000480 50 00 0000007d
1415043681203284     ca 0000 0008 000000002f00 1415043681204022     ca
0000 0008 000000002f00 50 00 0000007d
1415043681596331     25 0000 0200 000001b8c800 1415043681597389     25
0000 0200 000001b8c800 50 00 0000007d
1415043681597423     25 0000 0200 000001b8ca00 1415043681598434     25
0000 0200 000001b8ca00 50 00 0000007d
1415043681598466     25 0000 0200 000001b8cc00 1415043681599484     25
0000 0200 000001b8cc00 50 00 0000007d
1415043681599517     25 0000 0200 000001b8ce00 1415043681600515     25
0000 0200 000001b8ce00 50 00 0000007d
1415043681934541     25 0000 0200 000000586600 1415043681935579     25
0000 0200 000000586600 50 00 0000007d

# Human-readable
FLAGS                            REQUEST TIME         RESPONSE TIME
ELAPSED TIME         REQUEST COMMAND
                           STATUS   ERROR    RESPONSE COMMAND

-------------------------------- -------------------- --------------------
--------------------
---------------------------------------------------------------------------
----- -------- --------
---------------------------------------------------------------------------
-----
____C...._V                      1415043680586349     1415043680587357
1008                 READ DMA EXT (LBA 28885504 + 512 sectors)
                           _R_S____ ________ ----

____C...._V                      1415043680587389     1415043680588396
1007                 READ DMA EXT (LBA 28886016 + 512 sectors)
                           _R_S____ ________ ----

____C...._V                      1415043680588430     1415043680589406
976                  READ DMA EXT (LBA 28886528 + 512 sectors)
                           _R_S____ ________ ----

____C...._V                      1415043681203051     1415043681203279
228                  WRITE DMA (LBA 1152 + 8 sectors)
                           _R_S____ ________ ----

____C...._V                      1415043681203284     1415043681204022
738                  WRITE DMA (LBA 12032 + 8 sectors)
                           _R_S____ ________ ----

____C...._V                      1415043681596331     1415043681597389
1058                 READ DMA EXT (LBA 28887040 + 512 sectors)
                           _R_S____ ________ ----

____C...._V                      1415043681597423     1415043681598434
1011                 READ DMA EXT (LBA 28887552 + 512 sectors)
                           _R_S____ ________ ----

____C...._V                      1415043681598466     1415043681599484
1018                 READ DMA EXT (LBA 28888064 + 512 sectors)
                           _R_S____ ________ ----

As we start moving toward a  more modern base release, we want to
re-implement that functionality for ATA, SCSI, NVMe, etc. And push it
upsream.

The obvious place to put this sort of tracing would be in GEOM, but that
actually isn't what we want, for a few reasons. First of all, we're
primarily interested in the exact register / taskfile / CDB values sent to
the drive - opcode, feature codes, LBAs, etc. Second, we're interested in
hardware-level latency - how long does a request stay in the SIM and
PERIPH. Both of those pretty much require working below GEOM.

Therefore, I think we want to do it in the CAM stack; specifically, the
hooks would need to go into some of the SIMs. (Yes, I know, this requires
touching every SIM which we want to use. It's not optimal, but the hooks
should be pretty lightweight and easy to add, and nothing will break if a
given SIM isn't updated.)

I took our implementation, and pseudo-coded what it might look like in
CAM. (I could probably make diffs against the old ATA stack available for
comparison, but I figured no one would care anymore.) A huge caveat is
that I've never really poked in CAM before, so I fully expect I'm doing
several things wrong. Please educate me. :-)

================================

Create a circular buffer of trace records for each CAM PERIPH, getting
information as close to the device as possible. Each record will contain
the full CDB for the request as sent to the device, the timestamp (in
usec) of the request, the CDB as returned by the device, the timestamp of
the response, and the status and error code. Additionally, flags are
provided to track the state of the trace record, and to indicate which
type of periph the record is associated with.

typedef struct {
	u_int32_t	flags;
#define CAM_TRACE_F_VALID	0x00000001	/* valid record */
#define CAM_TRACE_F_INPROGRESS	0x00000002	/* request in progress */
#define CAM_TRACE_F_REQ_VALID	0x00000004	/* request data valid */
#define CAM_TRACE_F_RESP_VALID	0x00000008	/* response data valid */
#define CAM_TRACE_F_COMPLETE	0x00000010	/* response gathered */
#define CAM_TRACE_F_TIMEDOUT	0x00000020	/* request timed out */
#define CAM_TRACE_F_ABANDONED	0x00000040	/* cancelled, etc. */
#define CAM_TRACE_F_RETRIED	0x00000080	/* retry in later record */
#define CAM_TRACE_F_IS_RETRY	0x00000100	/* retry of earlier req. */
#define CAM_TRACE_F_PERIPH_MASK	0xf0000000	/* up to 16 types of PERIPHs */
#define CAM_TRACE_F_PERIPH_DA	0x0
#define CAM_TRACE_F_PERIPH_ADA	0x1
#define CAM_TRACE_F_PERIPH_NVD	0x2
	u_int64_t	req_usec;
	cdb_t		req_cdb;
	u_int64_t	resp_usec;
	cdb_t		resp_cdb;
	u_int8_t	resp_status;
	u_int8_t	resp_error;
	u_int8_t	resp_asc;	/* Needed for SCSI? */
	u_int8_t	resp_asq;	/* Needed for SCSI? */
	u_int8_t	padding[???];	/* Pad for alignment? */
/* There's going to be a large-ish array of these in the kernel; make sure
they're packed and aligned to minimize space usage and keep things
aligned. The original implementation was 48 bytes, which was already
aligned. */
} cam_trace_record_t __attribute__((packed, aligned(8)));

typedef struct {
	u_int32_t		trace_buf_record_count;
	cam_trace_record_t	*trace_buf;
} cam_trace_buffer_t;
#define CAM_TRACE_BUFFER_DEFAULT_COUNT 100000

----------------

ioctl interfaces are used to get the trace buffer size, to retrieve the
buffer contents, or to clear the buffer. The buffer can then be analyzed
from userspace, or written to a file for post-processing.

#define CAM_GET_TRACE_RECORD_COUNT	_IOR(???, ???, u_int32_t)
#define CAM_GET_TRACE_BUFFER		_IOWR(???, ???, cam_trace_buffer_t)
#define CAM_CLEAR_TRACE_BUFFER		_IOW(???, ???, u_int32_t)

----------------

To get the buffer:

	cam_trace_buffer_t trace = NULL;
	fd = open(periph_name, O_RDONLY);
	error = ioctl(fd, CAM_GET_TRACE_RECORD_COUNT,
&trace.trace_buf_record_count);
	trace.trace_buf = malloc(record_count * sizeof (cam_trace_record_t));
	error = ioctl(fd, CAM_GET_TRACE_BUFFER, &trace);

By including the full CDB, the entire command can be decoded by analysis
tools. Tools can be written to parse the buffer and translate command data
into human-readable descriptions, calculate elapsed times for requests,
perform statistical analysis based on opcode, command type, transfer size,
etc. Command decoders would need to be written for each PERIPH type (i.e.
one for ATA, one for SCSI, one for NVMe, etc.).

----------------

Each PERIPH's softc needs to include some information about the per-device
buffer.

	u_int32_t		trace_buf_record_count;
	u_int64_t		trace_buf_request_count;
	cam_trace_record_t	*trace_buf;

In each SIM, there should already be some sort of request structure, which
associates the softc and the CDB of the request. To that structure, add a
pointer to the trace record associated with the request.

request structure:
	cam_trace_record_t	*trace_rec;

----------------

When initializing the per-device softc, allocate a trace buffer. The
number of records can be hard-coded, or settable via a tunable.

	/** Initialize per-drive command tracing buffer. **/
	/* Get the buffer size from the tunable, if present. Sanity check it. */
	sc->trace_buf_request_count = 0;
	sc->trace_buf_record_count = CAM_TRACE_BUFFER_DEFAULT_COUNT;
	TUNABLE_ULONG_FETCH("cam.trace_buf_record_count",
&trace_buf_record_count_tunable);
	if (trace_buf_record_count_tunable) {
		/* Less than 1K records is unusably few; more than 1M records is
unusably many. */
		if ((trace_buf_record_count_tunable < (1 << 10)) ||
(trace_buf_record_count_tunable > (1 << 20))) {
			device_printf(dev, "cam.trace_buf_record_count=%lu" " out of range;
should be between %u and %u\n", trace_buf_record_count_tunable, (1 << 10),
(1 << 20));
		} else {
			sc->trace_buf_record_count = (u_int32_t) trace_buf_record_count_tunable;
		}
	}
	/* Allocate the buffer. */
	trace_buf_bytes = sc->trace_buf_record_count * sizeof(cam_trace_record_t);
	sc->trace_buf = malloc(trace_buf_bytes, M_AD, M_NOWAIT | M_ZERO);
	if (sc->trace_buf == NULL) {
		device_printf(dev, "unable to allocate %zd for trace_buf\n",
trace_buf_bytes);
		sc->trace_buf_record_count = 0;
	} else {
		device_printf(dev, "allocated %zd @ %p for trace_buf\n",
trace_buf_bytes, sc->trace_buf);
	}

----------------

/* Initial setup of the trace record for a new request.
 *
 *  Locate the circular per-device trace buffer in the drive's softc.
 *  Populate the pointer to the next record in the trace buffer; this is
 *  done atomically to prevent concurrent requests from getting the same
 *  index. That could get a bit weird if/when 'trace_buf_request_count'
 *  rolls over, if the maximum counter value is not a multiple of
 *  'trace_buf_record_count' (i.e. the request counter recycles in the
 *  middle of the buffer). By using a 64-bit request counter, we're pretty
 *  safe - even at one transaction per nanosecond, that will still last
 *  about 585 years.
 *  Initialize the record by zeroing it and then setting the VALID flag.
 *
 *  This could be done at any point before sending the request to the
 *  hardware.
 */
void
trace_rec_init(cam_request_t req) {
	ada_softc *sc = (request->...->softc)
	u_int64_t tmp = 0;
	if (sc && sc->trace_buf) {
		tmp = atomic_fetchadd_log(&sc->trace_buf_request_count, 1);
		req->trace_rec = &sc->trace_buf[tmp % sc->trace_buf_record_count];
		bzero(req->trace_rec, sizeof(*req->trace_rec));
		req->trace_rec->flags = (CAM_TRACE_F_PERIPH_XXX << 28);
		req->trace_rec->flags |= CAM_TRACE_F_VALID;
	} else {
		req->trace_rec = NULL;
	}
}

----------------

/* Stuff the request data into the trace record.
 *
 *  If the trace record is present and valid, copy the CDB into the
 *  record. Set CAM_TRACE_F_REQ_VALID to indicate that the request data is
 *  valid, and INPROGRESS to show that the request is in-progress.
 *
 *  This should be done just before sending the request to the hardware.
 *  This should *not* be fused into trace_rec_init(), because
 *  trace_rec_init() is also used in the retry case, where it is not
correct
 *  to immediately start the timer.
 */
void
trace_rec_req(cam_request_t req) {
	struct timeval tv;
	if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
		bcopy(req->cdb, req->trace_rec->req_cdb, sizeof(req->cdb));
		microuptime(&tv);
		req->trace_rec->req_usec = ((uint64_t) tv->tv_sec * 1000000) +
tv->tv_usec;
		req->trace_rec->flags |= (CAM_TRACE_F_REQ_VALID |
CAM_TRACE_F_INPROGRESS);
	}
}

----------------

/* Stuff the response data into the trace record.
 *
 *  If the trace record is present and valid, copy the CDB **as returned by
 *  the drive** into the record. At this point, we also have the result
from
 *  the command (status, error, (asc? ascq?)), so save them too. Set
 *  CAM_TRACE_F_RESP_VALID to indicate that the response data is valid.
 *  Clear INPROGRESS and set COMPLETE instead.
 *
 *  This should be done just after getting the response information from
the
 *  hardware
 */
void
trace_rec_resp(cam_request_t req) {
	struct timeval tv;
	if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
		bcopy(req->cdb, req->trace_rec->resp_cdb, sizeof(req->cdb));
		microuptime(&tv);
		req->trace_rec->resp_usec = ((uint64_t) tv->tv_sec * 1000000) +
tv->tv_usec;
		req->trace_rec->flags |= CAM_TRACE_F_RESP_VALID | CAM_TRACE_F_COMPLETE);
		req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
	}
}

----------------

/* Update the trace record if there was a timeout.
 *
 *  If the trace record is present and valid, set TIMEDOUT to indicate
 *  that the request associated with the record exceeded its timeout.
 *  Since this attempt is complete, clear INPROGRESS and set COMPLETE.
 *
 *  This should be done along with other timeout handling.
 */
void
trace_rec_timeout(cam_request_t req) {
	if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
		req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
		req->trace_rec->flags |= (CAM_TRACE_F_TIMEDOUT | CAM_TRACE_F_COMPLETE);
	}
}

----------------

/* Update the trace record if the request is being retried.
 *  
 *  If the trace record is present and valid, set RETRIED to indicate
 *  that the request associated with the record is going to be retried.
 *  Since this attempt is complete, clear INPROGRESS and set COMPLETE.
 *  Then get a new trace record for the new attempt, copy the request
 *  command data to the new record, and set IS_RETRY to show that this
 *  is a retry of an earlier request.
 *
 *  This should be done at the time the retry is queued.
 */
void
trace_rec_retry(cam_request_t req) {
	if (req->trace_rec && (req->trace_rec->flags & PAN_ATA_TRACE_F_VALID)) {
	/* First, mark the original request as being requeued. */
	req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
	req->trace_rec->flags |= (CAM_TRACE_F_RETRIED | CAM_TRACE_F_COMPLETE);
	/* Get a new record for the retry. */
	trace_rec_init(req);
	/* Mark the new record as being a retry. */
	req->trace_rec->flags |= CAM_TRACE_F_IS_RETRY;
	}
}

================================

I'd appreciate feedback, both on the implementation details and on the
idea as a whole. Some open questions are:

- My assumption is that each SIM has a structure that associates a
PERIPH's softc with the CDB that's being processed. It might require
dereferencing a chain of pointers, but it should be there in some fashion.

- The current implementation adds a few ioctls to the old ad(4)
disk-device driver. I would expect to add the new ioctls to da(4) and
ada(4); is that the right thing to do?

Again, any feedback would be greatly appreciated. I'm at the Dev/Vendor
Summit today and tomorrow if anyone wants to discuss this with me in
person.

Thanks,

Ravi