Low-level trace-buffers in CAM
Pokala, Ravi
rpokala at panasas.com
Tue Nov 4 20:20:20 UTC 2014
Re-posting something that I sent to freebsd-geom@ last night; I didn't see
a freebsd-cam@ list, but I was told this morning that CAM issues should go
to freebsd-scsi at . Also adding some known-interested parties.
Thanks,
Ravi
----------------
Hi folks,
At Panasas, we found it useful to have a trace buffer for each ATA drive
in the system, for gathering information about command sequences and
latency. I implemented that functionality for our old 7-ish ATA driver,
and it works quite well, with fairly low overhead. For example (sorry for
the really long lines...):
# Tabular format for use w/ other tools
# Request time ATA request information Response time ATA
response information sts/err flags
1415043680586349 25 0000 0200 000001b8c200 1415043680587357 25
0000 0200 000001b8c200 50 00 0000007d
1415043680587389 25 0000 0200 000001b8c400 1415043680588396 25
0000 0200 000001b8c400 50 00 0000007d
1415043680588430 25 0000 0200 000001b8c600 1415043680589406 25
0000 0200 000001b8c600 50 00 0000007d
1415043681203051 ca 0000 0008 000000000480 1415043681203279 ca
0000 0008 000000000480 50 00 0000007d
1415043681203284 ca 0000 0008 000000002f00 1415043681204022 ca
0000 0008 000000002f00 50 00 0000007d
1415043681596331 25 0000 0200 000001b8c800 1415043681597389 25
0000 0200 000001b8c800 50 00 0000007d
1415043681597423 25 0000 0200 000001b8ca00 1415043681598434 25
0000 0200 000001b8ca00 50 00 0000007d
1415043681598466 25 0000 0200 000001b8cc00 1415043681599484 25
0000 0200 000001b8cc00 50 00 0000007d
1415043681599517 25 0000 0200 000001b8ce00 1415043681600515 25
0000 0200 000001b8ce00 50 00 0000007d
1415043681934541 25 0000 0200 000000586600 1415043681935579 25
0000 0200 000000586600 50 00 0000007d
# Human-readable
FLAGS REQUEST TIME RESPONSE TIME
ELAPSED TIME REQUEST COMMAND
STATUS ERROR RESPONSE COMMAND
-------------------------------- -------------------- --------------------
--------------------
---------------------------------------------------------------------------
----- -------- --------
---------------------------------------------------------------------------
-----
____C...._V 1415043680586349 1415043680587357
1008 READ DMA EXT (LBA 28885504 + 512 sectors)
_R_S____ ________ ----
____C...._V 1415043680587389 1415043680588396
1007 READ DMA EXT (LBA 28886016 + 512 sectors)
_R_S____ ________ ----
____C...._V 1415043680588430 1415043680589406
976 READ DMA EXT (LBA 28886528 + 512 sectors)
_R_S____ ________ ----
____C...._V 1415043681203051 1415043681203279
228 WRITE DMA (LBA 1152 + 8 sectors)
_R_S____ ________ ----
____C...._V 1415043681203284 1415043681204022
738 WRITE DMA (LBA 12032 + 8 sectors)
_R_S____ ________ ----
____C...._V 1415043681596331 1415043681597389
1058 READ DMA EXT (LBA 28887040 + 512 sectors)
_R_S____ ________ ----
____C...._V 1415043681597423 1415043681598434
1011 READ DMA EXT (LBA 28887552 + 512 sectors)
_R_S____ ________ ----
____C...._V 1415043681598466 1415043681599484
1018 READ DMA EXT (LBA 28888064 + 512 sectors)
_R_S____ ________ ----
As we start moving toward a more modern base release, we want to
re-implement that functionality for ATA, SCSI, NVMe, etc. And push it
upsream.
The obvious place to put this sort of tracing would be in GEOM, but that
actually isn't what we want, for a few reasons. First of all, we're
primarily interested in the exact register / taskfile / CDB values sent to
the drive - opcode, feature codes, LBAs, etc. Second, we're interested in
hardware-level latency - how long does a request stay in the SIM and
PERIPH. Both of those pretty much require working below GEOM.
Therefore, I think we want to do it in the CAM stack; specifically, the
hooks would need to go into some of the SIMs. (Yes, I know, this requires
touching every SIM which we want to use. It's not optimal, but the hooks
should be pretty lightweight and easy to add, and nothing will break if a
given SIM isn't updated.)
I took our implementation, and pseudo-coded what it might look like in
CAM. (I could probably make diffs against the old ATA stack available for
comparison, but I figured no one would care anymore.) A huge caveat is
that I've never really poked in CAM before, so I fully expect I'm doing
several things wrong. Please educate me. :-)
================================
Create a circular buffer of trace records for each CAM PERIPH, getting
information as close to the device as possible. Each record will contain
the full CDB for the request as sent to the device, the timestamp (in
usec) of the request, the CDB as returned by the device, the timestamp of
the response, and the status and error code. Additionally, flags are
provided to track the state of the trace record, and to indicate which
type of periph the record is associated with.
typedef struct {
u_int32_t flags;
#define CAM_TRACE_F_VALID 0x00000001 /* valid record */
#define CAM_TRACE_F_INPROGRESS 0x00000002 /* request in progress */
#define CAM_TRACE_F_REQ_VALID 0x00000004 /* request data valid */
#define CAM_TRACE_F_RESP_VALID 0x00000008 /* response data valid */
#define CAM_TRACE_F_COMPLETE 0x00000010 /* response gathered */
#define CAM_TRACE_F_TIMEDOUT 0x00000020 /* request timed out */
#define CAM_TRACE_F_ABANDONED 0x00000040 /* cancelled, etc. */
#define CAM_TRACE_F_RETRIED 0x00000080 /* retry in later record */
#define CAM_TRACE_F_IS_RETRY 0x00000100 /* retry of earlier req. */
#define CAM_TRACE_F_PERIPH_MASK 0xf0000000 /* up to 16 types of PERIPHs */
#define CAM_TRACE_F_PERIPH_DA 0x0
#define CAM_TRACE_F_PERIPH_ADA 0x1
#define CAM_TRACE_F_PERIPH_NVD 0x2
u_int64_t req_usec;
cdb_t req_cdb;
u_int64_t resp_usec;
cdb_t resp_cdb;
u_int8_t resp_status;
u_int8_t resp_error;
u_int8_t resp_asc; /* Needed for SCSI? */
u_int8_t resp_asq; /* Needed for SCSI? */
u_int8_t padding[???]; /* Pad for alignment? */
/* There's going to be a large-ish array of these in the kernel; make sure
they're packed and aligned to minimize space usage and keep things
aligned. The original implementation was 48 bytes, which was already
aligned. */
} cam_trace_record_t __attribute__((packed, aligned(8)));
typedef struct {
u_int32_t trace_buf_record_count;
cam_trace_record_t *trace_buf;
} cam_trace_buffer_t;
#define CAM_TRACE_BUFFER_DEFAULT_COUNT 100000
----------------
ioctl interfaces are used to get the trace buffer size, to retrieve the
buffer contents, or to clear the buffer. The buffer can then be analyzed
from userspace, or written to a file for post-processing.
#define CAM_GET_TRACE_RECORD_COUNT _IOR(???, ???, u_int32_t)
#define CAM_GET_TRACE_BUFFER _IOWR(???, ???, cam_trace_buffer_t)
#define CAM_CLEAR_TRACE_BUFFER _IOW(???, ???, u_int32_t)
----------------
To get the buffer:
cam_trace_buffer_t trace = NULL;
fd = open(periph_name, O_RDONLY);
error = ioctl(fd, CAM_GET_TRACE_RECORD_COUNT,
&trace.trace_buf_record_count);
trace.trace_buf = malloc(record_count * sizeof (cam_trace_record_t));
error = ioctl(fd, CAM_GET_TRACE_BUFFER, &trace);
By including the full CDB, the entire command can be decoded by analysis
tools. Tools can be written to parse the buffer and translate command data
into human-readable descriptions, calculate elapsed times for requests,
perform statistical analysis based on opcode, command type, transfer size,
etc. Command decoders would need to be written for each PERIPH type (i.e.
one for ATA, one for SCSI, one for NVMe, etc.).
----------------
Each PERIPH's softc needs to include some information about the per-device
buffer.
u_int32_t trace_buf_record_count;
u_int64_t trace_buf_request_count;
cam_trace_record_t *trace_buf;
In each SIM, there should already be some sort of request structure, which
associates the softc and the CDB of the request. To that structure, add a
pointer to the trace record associated with the request.
request structure:
cam_trace_record_t *trace_rec;
----------------
When initializing the per-device softc, allocate a trace buffer. The
number of records can be hard-coded, or settable via a tunable.
/** Initialize per-drive command tracing buffer. **/
/* Get the buffer size from the tunable, if present. Sanity check it. */
sc->trace_buf_request_count = 0;
sc->trace_buf_record_count = CAM_TRACE_BUFFER_DEFAULT_COUNT;
TUNABLE_ULONG_FETCH("cam.trace_buf_record_count",
&trace_buf_record_count_tunable);
if (trace_buf_record_count_tunable) {
/* Less than 1K records is unusably few; more than 1M records is
unusably many. */
if ((trace_buf_record_count_tunable < (1 << 10)) ||
(trace_buf_record_count_tunable > (1 << 20))) {
device_printf(dev, "cam.trace_buf_record_count=%lu" " out of range;
should be between %u and %u\n", trace_buf_record_count_tunable, (1 << 10),
(1 << 20));
} else {
sc->trace_buf_record_count = (u_int32_t) trace_buf_record_count_tunable;
}
}
/* Allocate the buffer. */
trace_buf_bytes = sc->trace_buf_record_count * sizeof(cam_trace_record_t);
sc->trace_buf = malloc(trace_buf_bytes, M_AD, M_NOWAIT | M_ZERO);
if (sc->trace_buf == NULL) {
device_printf(dev, "unable to allocate %zd for trace_buf\n",
trace_buf_bytes);
sc->trace_buf_record_count = 0;
} else {
device_printf(dev, "allocated %zd @ %p for trace_buf\n",
trace_buf_bytes, sc->trace_buf);
}
----------------
/* Initial setup of the trace record for a new request.
*
* Locate the circular per-device trace buffer in the drive's softc.
* Populate the pointer to the next record in the trace buffer; this is
* done atomically to prevent concurrent requests from getting the same
* index. That could get a bit weird if/when 'trace_buf_request_count'
* rolls over, if the maximum counter value is not a multiple of
* 'trace_buf_record_count' (i.e. the request counter recycles in the
* middle of the buffer). By using a 64-bit request counter, we're pretty
* safe - even at one transaction per nanosecond, that will still last
* about 585 years.
* Initialize the record by zeroing it and then setting the VALID flag.
*
* This could be done at any point before sending the request to the
* hardware.
*/
void
trace_rec_init(cam_request_t req) {
ada_softc *sc = (request->...->softc)
u_int64_t tmp = 0;
if (sc && sc->trace_buf) {
tmp = atomic_fetchadd_log(&sc->trace_buf_request_count, 1);
req->trace_rec = &sc->trace_buf[tmp % sc->trace_buf_record_count];
bzero(req->trace_rec, sizeof(*req->trace_rec));
req->trace_rec->flags = (CAM_TRACE_F_PERIPH_XXX << 28);
req->trace_rec->flags |= CAM_TRACE_F_VALID;
} else {
req->trace_rec = NULL;
}
}
----------------
/* Stuff the request data into the trace record.
*
* If the trace record is present and valid, copy the CDB into the
* record. Set CAM_TRACE_F_REQ_VALID to indicate that the request data is
* valid, and INPROGRESS to show that the request is in-progress.
*
* This should be done just before sending the request to the hardware.
* This should *not* be fused into trace_rec_init(), because
* trace_rec_init() is also used in the retry case, where it is not
correct
* to immediately start the timer.
*/
void
trace_rec_req(cam_request_t req) {
struct timeval tv;
if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
bcopy(req->cdb, req->trace_rec->req_cdb, sizeof(req->cdb));
microuptime(&tv);
req->trace_rec->req_usec = ((uint64_t) tv->tv_sec * 1000000) +
tv->tv_usec;
req->trace_rec->flags |= (CAM_TRACE_F_REQ_VALID |
CAM_TRACE_F_INPROGRESS);
}
}
----------------
/* Stuff the response data into the trace record.
*
* If the trace record is present and valid, copy the CDB **as returned by
* the drive** into the record. At this point, we also have the result
from
* the command (status, error, (asc? ascq?)), so save them too. Set
* CAM_TRACE_F_RESP_VALID to indicate that the response data is valid.
* Clear INPROGRESS and set COMPLETE instead.
*
* This should be done just after getting the response information from
the
* hardware
*/
void
trace_rec_resp(cam_request_t req) {
struct timeval tv;
if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
bcopy(req->cdb, req->trace_rec->resp_cdb, sizeof(req->cdb));
microuptime(&tv);
req->trace_rec->resp_usec = ((uint64_t) tv->tv_sec * 1000000) +
tv->tv_usec;
req->trace_rec->flags |= CAM_TRACE_F_RESP_VALID | CAM_TRACE_F_COMPLETE);
req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
}
}
----------------
/* Update the trace record if there was a timeout.
*
* If the trace record is present and valid, set TIMEDOUT to indicate
* that the request associated with the record exceeded its timeout.
* Since this attempt is complete, clear INPROGRESS and set COMPLETE.
*
* This should be done along with other timeout handling.
*/
void
trace_rec_timeout(cam_request_t req) {
if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
req->trace_rec->flags |= (CAM_TRACE_F_TIMEDOUT | CAM_TRACE_F_COMPLETE);
}
}
----------------
/* Update the trace record if the request is being retried.
*
* If the trace record is present and valid, set RETRIED to indicate
* that the request associated with the record is going to be retried.
* Since this attempt is complete, clear INPROGRESS and set COMPLETE.
* Then get a new trace record for the new attempt, copy the request
* command data to the new record, and set IS_RETRY to show that this
* is a retry of an earlier request.
*
* This should be done at the time the retry is queued.
*/
void
trace_rec_retry(cam_request_t req) {
if (req->trace_rec && (req->trace_rec->flags & PAN_ATA_TRACE_F_VALID)) {
/* First, mark the original request as being requeued. */
req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
req->trace_rec->flags |= (CAM_TRACE_F_RETRIED | CAM_TRACE_F_COMPLETE);
/* Get a new record for the retry. */
trace_rec_init(req);
/* Mark the new record as being a retry. */
req->trace_rec->flags |= CAM_TRACE_F_IS_RETRY;
}
}
================================
I'd appreciate feedback, both on the implementation details and on the
idea as a whole. Some open questions are:
- My assumption is that each SIM has a structure that associates a
PERIPH's softc with the CDB that's being processed. It might require
dereferencing a chain of pointers, but it should be there in some fashion.
- The current implementation adds a few ioctls to the old ad(4)
disk-device driver. I would expect to add the new ioctls to da(4) and
ada(4); is that the right thing to do?
Again, any feedback would be greatly appreciated. I'm at the Dev/Vendor
Summit today and tomorrow if anyone wants to discuss this with me in
person.
Thanks,
Ravi
More information about the freebsd-scsi
mailing list