Re: f0f3e3e961d3 - main - ipmi: use a queue for kcs driver requests when possible

From: Ravi Pokala <rpokala_at_freebsd.org>
Date: Tue, 01 Nov 2022 22:09:09 UTC
My understanding is that IPMB and KCS are different physical-layer interfaces to the BMC, and only one of them is active on any given motherboard. So if the kernel is using KCS, then so is userland. This can be grepped out of /var/run/dmesg.boot:

    ipmi0: <IPMI System Interface> port 0xca2,0xca3 on acpi0
    ipmi0: KCS mode found at io 0xca2 on acpi
    ipmi0: IPMI device rev. 1, firmware rev. 1.73, version 2.0, device support mask 0xbf
    ipmi0: Number of channels 2
    ipmi0: Attached watchdog
    ipmi0: Establishing power cycle handler

-Ravi (rpokala@)

-----Original Message-----
From: Chuck Silvers <chuq@chuq.com>
Date: 2022-11-01, Tuesday at 14:55
To: Ravi Pokala <rpokala@freebsd.org>
Cc: <src-committers@FreeBSD.org>, <dev-commits-src-all@FreeBSD.org>, <dev-commits-src-main@FreeBSD.org>
Subject: Re: f0f3e3e961d3 - main - ipmi: use a queue for kcs driver requests when possible

    Hi Ravi,

    the usual path (the IPMI_SYSTEM_INTERFACE_ADDR_TYPE case) for an "application" request
    (ie. one from userland) is:

    ipmi_polled_enqueue_request()
    sc->ipmi_enqueue_request()
    ipmi_ioctl()

    which uses the regular (low priority) queue.

    in the non-IPMI_SYSTEM_INTERFACE_ADDR_TYPE case (aka. the IPMB case),
    ipmi_ioctl() does issue two driver requests to set up the IPMB stuff before
    then issuing the real application request with sc->ipmi_enqueue_request() as above.
    I don't know much about this IPMB thing, it looks like a kind of pass-through
    interface for communicating with another device that sits behind the main
    BMC IPMI controller.

    I suppose that it would make sense to change these two IPMB setup driver requests
    to use the regular (low priority) queue (since sleeping is ok in that path),
    but I don't think I have any way to test that since our systems don't use this path.

    -Chuck


    On Tue, Nov 01, 2022 at 11:53:57AM -0700, Ravi Pokala wrote:
    > Hi Chuck,
    > 
    > As best as I can tell, the stack for handling a request from userland (e.g. `ipmitool') looks like this:
    > 
    > ipmi_polled_enqueue_request_highpri()
    > kcs_driver_request_queue() /* assuming not panicked or dumping core */
    > kcs_driver_request()
    > sc->ipmi_driver_request()
    > ipmi_submit_driver_request()
    > ipmi_ioctl()
    > 
    > And the stack for handling kernel-internal requests (i.e. watchdog) is essentially the same, except for calling ipmi_submit_driver_request() directly rather than going through ipmi_ioctl(). Specifically, there doesn't seem to be anything that would put the two different types of requests on two different queues; it looks like everything goes onto the highpri queue.
    > 
    > What am I missing?
    > 
    > Thanks,
    > 
    > Ravi (rpokala@)
    > 
    > -----Original Message-----
    > From: <owner-src-committers@freebsd.org> on behalf of Chuck Silvers <chs@FreeBSD.org>
    > Date: 2022-11-01, Tuesday at 10:55
    > To: <src-committers@FreeBSD.org>, <dev-commits-src-all@FreeBSD.org>, <dev-commits-src-main@FreeBSD.org>
    > Subject: git: f0f3e3e961d3 - main - ipmi: use a queue for kcs driver requests when possible
    > 
    >     The branch main has been updated by chs:
    > 
    >     URL: https://cgit.FreeBSD.org/src/commit/?id=f0f3e3e961d311f1cd938f1319385e7f454525f1
    > 
    >     commit f0f3e3e961d311f1cd938f1319385e7f454525f1
    >     Author:     Chuck Silvers <chs@FreeBSD.org>
    >     AuthorDate: 2022-11-01 17:55:14 +0000
    >     Commit:     Chuck Silvers <chs@FreeBSD.org>
    >     CommitDate: 2022-11-01 17:55:14 +0000
    > 
    >         ipmi: use a queue for kcs driver requests when possible
    > 
    >         The ipmi watchdog pretimeout action can trigger unintentionally in
    >         certain rare, complicated situations.  What we have seen at Netflix
    >         is that the BMC can sometimes be sent a continuous stream of
    >         writes to port 0x80, and due to what is a bug or misconfiguration
    >         in the BMC software, this results in the BMC running out of memory,
    >         becoming very slow to respond to KCS requests, and eventually being
    >         rebooted by its own internal watchdog.  While that is going on in
    >         the BMC, back in the host OS, a number of requests are pending in
    >         the ipmi request queue, and the kcs_loop thread is working on
    >         processing these requests.  All of the KCS accesses to process
    >         those requests are timing out and eventually failing because the
    >         BMC is responding very slowly or not at all, and the kcs_loop thread
    >         is holding the IPMI_IO_LOCK the whole time that is going on.
    >         Meanwhile the watchdogd process in the host is trying to pat the
    >         BMC watchdog, and this process is sleeping waiting to get the
    >         IPMI_IO_LOCK.  It's not entirely clear why the watchdogd process
    >         is sleeping for this lock, because the intention is that a thread
    >         holding the IPMI_IO_LOCK should not sleep and thus any thread
    >         that wants the lock should just spin to wait for it.  My best guess
    >         is that the kcs_loop thread is spinning waiting for the BMC to
    >         respond for so long that it is eventually preempted, and during
    >         the brief interval when the kcs_loop thread is not running,
    >         the watchdogd thread notices that the lock holder is not running
    >         and sleeps.  When the kcs_loop thread eventually finishes processing
    >         one request, it drops the IPMI_IO_LOCK and then immediately takes the
    >         lock again so it can process the next request in the queue.
    >         Because the watchdogd thread is sleeping at this point, the kcs_loop
    >         always wins the race to acquire the IPMI_IO_LOCK, thus starving
    >         the watchdogd thread.  The callout for the watchdog pretimeout
    >         would be reset by the watchdogd thread after its request to the BMC
    >         watchdog completes, but since that request never processed, the
    >         pretimeout callout eventually fires, even though there is nothing
    >         actually wrong with the host.
    > 
    >         To prevent this saga from unfolding:
    > 
    >          - when kcs_driver_request() is called in a context where it can sleep,
    >            queue the request and let the worker thread process it rather than
    >            trying to process in the original thread.
    >          - add a new high-priority queue for driver requests, so that the
    >            watchdog patting requests will be processed as quickly as possible
    >            even if lots of application requests have already been queued.
    > 
    >         With these two changes, the watchdog pretimeout action does not trigger
    >         even if the BMC is completely out to lunch for long periods of time
    >         (as long as the watchdogd check command does not also get stuck).
    > 
    >         Sponsored by:   Netflix
    >         Reviewed by:    imp
    >         Differential Revision:  https://reviews.freebsd.org/D36555
    >     ---
    >      sys/dev/ipmi/ipmi.c     | 33 ++++++++++++++++++++++++++++++---
    >      sys/dev/ipmi/ipmi_kcs.c | 28 +++++++++++++++++++++++++++-
    >      sys/dev/ipmi/ipmivars.h |  2 ++
    >      3 files changed, 59 insertions(+), 4 deletions(-)
    > 
    >     diff --git a/sys/dev/ipmi/ipmi.c b/sys/dev/ipmi/ipmi.c
    >     index d79690d55c68..b8705a81627b 100644
    >     --- a/sys/dev/ipmi/ipmi.c
    >     +++ b/sys/dev/ipmi/ipmi.c
    >     @@ -208,6 +208,15 @@ ipmi_dtor(void *arg)
    >      	IPMI_LOCK(sc);
    >      	if (dev->ipmi_requests) {
    >      		/* Throw away any pending requests for this device. */
    >     +		TAILQ_FOREACH_SAFE(req, &sc->ipmi_pending_requests_highpri, ir_link,
    >     +		    nreq) {
    >     +			if (req->ir_owner == dev) {
    >     +				TAILQ_REMOVE(&sc->ipmi_pending_requests_highpri, req,
    >     +				    ir_link);
    >     +				dev->ipmi_requests--;
    >     +				ipmi_free_request(req);
    >     +			}
    >     +		}
    >      		TAILQ_FOREACH_SAFE(req, &sc->ipmi_pending_requests, ir_link,
    >      		    nreq) {
    >      			if (req->ir_owner == dev) {
    >     @@ -579,13 +588,19 @@ ipmi_dequeue_request(struct ipmi_softc *sc)
    > 
    >      	IPMI_LOCK_ASSERT(sc);
    > 
    >     -	while (!sc->ipmi_detaching && TAILQ_EMPTY(&sc->ipmi_pending_requests))
    >     +	while (!sc->ipmi_detaching && TAILQ_EMPTY(&sc->ipmi_pending_requests) &&
    >     +	    TAILQ_EMPTY(&sc->ipmi_pending_requests_highpri))
    >      		cv_wait(&sc->ipmi_request_added, &sc->ipmi_requests_lock);
    >      	if (sc->ipmi_detaching)
    >      		return (NULL);
    > 
    >     -	req = TAILQ_FIRST(&sc->ipmi_pending_requests);
    >     -	TAILQ_REMOVE(&sc->ipmi_pending_requests, req, ir_link);
    >     +	req = TAILQ_FIRST(&sc->ipmi_pending_requests_highpri);
    >     +	if (req != NULL)
    >     +		TAILQ_REMOVE(&sc->ipmi_pending_requests_highpri, req, ir_link);
    >     +	else {
    >     +		req = TAILQ_FIRST(&sc->ipmi_pending_requests);
    >     +		TAILQ_REMOVE(&sc->ipmi_pending_requests, req, ir_link);
    >     +	}
    >      	return (req);
    >      }
    > 
    >     @@ -601,6 +616,17 @@ ipmi_polled_enqueue_request(struct ipmi_softc *sc, struct ipmi_request *req)
    >      	return (0);
    >      }
    > 
    >     +int
    >     +ipmi_polled_enqueue_request_highpri(struct ipmi_softc *sc, struct ipmi_request *req)
    >     +{
    >     +
    >     +	IPMI_LOCK_ASSERT(sc);
    >     +
    >     +	TAILQ_INSERT_TAIL(&sc->ipmi_pending_requests_highpri, req, ir_link);
    >     +	cv_signal(&sc->ipmi_request_added);
    >     +	return (0);
    >     +}
    >     +
    >      /*
    >       * Watchdog event handler.
    >       */
    >     @@ -817,6 +843,7 @@ ipmi_startup(void *arg)
    >      	mtx_init(&sc->ipmi_requests_lock, "ipmi requests", NULL, MTX_DEF);
    >      	mtx_init(&sc->ipmi_io_lock, "ipmi io", NULL, MTX_DEF);
    >      	cv_init(&sc->ipmi_request_added, "ipmireq");
    >     +	TAILQ_INIT(&sc->ipmi_pending_requests_highpri);
    >      	TAILQ_INIT(&sc->ipmi_pending_requests);
    > 
    >      	/* Initialize interface-dependent state. */
    >     diff --git a/sys/dev/ipmi/ipmi_kcs.c b/sys/dev/ipmi/ipmi_kcs.c
    >     index df3b37614eb7..5908ec88e039 100644
    >     --- a/sys/dev/ipmi/ipmi_kcs.c
    >     +++ b/sys/dev/ipmi/ipmi_kcs.c
    >     @@ -32,6 +32,7 @@ __FBSDID("$FreeBSD$");
    >      #include <sys/param.h>
    >      #include <sys/systm.h>
    >      #include <sys/bus.h>
    >     +#include <sys/conf.h>
    >      #include <sys/condvar.h>
    >      #include <sys/eventhandler.h>
    >      #include <sys/kernel.h>
    >     @@ -490,7 +491,21 @@ kcs_startup(struct ipmi_softc *sc)
    >      }
    > 
    >      static int
    >     -kcs_driver_request(struct ipmi_softc *sc, struct ipmi_request *req, int timo)
    >     +kcs_driver_request_queue(struct ipmi_softc *sc, struct ipmi_request *req, int timo)
    >     +{
    >     +	int error;
    >     +
    >     +	IPMI_LOCK(sc);
    >     +	ipmi_polled_enqueue_request_highpri(sc, req);
    >     +	error = msleep(req, &sc->ipmi_requests_lock, 0, "ipmireq", timo);
    >     +	if (error == 0)
    >     +		error = req->ir_error;
    >     +	IPMI_UNLOCK(sc);
    >     +	return (error);
    >     +}
    >     +
    >     +static int
    >     +kcs_driver_request_poll(struct ipmi_softc *sc, struct ipmi_request *req)
    >      {
    >      	int i, ok;
    > 
    >     @@ -504,6 +519,17 @@ kcs_driver_request(struct ipmi_softc *sc, struct ipmi_request *req, int timo)
    >      	return (req->ir_error);
    >      }
    > 
    >     +static int
    >     +kcs_driver_request(struct ipmi_softc *sc, struct ipmi_request *req, int timo)
    >     +{
    >     +
    >     +	if (KERNEL_PANICKED() || dumping)
    >     +		return (kcs_driver_request_poll(sc, req));
    >     +	else
    >     +		return (kcs_driver_request_queue(sc, req, timo));
    >     +}
    >     +
    >     +
    >      int
    >      ipmi_kcs_attach(struct ipmi_softc *sc)
    >      {
    >     diff --git a/sys/dev/ipmi/ipmivars.h b/sys/dev/ipmi/ipmivars.h
    >     index b0548ee3d7c3..1468e86be73b 100644
    >     --- a/sys/dev/ipmi/ipmivars.h
    >     +++ b/sys/dev/ipmi/ipmivars.h
    >     @@ -111,6 +111,7 @@ struct ipmi_softc {
    >      	uint8_t			ipmi_dev_support;	/* IPMI_ADS_* */
    >      	struct cdev		*ipmi_cdev;
    >      	TAILQ_HEAD(,ipmi_request) ipmi_pending_requests;
    >     +	TAILQ_HEAD(,ipmi_request) ipmi_pending_requests_highpri;
    >      	int			ipmi_driver_requests_polled;
    >      	eventhandler_tag	ipmi_power_cycle_tag;
    >      	eventhandler_tag	ipmi_watchdog_tag;
    >     @@ -237,6 +238,7 @@ void	ipmi_complete_request(struct ipmi_softc *, struct ipmi_request *);
    >      struct ipmi_request *ipmi_dequeue_request(struct ipmi_softc *);
    >      void	ipmi_free_request(struct ipmi_request *);
    >      int	ipmi_polled_enqueue_request(struct ipmi_softc *, struct ipmi_request *);
    >     +int	ipmi_polled_enqueue_request_highpri(struct ipmi_softc *, struct ipmi_request *);
    >      int	ipmi_submit_driver_request(struct ipmi_softc *, struct ipmi_request *,
    >      	    int);
    > 
    >