From nobody Tue Nov 01 21:55:51 2022 X-Original-To: dev-commits-src-main@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4N23k83mqJz4g80s; Tue, 1 Nov 2022 21:56:00 +0000 (UTC) (envelope-from chuq@chuq.com) Received: from vc.chuq.com (vc.chuq.com [166.84.7.169]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "Chuck Silvers", Issuer "Chuck Silvers" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 4N23k74KJdz3RQc; Tue, 1 Nov 2022 21:55:59 +0000 (UTC) (envelope-from chuq@chuq.com) Received: from spathi.chuq.com (spathi.chuq.com [98.51.95.93]) by vc169.vc.panix.com (Postfix) with ESMTPS id 49FE03B5D7B; Tue, 1 Nov 2022 14:55:53 -0700 (PDT) (envelope-from chuq@chuq.com) Received: by spathi.chuq.com (Postfix, from userid 1022) id 6F4B13CFFA24; Tue, 1 Nov 2022 14:55:51 -0700 (PDT) Date: Tue, 1 Nov 2022 14:55:51 -0700 From: Chuck Silvers To: Ravi Pokala Cc: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org Subject: Re: f0f3e3e961d3 - main - ipmi: use a queue for kcs driver requests when possible Message-ID: References: <202211011755.2A1HtMll099820@gitrepo.freebsd.org> <930FB7AD-A360-460E-9638-AEEDD85AF2DE@panasas.com> List-Id: Commit messages for the main branch of the src repository List-Archive: https://lists.freebsd.org/archives/dev-commits-src-main List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-dev-commits-src-main@freebsd.org X-BeenThere: dev-commits-src-main@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <930FB7AD-A360-460E-9638-AEEDD85AF2DE@panasas.com> X-Rspamd-Queue-Id: 4N23k74KJdz3RQc X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=none; spf=pass (mx1.freebsd.org: domain of chuq@chuq.com designates 166.84.7.169 as permitted sender) smtp.mailfrom=chuq@chuq.com X-Spamd-Result: default: False [-3.25 / 15.00]; NEURAL_HAM_MEDIUM(-1.00)[-0.999]; NEURAL_HAM_LONG(-0.99)[-0.993]; NEURAL_HAM_SHORT(-0.95)[-0.954]; R_SPF_ALLOW(-0.20)[+mx]; MIME_GOOD(-0.10)[text/plain]; MLMMJ_DEST(0.00)[dev-commits-src-all@FreeBSD.org,dev-commits-src-main@FreeBSD.org]; ARC_NA(0.00)[]; R_DKIM_NA(0.00)[]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+]; MID_RHS_MATCH_FROMTLD(0.00)[]; TO_DN_SOME(0.00)[]; ASN(0.00)[asn:2033, ipnet:166.84.0.0/16, country:US]; FROM_HAS_DN(0.00)[]; RCVD_COUNT_THREE(0.00)[3]; RCPT_COUNT_THREE(0.00)[4]; DMARC_NA(0.00)[chuq.com]; TO_MATCH_ENVRCPT_ALL(0.00)[]; RCVD_TLS_LAST(0.00)[] X-ThisMailContainsUnwantedMimeParts: N Hi Ravi, the usual path (the IPMI_SYSTEM_INTERFACE_ADDR_TYPE case) for an "application" request (ie. one from userland) is: ipmi_polled_enqueue_request() sc->ipmi_enqueue_request() ipmi_ioctl() which uses the regular (low priority) queue. in the non-IPMI_SYSTEM_INTERFACE_ADDR_TYPE case (aka. the IPMB case), ipmi_ioctl() does issue two driver requests to set up the IPMB stuff before then issuing the real application request with sc->ipmi_enqueue_request() as above. I don't know much about this IPMB thing, it looks like a kind of pass-through interface for communicating with another device that sits behind the main BMC IPMI controller. I suppose that it would make sense to change these two IPMB setup driver requests to use the regular (low priority) queue (since sleeping is ok in that path), but I don't think I have any way to test that since our systems don't use this path. -Chuck On Tue, Nov 01, 2022 at 11:53:57AM -0700, Ravi Pokala wrote: > Hi Chuck, > > As best as I can tell, the stack for handling a request from userland (e.g. `ipmitool') looks like this: > > ipmi_polled_enqueue_request_highpri() > kcs_driver_request_queue() /* assuming not panicked or dumping core */ > kcs_driver_request() > sc->ipmi_driver_request() > ipmi_submit_driver_request() > ipmi_ioctl() > > And the stack for handling kernel-internal requests (i.e. watchdog) is essentially the same, except for calling ipmi_submit_driver_request() directly rather than going through ipmi_ioctl(). Specifically, there doesn't seem to be anything that would put the two different types of requests on two different queues; it looks like everything goes onto the highpri queue. > > What am I missing? > > Thanks, > > Ravi (rpokala@) > > -----Original Message----- > From: on behalf of Chuck Silvers > Date: 2022-11-01, Tuesday at 10:55 > To: , , > Subject: git: f0f3e3e961d3 - main - ipmi: use a queue for kcs driver requests when possible > > The branch main has been updated by chs: > > URL: https://cgit.FreeBSD.org/src/commit/?id=f0f3e3e961d311f1cd938f1319385e7f454525f1 > > commit f0f3e3e961d311f1cd938f1319385e7f454525f1 > Author: Chuck Silvers > AuthorDate: 2022-11-01 17:55:14 +0000 > Commit: Chuck Silvers > CommitDate: 2022-11-01 17:55:14 +0000 > > ipmi: use a queue for kcs driver requests when possible > > The ipmi watchdog pretimeout action can trigger unintentionally in > certain rare, complicated situations. What we have seen at Netflix > is that the BMC can sometimes be sent a continuous stream of > writes to port 0x80, and due to what is a bug or misconfiguration > in the BMC software, this results in the BMC running out of memory, > becoming very slow to respond to KCS requests, and eventually being > rebooted by its own internal watchdog. While that is going on in > the BMC, back in the host OS, a number of requests are pending in > the ipmi request queue, and the kcs_loop thread is working on > processing these requests. All of the KCS accesses to process > those requests are timing out and eventually failing because the > BMC is responding very slowly or not at all, and the kcs_loop thread > is holding the IPMI_IO_LOCK the whole time that is going on. > Meanwhile the watchdogd process in the host is trying to pat the > BMC watchdog, and this process is sleeping waiting to get the > IPMI_IO_LOCK. It's not entirely clear why the watchdogd process > is sleeping for this lock, because the intention is that a thread > holding the IPMI_IO_LOCK should not sleep and thus any thread > that wants the lock should just spin to wait for it. My best guess > is that the kcs_loop thread is spinning waiting for the BMC to > respond for so long that it is eventually preempted, and during > the brief interval when the kcs_loop thread is not running, > the watchdogd thread notices that the lock holder is not running > and sleeps. When the kcs_loop thread eventually finishes processing > one request, it drops the IPMI_IO_LOCK and then immediately takes the > lock again so it can process the next request in the queue. > Because the watchdogd thread is sleeping at this point, the kcs_loop > always wins the race to acquire the IPMI_IO_LOCK, thus starving > the watchdogd thread. The callout for the watchdog pretimeout > would be reset by the watchdogd thread after its request to the BMC > watchdog completes, but since that request never processed, the > pretimeout callout eventually fires, even though there is nothing > actually wrong with the host. > > To prevent this saga from unfolding: > > - when kcs_driver_request() is called in a context where it can sleep, > queue the request and let the worker thread process it rather than > trying to process in the original thread. > - add a new high-priority queue for driver requests, so that the > watchdog patting requests will be processed as quickly as possible > even if lots of application requests have already been queued. > > With these two changes, the watchdog pretimeout action does not trigger > even if the BMC is completely out to lunch for long periods of time > (as long as the watchdogd check command does not also get stuck). > > Sponsored by: Netflix > Reviewed by: imp > Differential Revision: https://reviews.freebsd.org/D36555 > --- > sys/dev/ipmi/ipmi.c | 33 ++++++++++++++++++++++++++++++--- > sys/dev/ipmi/ipmi_kcs.c | 28 +++++++++++++++++++++++++++- > sys/dev/ipmi/ipmivars.h | 2 ++ > 3 files changed, 59 insertions(+), 4 deletions(-) > > diff --git a/sys/dev/ipmi/ipmi.c b/sys/dev/ipmi/ipmi.c > index d79690d55c68..b8705a81627b 100644 > --- a/sys/dev/ipmi/ipmi.c > +++ b/sys/dev/ipmi/ipmi.c > @@ -208,6 +208,15 @@ ipmi_dtor(void *arg) > IPMI_LOCK(sc); > if (dev->ipmi_requests) { > /* Throw away any pending requests for this device. */ > + TAILQ_FOREACH_SAFE(req, &sc->ipmi_pending_requests_highpri, ir_link, > + nreq) { > + if (req->ir_owner == dev) { > + TAILQ_REMOVE(&sc->ipmi_pending_requests_highpri, req, > + ir_link); > + dev->ipmi_requests--; > + ipmi_free_request(req); > + } > + } > TAILQ_FOREACH_SAFE(req, &sc->ipmi_pending_requests, ir_link, > nreq) { > if (req->ir_owner == dev) { > @@ -579,13 +588,19 @@ ipmi_dequeue_request(struct ipmi_softc *sc) > > IPMI_LOCK_ASSERT(sc); > > - while (!sc->ipmi_detaching && TAILQ_EMPTY(&sc->ipmi_pending_requests)) > + while (!sc->ipmi_detaching && TAILQ_EMPTY(&sc->ipmi_pending_requests) && > + TAILQ_EMPTY(&sc->ipmi_pending_requests_highpri)) > cv_wait(&sc->ipmi_request_added, &sc->ipmi_requests_lock); > if (sc->ipmi_detaching) > return (NULL); > > - req = TAILQ_FIRST(&sc->ipmi_pending_requests); > - TAILQ_REMOVE(&sc->ipmi_pending_requests, req, ir_link); > + req = TAILQ_FIRST(&sc->ipmi_pending_requests_highpri); > + if (req != NULL) > + TAILQ_REMOVE(&sc->ipmi_pending_requests_highpri, req, ir_link); > + else { > + req = TAILQ_FIRST(&sc->ipmi_pending_requests); > + TAILQ_REMOVE(&sc->ipmi_pending_requests, req, ir_link); > + } > return (req); > } > > @@ -601,6 +616,17 @@ ipmi_polled_enqueue_request(struct ipmi_softc *sc, struct ipmi_request *req) > return (0); > } > > +int > +ipmi_polled_enqueue_request_highpri(struct ipmi_softc *sc, struct ipmi_request *req) > +{ > + > + IPMI_LOCK_ASSERT(sc); > + > + TAILQ_INSERT_TAIL(&sc->ipmi_pending_requests_highpri, req, ir_link); > + cv_signal(&sc->ipmi_request_added); > + return (0); > +} > + > /* > * Watchdog event handler. > */ > @@ -817,6 +843,7 @@ ipmi_startup(void *arg) > mtx_init(&sc->ipmi_requests_lock, "ipmi requests", NULL, MTX_DEF); > mtx_init(&sc->ipmi_io_lock, "ipmi io", NULL, MTX_DEF); > cv_init(&sc->ipmi_request_added, "ipmireq"); > + TAILQ_INIT(&sc->ipmi_pending_requests_highpri); > TAILQ_INIT(&sc->ipmi_pending_requests); > > /* Initialize interface-dependent state. */ > diff --git a/sys/dev/ipmi/ipmi_kcs.c b/sys/dev/ipmi/ipmi_kcs.c > index df3b37614eb7..5908ec88e039 100644 > --- a/sys/dev/ipmi/ipmi_kcs.c > +++ b/sys/dev/ipmi/ipmi_kcs.c > @@ -32,6 +32,7 @@ __FBSDID("$FreeBSD$"); > #include > #include > #include > +#include > #include > #include > #include > @@ -490,7 +491,21 @@ kcs_startup(struct ipmi_softc *sc) > } > > static int > -kcs_driver_request(struct ipmi_softc *sc, struct ipmi_request *req, int timo) > +kcs_driver_request_queue(struct ipmi_softc *sc, struct ipmi_request *req, int timo) > +{ > + int error; > + > + IPMI_LOCK(sc); > + ipmi_polled_enqueue_request_highpri(sc, req); > + error = msleep(req, &sc->ipmi_requests_lock, 0, "ipmireq", timo); > + if (error == 0) > + error = req->ir_error; > + IPMI_UNLOCK(sc); > + return (error); > +} > + > +static int > +kcs_driver_request_poll(struct ipmi_softc *sc, struct ipmi_request *req) > { > int i, ok; > > @@ -504,6 +519,17 @@ kcs_driver_request(struct ipmi_softc *sc, struct ipmi_request *req, int timo) > return (req->ir_error); > } > > +static int > +kcs_driver_request(struct ipmi_softc *sc, struct ipmi_request *req, int timo) > +{ > + > + if (KERNEL_PANICKED() || dumping) > + return (kcs_driver_request_poll(sc, req)); > + else > + return (kcs_driver_request_queue(sc, req, timo)); > +} > + > + > int > ipmi_kcs_attach(struct ipmi_softc *sc) > { > diff --git a/sys/dev/ipmi/ipmivars.h b/sys/dev/ipmi/ipmivars.h > index b0548ee3d7c3..1468e86be73b 100644 > --- a/sys/dev/ipmi/ipmivars.h > +++ b/sys/dev/ipmi/ipmivars.h > @@ -111,6 +111,7 @@ struct ipmi_softc { > uint8_t ipmi_dev_support; /* IPMI_ADS_* */ > struct cdev *ipmi_cdev; > TAILQ_HEAD(,ipmi_request) ipmi_pending_requests; > + TAILQ_HEAD(,ipmi_request) ipmi_pending_requests_highpri; > int ipmi_driver_requests_polled; > eventhandler_tag ipmi_power_cycle_tag; > eventhandler_tag ipmi_watchdog_tag; > @@ -237,6 +238,7 @@ void ipmi_complete_request(struct ipmi_softc *, struct ipmi_request *); > struct ipmi_request *ipmi_dequeue_request(struct ipmi_softc *); > void ipmi_free_request(struct ipmi_request *); > int ipmi_polled_enqueue_request(struct ipmi_softc *, struct ipmi_request *); > +int ipmi_polled_enqueue_request_highpri(struct ipmi_softc *, struct ipmi_request *); > int ipmi_submit_driver_request(struct ipmi_softc *, struct ipmi_request *, > int); > >