git: b339ab149105 - main - ciss: Don't panic on null CR ciss_dequeue_notify

From: Warner Losh <imp_at_FreeBSD.org>
Date: Mon, 14 Oct 2024 05:41:02 UTC
The branch main has been updated by imp:

URL: https://cgit.FreeBSD.org/src/commit/?id=b339ab1491055d89415f85b6d1a03423193178f9

commit b339ab1491055d89415f85b6d1a03423193178f9
Author:     Peter Eriksson <pen@lysator.liu.se>
AuthorDate: 2024-10-14 04:01:33 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-14 05:37:46 +0000

    ciss: Don't panic on null CR ciss_dequeue_notify
    
    Apparently, sometimes on hot plug/unplug, a null cr comes back from
    ciss_dequeue_notify. This is clearly a bug, and by ignoring it we're
    papering over that bug. We only ever wake the thread after enqueing a
    notification or setting a bit about killing the thread, so once we check
    the bit isn't the cause, cr can't be NULL unless something else has
    dequeued it.
    
    Ideally, this would be fixed, rather than papered over, but this makes a
    very old card somewhat more useable for external enclosures. I suspect
    it's a race when we set CISS_THREAD_SHUT and another flag (the latter
    w/o ciss_mtx held), but I don't see it and w/o hardware to reproduce
    it would be hard to know for sure.
    
    PR: 246279
    Reviewed by: imp
    Tested by: Marek Zarychta
    Differential Revision: https://reviews.freebsd.org/D25155
---
 sys/dev/ciss/ciss.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/sys/dev/ciss/ciss.c b/sys/dev/ciss/ciss.c
index d4786302d928..d4ede91f6b35 100644
--- a/sys/dev/ciss/ciss.c
+++ b/sys/dev/ciss/ciss.c
@@ -4207,8 +4207,20 @@ ciss_notify_thread(void *arg)
 
 	cr = ciss_dequeue_notify(sc);
 
-	if (cr == NULL)
-		panic("cr null");
+	if (cr == NULL) {
+		/*
+		 * We get a NULL message sometimes when unplugging/replugging
+		 * stuff But this indicates a bug, since we only wake this thread
+		 * when we (a) set the THREAD_SHUT flag, or (b) we have enqueued
+		 * something. Since it's reported around errors, it may be a
+		 * locking bug related to ciss_flags being modified in multiple
+		 * threads some without ciss_mtx held. Or there's some other
+		 * way we either fail to sleep or corrupt the ciss_flags.
+		 */
+		ciss_printf(sc, "Driver bug: NULL notify event received\n");
+		continue;
+	}
+
 	cn = (struct ciss_notify *)cr->cr_data;
 
 	switch (cn->class) {