Re: ZFS + FreeBSD XEN dom0 panic

In reply to: Ze Dupsys : "Re: ZFS + FreeBSD XEN dom0 panic"
From: Roger Pau Monné <roger.pau_at_citrix.com>
Date: Mon, 14 Mar 2022 09:19:58 UTC
On Mon, Mar 14, 2022 at 10:06:58AM +0200, Ze Dupsys wrote:
> I'd like to share more analysis of the given problem. I do not know if this
> helps or not, but I have noticed that across all my saved serial log
> outputs, the panic messages follow some of these lines.
> 
> ..
> (XEN) HVM d34v0 save: TSC_ADJUST
> (XEN) HVM d34v0 save: CPU_MSR
> (XEN) HVM34 restore: CPU 0
> xnb(xnb_detach:1330):
> xnb(xnb_detach:1339):
> .. => panic
> 
> 
> Most of the panics are like this:
> ..
> (XEN) HVM d26v0 save: TSC_ADJUST
> (XEN) HVM d26v0 save: CPU_MSR
> (XEN) HVM26 restore: CPU 0
> .. => panic
> 
> ..
> (XEN) HVM d42v0 save: TSC_ADJUST
> (XEN) HVM d42v0 save: CPU_MSR
> (XEN) HVM42 restore: CPU 0
> xnb(xnb_detach:1330):
> xnb(xnb_detach:1339):
> xnb(xnb_detach:1330):
> xnb(xnb_detach:1339):
> .. => panic
> 
> 
> This one, I think, had different stress conditions than the others, but I
> don't remember.
> ..
> (XEN) HVM d660v0 save: CPU_MSR
> (XEN) HVM660 restore: CPU 0
> (XEN) d659v0: upcall vector 93
> spin lock 0xffffffff81eaa780 (sched lock 1) held by 0xfffff8020152d000 (tid 100434) too long
> timeout stopping cpus
> panic: spin lock held too long

That one seems to be a watchdog panic, although it's quite likely caused
by an out-of-memory condition.

> .. => panic
> 
> 
> In the serial output in between, when there are no crashes, I have noticed
> that there are at least 2 different execution paths.
> 
> For most VMs the boot flow continues with serial lines like these:
> ..
> (XEN) HVM1 restore: CPU 0
> xnb(xnb_probe:1123): Claiming device 0, xnb
> xnb(xnb_attach:1267): Attaching to backend/vif/1/0
> xnb(xnb_frontend_changed:1391): frontend_state=Initialising, xnb_state=InitWait
> (d1) HVM Loader
> ..
> 
> For some, though, there are lines like these. They still boot; it just
> seemed that these lines might be a possible continuation of an "unsuccessful
> panic".
> ..
> (XEN) HVM3 restore: CPU 0
> xnb(xnb_detach:1330):
> xnb(xnb_detach:1339):
> xnb(xnb_detach:1330):
> xnb(xnb_detach:1339):
> xnb(xnb_probe:1123): Claiming device 0, xnb
> xnb(xnb_attach:1267): Attaching to backend/vif/3/0
> xnb(xnb_frontend_changed:1391): frontend_state=Initialising, xnb_state=InitWait
> (d3) HVM Loader
> ..
> 
> Why do those lines starting with "xnb(xnb_detach:1330):" not have any message?
> Could it be that there is a bad pointer to a message buffer that cannot be
> printed? And then sometimes a panic happens because the access goes outside
> the allowed memory region?

Some debug messages in netback consist of just "\n", likely leftovers from
debugging, so only the function:line prefix gets printed.
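
To illustrate how such an empty-looking line can come about, here is a
minimal sketch. It assumes the debug macro prepends the calling function
name and line number to whatever format string the caller passes. The names
DPRINTF_SKETCH, format_debug and detach_sketch are made up for this example
and are not copied from netback.c.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Illustrative helper (hypothetical, not the netback code): write an
 * "xnb(function:line): " prefix followed by the caller's formatted message.
 */
static int
format_debug(char *buf, size_t size, const char *func, int line,
    const char *fmt, ...)
{
	va_list ap;
	int n;

	n = snprintf(buf, size, "xnb(%s:%d): ", func, line);
	if (n < 0 || (size_t)n >= size)
		return (n);	/* prefix failed or was truncated */
	va_start(ap, fmt);
	n += vsnprintf(buf + n, size - (size_t)n, fmt, ap);
	va_end(ap);
	return (n);
}

/* Debug macro sketch: the compiler fills in function name and line. */
#define DPRINTF_SKETCH(buf, size, ...) \
	format_debug((buf), (size), __func__, __LINE__, __VA_ARGS__)

/* A call that passes only "\n" produces the prefix and nothing else. */
static void
detach_sketch(char *buf, size_t size)
{
	DPRINTF_SKETCH(buf, size, "\n");
}
```

Calling detach_sketch() fills the buffer with the prefix, the function name
and line number, and a bare newline. That matches the shape of the
"xnb(xnb_detach:1330):" lines, so no bad pointer is needed to explain them.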

Can you try to stress the system again but this time with guests not
having any network interfaces? (so that netback doesn't get used in
dom0).

Then, if you could rebuild the FreeBSD dom0 kernel with the patch below,
we might be able to get a bit more info about blkback shutdown.

Thanks, Roger.

---8<---
diff --git a/sys/dev/xen/blkback/blkback.c b/sys/dev/xen/blkback/blkback.c
index 792933402c93..84ebb9068881 100644
--- a/sys/dev/xen/blkback/blkback.c
+++ b/sys/dev/xen/blkback/blkback.c
@@ -125,7 +125,7 @@ __FBSDID("$FreeBSD$");
 /**
  * \brief Define to enable rudimentary request logging to the console.
  */
-#undef XBB_DEBUG
+#define XBB_DEBUG 1
 
 /*---------------------------------- Macros ----------------------------------*/
 /**