Re: ZFS + FreeBSD XEN dom0 panic

From: Roger Pau Monné <roger.pau_at_citrix.com>
Date: Fri, 18 Mar 2022 15:24:17 UTC

On Tue, Mar 15, 2022 at 08:51:57AM +0200, Ze Dupsys wrote:
> On 2022.03.14. 11:19, Roger Pau Monné wrote:
> > On Mon, Mar 14, 2022 at 10:06:58AM +0200, Ze Dupsys wrote:
> > > ..
> > > 
> > > Why those lines starting "xnb(xnb_detach:1330):" do not have any message?
> > > Could it be that there is a bad pointer to message buffer that can not be
> > > printed? And then sometimes panic happens because access goes out of allowed
> > > memory region?
> > Some messages in netback are just "\n", likely leftovers from debug.
> Okay, found the lines, it is as you say. So this will not be an easy one.
> 
> 
> > Can you try to stress the system again but this time with guests not
> > having any network interfaces? (so that netback doesn't get used in
> > dom0).
> I'll try to come up with something. At the moment all commands to VMs are
> given through ssh.
> 
> 
> > Then if you could rebuild the FreeBSD dom0 kernel with the above patch
> > we might be able to get a bit more of info about blkback shutdown.
> I rebuilt 13.1 STABLE, with commenting out #undef and adding #define, thus
> line number will differ by single line. For this test i did not remove
> network interfaces, and did add DPRINTF messages to xnb_detach function as
> well, since i hoped to maybe catch something there, by printing pointers. I
> somewhat did not like that xnb_detach does not check for NULL return from
> device_get_softc, nor for device_t argument, so i thought, maybe those
> crashes are something related to that. But i guess this will not be so easy,
> and maybe it is safe to assume that "device_t dev" is always valid in that
> context.
> 
> So i ran the stress test; the system did not crash, as often happens when more
> debugging info is printed and characteristics change. But it did leak sysctl
> xbbd variables. I'll attach all collected log files. sysctl and xl list
> commands differ in timing a little bit. xl list _02 is when all VMs are
> turned off. Sysctl only has keys without values, not to trigger xnb tests
> while reading all values.

So I've been staring at this for a while, and I'm not yet sure I've
figured out exactly what's going on, but can you give the patch below
a try?

Thanks, Roger.
---8<---
diff --git a/sys/xen/xenbus/xenbusb.c b/sys/xen/xenbus/xenbusb.c
index e026f8203ea1..a8b75f46b9cc 100644
--- a/sys/xen/xenbus/xenbusb.c
+++ b/sys/xen/xenbus/xenbusb.c
@@ -254,7 +254,7 @@ xenbusb_delete_child(device_t dev, device_t child)
 static void
 xenbusb_verify_device(device_t dev, device_t child)
 {
-	if (xs_exists(XST_NIL, xenbus_get_node(child), "") == 0) {
+	if (xs_exists(XST_NIL, xenbus_get_node(child), "state") == 0) {
 		/*
 		 * Device tree has been removed from Xenbus.
 		 * Tear down the device.