Re: ZFS + FreeBSD XEN dom0 panic

From: Ze Dupsys <zedupsys_at_gmail.com>
Date: Thu, 10 Mar 2022 17:46:21 UTC
On 2022.03.10. 17:40, Roger Pau Monné wrote:
> That's not expected. Can you paste the output of `xenstore-ls -fp`
> when you get those stale entries?
Yes, i can (xenstore_ls_fp.log). Today i was stressing 12.1 with 8GB RAM
for 10 hours.
I will power off all DomUs and attach the output of the commands to this
email.


> Also, what does `xl list`?
I attached xl_list.log so as not to lose formatting due to email software.
I added the output of "sysctl -a -N | grep xbbd" as well, since it shows
that the sysctl variables exist.
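
A minimal sketch of how the two attached outputs could be cross-checked for stale disk backends. It assumes that `xenstore-ls -fp` prints full paths of the form /local/domain/<backend>/backend/vbd/<domid>/... and that `xl list` prints the usual six columns (Name, ID, Mem, VCPUs, State, Time(s)). The function names are hypothetical and this is not part of any Xen tooling.

```python
import re

# Assumed layout of a backend disk entry in `xenstore-ls -fp` output, e.g.
#   /local/domain/0/backend/vbd/486/51712/state = "4"   (n0,r486)
# The captured group is the frontend (DomU) domain ID.
BACKEND_RE = re.compile(r"^/local/domain/\d+/backend/vbd/(\d+)(?=[/\s]|$)")


def backend_domids(xenstore_text):
    """Return the set of domain IDs that have vbd backend entries."""
    ids = set()
    for line in xenstore_text.splitlines():
        match = BACKEND_RE.match(line.strip())
        if match:
            ids.add(int(match.group(1)))
    return ids


def running_domids(xl_list_text):
    """Return the set of domain IDs listed by `xl list` (header skipped)."""
    ids = set()
    for line in xl_list_text.splitlines()[1:]:
        # Split from the right so the name column may contain anything:
        # Name, ID, Mem, VCPUs, State, Time(s)
        fields = line.rsplit(None, 5)
        if len(fields) == 6 and fields[1].isdigit():
            ids.add(int(fields[1]))
    return ids


def stale_domids(xenstore_text, xl_list_text):
    """Domain IDs with backend entries but no matching running domain."""
    return sorted(backend_domids(xenstore_text) - running_domids(xl_list_text))
```

With the attached files this could be called as stale_domids(open("xenstore_ls_fp.log").read(), open("xl_list.log").read()). Any ID it returns would be a domain that is gone from `xl list` but still has backend entries in xenstore.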


> Also, can you check the log files at '/var/log/xen/xl-*.log' (where * is
> the domain name) to try to gather why the backend is not properly
> destroyed?
Those files do not seem to be helpful. I have not seen any messages other
than these, e.g. in xl-xen-vm2-zvol-5.log.7:
Waiting for domain xen-vm2-zvol-5 (domid 486) to die [pid 11628]
Domain 486 has shut down, reason code 0 0x0
Action for shutdown reason code 0 is destroy
Domain 486 needs to be cleaned up: destroying the domain
Done. Exiting now


BTW is it possible to disable automatic log file rotation in
/var/log/xen? Or to make it always append to a single file so that no info
is lost? Maybe there was some info for domain ID 200, for example; it's
just that i have never seen different messages there.


The same goes for the qemu-dm-xen-* files, nothing eye-catching:
qemu-system-i386: -serial pty: char device redirected to /dev/pts/6 (label serial0)
VNC server running on 0.0.0.0:5907
qemu-system-i386: terminating on signal 1 from pid 24773 (xl)


> Right, at some point you will run out of memory if resources are not
> properly freed. Can you try to boot with "boot_verbose=YES" in
> /boot/loader.conf and see if that gives you are more information?
Yes, i will add and see if some more info is available.


> Otherwise I might have to provide you with a patch to blkback in order
> to attempt to detect why backends are not destroyed.
Sure. I'll just have to learn how to compile Xen from source then. I
think i saw a howto somewhere.

Which version of FreeBSD should i stress? 13.0? Or must it all be built
from current source then?


> Since you stress the system quite badly, do you by any chance see 'xl'
> processes getting terminated? Background xl processes being killed
> will lead to backends not being properly shut down.
I wouldn't stress it so much if it behaved in the first place when the 
load was normal :)

I've never seen any process containing xl in its name being killed.
qemu-system-i386 has been OOM killed, though; in those instances the VM
usually just does not start, or it cleans up and is gone. And i've seen
the system panic without OOM kill messages.

For the crash cases under normal load, i only noticed that it always
happened when i had some sort of HDD load, but i never remembered how
many VMs i had started or restarted. Thus i wrote these stress test
scripts in the hope of finding the reason why those crashes happened. I
would not even mind if just the DomUs crashed, as long as their HDDs are
not corrupted and Dom0 stays stable and can automatically reboot them.

As for the tests, i've been experimenting in various ways to understand
whether it is due to the VM restart count, the attached HDD count or the
data throughput on the HDDs. But for now i'm going in circles, because
sometimes the panic happens early, but sometimes it happens between domID
200 and 300 or later, and those non-deterministic cases are hard to pin
down.

Thanks!