Re: ZFS + FreeBSD XEN dom0 panic

From: Roger Pau Monné <roger.pau_at_citrix.com>
Date: Tue, 05 Apr 2022 15:22:39 UTC
On Sun, Apr 03, 2022 at 09:54:24AM +0300, Ze Dupsys wrote:
> On 2022.03.27. 12:13, Roger Pau Monné wrote:
> > ..
> > Thanks, unfortunately that patch was incomplete. I have an updated
> > version that I think is better now, and I've slightly tested it
> > (creating and destroying a domain with it doesn't seem to crash).
> > Appended patch at the end of the message.
> 
> Hi,
> 
> This patch was far better, i almost wanted to say that it works, stressed
> system with 2G RAM and it did not even have signs of sysctl-var leaks. There
> were too many things going on, thus i most probably will not be able to
> reproduce this case, but just before panic i did "xl list" command and it
> instantly crashed (new trace). What i noticed after restart, that again some
> default nightly script had made /var/backup/* files which made root file
> system full. Since test overlaid 2 nights, in first day when root disk was
> full, system did not panic, i just rm'ed /var/backup biggest file to make
> sure that there are no problems. When i did "xl list" both test VMs were
> running, state i do not know.
> 
> Full serial log in attachment, part 2 is where most interesting stuff is,
> ending is a bit mess, but:
> ..
> (XEN) d284v0: upcall vector 93
> Apr  3 08:10:11 lab-01 xenstored[937]: TDB: expand_file to 229376 failed (No
> space left on devicex)bbd24: Error 5 w
> riting backend/vbApd/284/51760/sectors
> r  3 08:10:11 lab-01 kernel: xbbd24: Fapid 9tal error. T37 ransitioning to
> Closing St(xensate
> tored), uid 0 inumber 2003596 on /: filesystekernel trap m ful12 lw
> ith interrupts disableApd
> 
> r
> Fatal trap 3 08: 110:11 2: page fault while in kerlab-nel mode
> 0cp1uid = 0 ; apicx id = 00
> enstofault virtual address	= 0red[9x20
> 37fault co]: code		= surpervisor read data, page not presentru
> inptistruction pointer	= 0x20:0xffffffff80c94e80on
>  destack pointer	        = 0x28:0xfffffe0051tected8803c0
> frame  pobyinter	        = 0x28:0xfffffe00518803d0
>  connecode segment		= basce 0x0, limit 0xfffff, type 0x1b
> tion			= DPL 0, pres 1 , long 1, def32 0, gran 1
> proces0: sor eflags	= resumerre, IOPL = 0 No sp
> current process		= 16 (xenwatch)
> traap number	ce	= 12
>   panic: page fault
> cpuid = 0
> time = 1648962612
> KDB: stack backtrace:
> #0 0xffffffff80c7c285 at kdb_backtrace+0x65
> #1 0xffffffff80c2e2e1 at vpanic+0x181
> #2 0xffffffff80c2e153 at panic+0x43
> #3 0xffffffff810c8b97 at trap+0xba7
> #4 0xffffffff810c8bef at trap+0xbff
> #5 0xffffffff810c8243 at trap+0x253
> #6 0xffffffff810a0848 at calltrap+0x8
> #7 0xffffffff80c0b87a at __mtx_unlock_sleep+0x7a
> #8 0xffffffff80a98724 at xbd_instance_create+0x7aa4
> #9 0xffffffff80a9abb0 at xbd_instance_create+0x9f30
> #10 0xffffffff80f95c64 at xenbusb_localend_changed+0x7c4
> #11 0xffffffff80ab0f04 at xs_unlock+0x704
> #12 0xffffffff80beaeee at fork_exit+0x7e
> #13 0xffffffff810a18be at fork_trampoline+0xe
> Uptime: 1d10h56m34s

Thanks, and sorry for the late reply, somehow the message slipped past me.

I've been able to get the file:line for those, and the trace is kind
of weird. I'm not sure I know what's going on TBH. It seems to me the
backend instance got freed while it was still in the process of connecting.

I've made some changes that might mitigate this, but not having a
clear understanding of what's going on makes this harder.

I've pushed the changes to:

http://xenbits.xen.org/gitweb/?p=people/royger/freebsd.git;a=shortlog;h=refs/heads/for-leak

(This is on top of the main branch.)

I'm also attaching the two patches to this email.

Let me know if those help stabilize the system.

Thanks, Roger.