Re: ZFS + FreeBSD XEN dom0 panic
- Reply: Roger Pau Monné : "Re: ZFS + FreeBSD XEN dom0 panic"
- In reply to: Roger Pau Monné : "Re: ZFS + FreeBSD XEN dom0 panic"
Date: Mon, 21 Mar 2022 14:07:48 UTC
On 2022.03.21. 13:14, Roger Pau Monné wrote:
> On Fri, Mar 18, 2022 at 08:48:08PM +0200, Ze Dupsys wrote:
>> On 2022.03.18. 17:24, Roger Pau Monné wrote:
>>> On Tue, Mar 15, 2022 at 08:51:57AM +0200, Ze Dupsys wrote:
>>>> On 2022.03.14. 11:19, Roger Pau Monné wrote:
>>>>> On Mon, Mar 14, 2022 at 10:06:58AM +0200, Ze Dupsys wrote:
>>>>>> ..
>>>>>>
>>>>>> Why do those lines starting with "xnb(xnb_detach:1330):" not have
>>>>>> any message? Could it be that there is a bad pointer to the message
>>>>>> buffer that cannot be printed? And then sometimes a panic happens
>>>>>> because the access goes out of the allowed memory region?
>>>>>
>>>>> Some messages in netback are just "\n", likely leftovers from debug.
>>>>
>>>> Okay, found the lines, it is as you say. So this will not be an easy
>>>> one.
>>>>
>>>>> Can you try to stress the system again, but this time with guests
>>>>> not having any network interfaces? (so that netback doesn't get used
>>>>> in dom0).
>>>>
>>>> I'll try to come up with something. At the moment all commands to VMs
>>>> are given through ssh.
>>>>
>>>>> Then if you could rebuild the FreeBSD dom0 kernel with the above
>>>>> patch we might be able to get a bit more info about blkback
>>>>> shutdown.
>>>>
>>>> I rebuilt 13.1 STABLE, commenting out the #undef and adding the
>>>> #define, thus line numbers will differ by a single line. For this
>>>> test I did not remove the network interfaces, and I added DPRINTF
>>>> messages to the xnb_detach function as well, since I hoped to maybe
>>>> catch something there by printing pointers. I somewhat did not like
>>>> that xnb_detach checks neither for a NULL return from
>>>> device_get_softc nor for the device_t argument, so I thought maybe
>>>> those crashes are related to that. But I guess this will not be so
>>>> easy, and maybe it is safe to assume that "device_t dev" is always
>>>> valid in that context.
>>>>
>>>> So I ran the stress test; the system did not crash, as often happens
>>>> when more debugging info is printed: characteristics change.
>>>> But it did leak sysctl xbbd variables. I'll attach all collected log
>>>> files. The sysctl and xl list commands differ a little bit in timing;
>>>> xl list _02 is when all VMs are turned off. Sysctl only has keys
>>>> without values, so as not to trigger xnb tests while reading all
>>>> values.
>>>
>>> So I've been staring at this for a while, and I'm not yet sure I
>>> figured out exactly what's going on, but can you give a try to the
>>> patch below?
>>>
>>> Thanks, Roger.
>>> ---8<---
>>> diff --git a/sys/xen/xenbus/xenbusb.c b/sys/xen/xenbus/xenbusb.c
>>> index e026f8203ea1..a8b75f46b9cc 100644
>>> --- a/sys/xen/xenbus/xenbusb.c
>>> +++ b/sys/xen/xenbus/xenbusb.c
>>> @@ -254,7 +254,7 @@ xenbusb_delete_child(device_t dev, device_t child)
>>>  static void
>>>  xenbusb_verify_device(device_t dev, device_t child)
>>>  {
>>> -	if (xs_exists(XST_NIL, xenbus_get_node(child), "") == 0) {
>>> +	if (xs_exists(XST_NIL, xenbus_get_node(child), "state") == 0) {
>>>  		/*
>>>  		 * Device tree has been removed from Xenbus.
>>>  		 * Tear down the device.
>>
>> Hi, Roger,
>>
>> Yes, I will run it, but results will likely be on Monday.
>>
>> I have done some more digging, but since I do not know how things work
>> internally, I have a lot of unknowns and assumptions. I did a lot of
>> DPRINTF stuff in various places; I will attach logs, maybe they give a
>> better picture.
>>
>> For easier log reading I filter a single log pair with the commands:
>> sed -i '' -e 's/0xfffff80/0x/g' *
>> cat *xen_serial_* | grep -E 'xbb|talkv|xs_read|xs_gather|xenbusb_add_device' > xbb.log
>> cat *sysctl* | grep -E '^dev.xbb' | grep -Ev "\.%location:|\.%pnpinfo:|\.%parent" > xbb_sysctl.log
>>
>> Then I search for a leaked sysctl *.xenstore_path and look for it in
>> xbb.log. As far as I can tell, the leaked xbb's are the ones that have
>> no xbb_close_backend call, for whatever reason. Often the pointers of a
>> leaked xbb are reused from previously closed xbb instances.
>>
>> My suspicion falls on these blkback.c, xbb_attach_disk lines:
>> ..
>> error = xs_gather(XST_NIL, xenbus_get_node(dev), "physical-device-path",
>>     NULL, &xbb->dev_name, NULL);
>> if (error != 0)
>>     return;
>> ..
>>
>> The thing is that for those lost xbb's, this is where error = 2, which
>> is ENOENT (if I found the correct code).
>>
>> Which is weird, because I assume that the xbb_attach_disk function is
>> called after the "hot-plug script" has finished execution. This is set
>> up in xbb_attach(device_t dev):
>> ..
>> KASSERT(!xbb->hotplug_done, ("Hotplug scripts already executed"));
>> watch_path = xs_join(xenbus_get_node(xbb->dev), "physical-device-path");
>> xbb->hotplug_watch.callback_data = (uintptr_t)dev;
>> xbb->hotplug_watch.callback = xbb_attach_disk;
>> ..
>>
>> So I added debugging messages to /usr/local/etc/xen/scripts/block after
>> the line "xenstore-write $path/physical-device-path $params", but did
>> not see anything suspicious.
>>
>> One of my ideas is that maybe for some reason the callback is called
>> too early, because I would expect that when the callback is called, the
>> path $path/physical-device-path should already exist (it is created by
>> the script). But then how can xs_gather fail with ENOENT there? Does
>> something manage to remove $path/physical-device-path in the short
>> window between the xenstore-write and the callback activating?
>>
>> Another thing is that xbb_attach_disk, on error from "error =
>> xs_gather", just returns if error != 0. I would expect it to handle the
>> error gracefully: flag the xbb with an error and then shut down or
>> something. But since I do not know the internals, it is hard to judge
>> the correct behaviour in that context.
>>
>> Another interesting behaviour is that xbb_shutdown is called twice for
>> each instance: once from some function and then from xbb_detach.
>
> This is expected. On proper domain shutdown the backend will be closed
> when the frontend switches to Closing state.
> Hence the call from xbb_detach is for cases where the backend hasn't
> been closed properly (ie: domain destroy or stuck frontend).
>
>> My problem with getting more useful info is that I do not understand
>> the flow of VM creation, including disk allocation.
>>
>> To my understanding they just "magically" appear, detected by
>> xbb_probe. :)
>>
>> I tried to understand the function xs_gather called by xbb_attach,
>> implemented in dev/xen/xenstore/xenstore.c, but it just writes a packet
>> somewhere; I assumed that dev/xen/xenstore/xenstore_dev.c is the one
>> that handles the other end of that packet, so I looked at "case
>> XS_READ:" in that file, but it calls xs_dev_request_and_reply, which is
>> implemented in dev/xen/xenstore/xenstore.c, so for me it feels like
>> going in circles. So at the moment I have not found which component
>> could be the one that sends ENOENT.
>>
>> Could you describe the big picture of how a VM block device is created?
>> I mean, is it the xen tools writing a request into xenstore_dev? But
>> how does that then activate the xbb probe? Which component is
>> responsible for allocating the memory for an xbb instance?
>
> PV devices are all handled in xenstore (a.k.a. xenbus), which is a kind
> of key-value database that contains the information about devices.
>
> The FreeBSD kernel sets a watch on certain xenstore paths, so that
> when new entries are added a specific function is called in order to
> detect whether devices are being added or removed.
>
>> I added a printf in dev/xen/xenstore/xenstore.c:
>> ..
>> xs_talkv(struct xs_transaction t, enum xsd_sockmsg_type request_type,
>>     const struct iovec *iovec, u_int num_vecs, u_int *len, void **result)
>> {
>> ..
>> if (msg.type == XS_ERROR) {
>>     error = xs_get_error(ret);
>>     // .. printf here ..
>>     free(ret, M_XENSTORE);
>> ..
>>
>> And it prints error 22 a lot (EINVAL, if as in errno.h) and sometimes
>> error 2 (this is the one I suppose I saw at the mentioned xs_gather
>> call).
>> The trouble is that I can not locate the source of the xs_talkv error
>> generation. Also, when all the error messages are printed on serial, it
>> starts to influence overall system behaviour, since the serial console
>> can not keep up with the message generation speed, and as a result the
>> sysctl leak does not trigger as fast.
>>
>> Could I create a static variable in xenstore.c and set it to 1 from
>> blkback.c when I want logging and to 0 when I don't? Or is it a global
>> component, so that when blkback.c sets its logging to 1, all other
>> concurrent messages will be logged as well?
>>
>> I would not expect xs_talkv to encounter as many error 22 cases as it
>> does. Could this be relevant? Or could it be that it's just by design
>> that some functions test whether a key exists before creating a new
>> one?
>>
>> I stared at the xs_talkv function for a while. It calls xs_write_store
>> twice, so I thought maybe there could be a case where the message is
>> not written "atomically"? Thus error 22. Yes, it does
>> sx_xlock(&xs.request_mutex), but the function xs_write_store does
>> mtx_lock(&xs.ring_lock) and mtx_unlock(&xs.ring_lock) in an interesting
>> manner. OTOH, if this really were even a bit wrong, the system would be
>> far more unstable, maybe even unusable, right?
>>
>> As for now, it seems that the third process that was creating zfs
>> volumes is actually not important. What is important is load on the
>> system, not load on the ZVOLs, because I was running the VM1 and VM2
>> on/off stress without the zfs create/dd script, but was building a
>> kernel in parallel, and sysctl variables were leaked. The kernel was
>> building on UFS on ada0, the VM disks on ZVOLs on ada1.
>>
>> For today I guess this is it.
>>
>> Thanks for looking into this, and sorry I can not produce more useful
>> error logs. The earlier dated ones have no error code prints. Sysctl
>> variables are taken when all VMs are down.
>
> I think the problem is likely not with the xenstore implementation
> (ie: xs_talkv) but rather a race with how the FreeBSD kernel detects
> and manages addition and removal of devices that hang off xenbus.
>
> I'm afraid there's too much data below for me to parse it.

Understood. Sounds more tricky than I thought. What could I do to make
the data more useful?

> Let me know whether the proposed patch made any difference.

It did not make any difference; the system still panics and sysctl
leaks.

Thanks.