Re: ZFS + FreeBSD XEN dom0 panic

From: Ze Dupsys <zedupsys_at_gmail.com>
Date: Mon, 21 Mar 2022 14:07:48 UTC
On 2022.03.21. 13:14, Roger Pau Monné wrote:
> On Fri, Mar 18, 2022 at 08:48:08PM +0200, Ze Dupsys wrote:
>> On 2022.03.18. 17:24, Roger Pau Monné wrote:
>>> On Tue, Mar 15, 2022 at 08:51:57AM +0200, Ze Dupsys wrote:
>>>> On 2022.03.14. 11:19, Roger Pau Monné wrote:
>>>>> On Mon, Mar 14, 2022 at 10:06:58AM +0200, Ze Dupsys wrote:
>>>>>> ..
>>>>>>
>>>>>> Why do those lines starting with "xnb(xnb_detach:1330):" not have any
>>>>>> message? Could it be that there is a bad pointer to the message buffer
>>>>>> that cannot be printed? And then sometimes a panic happens because the
>>>>>> access goes outside the allowed memory region?
>>>>> Some messages in netback are just "\n", likely leftovers from debug.
>>>> Okay, found the lines, it is as you say. So this will not be an easy one.
>>>>
>>>>
>>>>> Can you try to stress the system again but this time with guests not
>>>>> having any network interfaces? (so that netback doesn't get used in
>>>>> dom0).
>>>> I'll try to come up with something. At the moment all commands to VMs are
>>>> given through ssh.
>>>>
>>>>
>>>>> Then if you could rebuild the FreeBSD dom0 kernel with the above patch
>>>>> we might be able to get a bit more info about blkback shutdown.
>>>> I rebuilt 13.1 STABLE, commenting out the #undef and adding a #define, so
>>>> line numbers will differ by a single line. For this test I did not remove
>>>> the network interfaces, and I added DPRINTF messages to the xnb_detach
>>>> function as well, since I hoped to maybe catch something there by printing
>>>> pointers. I somewhat did not like that xnb_detach checks neither for a NULL
>>>> return from device_get_softc nor for a NULL device_t argument, so I thought
>>>> maybe those crashes are related to that. But I guess this will not be so
>>>> easy, and maybe it is safe to assume that "device_t dev" is always valid in
>>>> that context.
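
(Adding a note in this reply: the kind of defensive check I had in mind
for xnb_detach is roughly the sketch below. It is hypothetical and
untested, and probably unnecessary if the softc is guaranteed to be
valid at that point.)

static int
xnb_detach(device_t dev)
{
    struct xnb_softc *xnb;

    /* Hypothetical guard; device_get_softc should not return NULL for
     * an attached device, so this is belt-and-suspenders only. */
    if (dev == NULL || (xnb = device_get_softc(dev)) == NULL) {
        DPRINTF("invalid dev %p\n", dev);
        return (EINVAL);
    }
    /* ... the existing teardown would continue unchanged here ... */
..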
>>>>
>>>> So I ran the stress test; the system did not crash, as often happens when
>>>> more debugging info is printed and the timing characteristics change. But
>>>> it did leak sysctl xbbd variables. I'll attach all collected log files. The
>>>> sysctl and xl list commands differ a little bit in timing. xl list _02 is
>>>> taken when all VMs are turned off. The sysctl output has only keys without
>>>> values, so as not to trigger xnb code while reading all the values.
>>>
>>> So I've been staring at this for a while, and I'm not yet sure I
>>> figured out exactly what's going on, but could you give the patch below
>>> a try?
>>>
>>> Thanks, Roger.
>>> ---8<---
>>> diff --git a/sys/xen/xenbus/xenbusb.c b/sys/xen/xenbus/xenbusb.c
>>> index e026f8203ea1..a8b75f46b9cc 100644
>>> --- a/sys/xen/xenbus/xenbusb.c
>>> +++ b/sys/xen/xenbus/xenbusb.c
>>> @@ -254,7 +254,7 @@ xenbusb_delete_child(device_t dev, device_t child)
>>>    static void
>>>    xenbusb_verify_device(device_t dev, device_t child)
>>>    {
>>> -	if (xs_exists(XST_NIL, xenbus_get_node(child), "") == 0) {
>>> +	if (xs_exists(XST_NIL, xenbus_get_node(child), "state") == 0) {
>>>    		/*
>>>    		 * Device tree has been removed from Xenbus.
>>>    		 * Tear down the device.
>>>
>>
>> Hi, Roger,
>>
>> Yes, I will run it, but the results will likely not be until Monday.
>>
>> I have done some more digging, but since I do not know how things work
>> internally, I have a lot of unknowns and assumptions. I added a lot of
>> DPRINTF calls in various places; I will attach the logs, maybe they give a
>> better picture.
>>
>> For easier log reading I filter a single log pair with these commands:
>> sed -i '' -e 's/0xfffff80/0x/g' *
>> cat *xen_serial_* | grep -E 'xbb|talkv|xs_read|xs_gather|xenbusb_add_device' > xbb.log
>> cat *sysctl* | grep -E '^dev.xbb' | grep -Ev "\.%location:|\.%pnpinfo:|\.%parent" > xbb_sysctl.log
>>
>> Then I search for a leaked sysctl *.xenstore_path and look it up in xbb.log.
>> As far as I can tell, the leaked xbb instances are the ones that have no
>> xbb_close_backend call, for whatever reason. Often the pointers of a leaked
>> xbb are reused ones from previously closed xbb instances.
>>
>>
>> My suspicion falls on these lines in blkback.c, xbb_attach_disk:
>> ..
>>    error = xs_gather(XST_NIL, xenbus_get_node(dev), "physical-device-path",
>>        NULL, &xbb->dev_name, NULL);
>>    if (error != 0)
>>      return;
>> ..
>>
>> The thing is that for those lost xbb's, this is where error == 2, which is
>> ENOENT (if I found the correct code).
>>
>> Which is weird, because I assume that the xbb_attach_disk function is called
>> only after the "hot-plug script" has finished executing. The watch is set up
>> in xbb_attach(device_t dev):
>> ..
>>    KASSERT(!xbb->hotplug_done, ("Hotplug scripts already executed"));
>>    watch_path = xs_join(xenbus_get_node(xbb->dev), "physical-device-path");
>>    xbb->hotplug_watch.callback_data = (uintptr_t)dev;
>>    xbb->hotplug_watch.callback = xbb_attach_disk;
>> ..
>>
>> So I added debugging messages to /usr/local/etc/xen/scripts/block after the
>> line "xenstore-write $path/physical-device-path $params", but did not see
>> anything suspicious.
>>
>> One of my ideas is that maybe, for some reason, the callback is called too
>> early; I would expect that by the time the callback is called, the path
>> $path/physical-device-path already exists (it is created by the script). But
>> then how can xs_gather fail with ENOENT there? Does something manage to
>> remove $path/physical-device-path in the short window between the
>> xenstore-write and the moment the callback fires?
>>
>> Another thing is that when "error = xs_gather(...)" returns non-zero,
>> xbb_attach_disk just returns. I would expect it to handle the error
>> gracefully: flag the xbb with an error and then shut it down or something,
>> but since I do not know the internals, it is hard to judge the correct
>> behaviour in that context.
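
(Adding a note in this reply: what I meant by handling it "gracefully" is
roughly the sketch below. It assumes xenbus_dev_fatal is an appropriate
way to record the failure on xenbus; this is an untested idea, not a
claim about what blkback should actually do.)

  error = xs_gather(XST_NIL, xenbus_get_node(dev), "physical-device-path",
      NULL, &xbb->dev_name, NULL);
  if (error != 0) {
    /* Hypothetical: report the error instead of silently bailing out,
     * so the toolstack and the logs can see why the attach failed. */
    xenbus_dev_fatal(dev, error, "reading %s/physical-device-path",
        xenbus_get_node(dev));
    return;
  }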
>>
>> Another interesting behaviour is that xbb_shutdown is called twice for each
>> instance: once from some other function and then again from xbb_detach.
> 
> This is expected. On proper domain shutdown the backend will be closed
> when the frontend switches to the Closing state. Hence the call from
> xbb_detach is for the cases where the backend hasn't been closed properly
> (i.e. domain destroy or a stuck frontend).
> 
>>
>> My problem with getting more useful info is that I do not understand the
>> flow of VM creation, including disk allocation.
>>
>> To my understanding, the devices just "magically" appear and get detected by
>> xbb_probe. :)
>>
>> I tried to understand the xs_gather function called from xbb_attach,
>> implemented in dev/xen/xenstore/xenstore.c, but it just writes a packet
>> somewhere; I assumed that dev/xen/xenstore/xenstore_dev.c is the one that
>> handles the other end of that packet, so I looked at "case XS_READ:" in that
>> file, but it calls xs_dev_request_and_reply, which is implemented in
>> dev/xen/xenstore/xenstore.c, so for me it feels like going in circles. So at
>> the moment I have not found which component is the one that sends the
>> ENOENT.
>>
>> Could you describe the big picture of how a VM block device is created? I
>> mean, is it the Xen tools writing a request into xenstore_dev, and how does
>> that then activate the xbb probe? Which component is responsible for
>> allocating the memory for the xbb instance?
> 
> PV devices are all handled in xenstore (a.k.a. xenbus), which is a kind
> of key-value database that contains the information about the devices.
> 
> The FreeBSD kernel sets a watch on certain xenstore paths, so that
> when new entries are added a specific function is called in order to
> detect whether devices are being added or removed.
> 
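
(Adding a note in this reply, mostly for my own understanding: based on
the xbb_attach excerpt I quoted above and on sys/xen/xenstore/xenstorevar.h,
such a watch seems to be registered roughly as in the sketch below. The
callback name is made up and I may be misreading the API, so treat it as
an illustration only.)

#include <xen/xenstore/xenstorevar.h>

/* Hypothetical callback: invoked whenever something under the watched
 * xenstore path changes. */
static void
my_watch_cb(struct xs_watch *watch, const char **vec, unsigned int len)
{
    device_t dev = (device_t)watch->callback_data;

    printf("xenstore change under %s (dev %p)\n", watch->node, dev);
}

/* ... and then, inside something like xbb_attach(), the registration for
 * "physical-device-path" appears to go roughly like this: */
    xbb->hotplug_watch.callback_data = (uintptr_t)dev;
    xbb->hotplug_watch.callback = my_watch_cb;
    xbb->hotplug_watch.node = watch_path;
    error = xs_register_watch(&xbb->hotplug_watch);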
>>
>> I added a printf in dev/xen/xenstore/xenstore.c:
>> ..
>> xs_talkv(struct xs_transaction t, enum xsd_sockmsg_type request_type,
>>      const struct iovec *iovec, u_int num_vecs, u_int *len, void **result)
>> {
>> ..
>>    if (msg.type == XS_ERROR) {
>>      error = xs_get_error(ret);
>>      // .. printf here ..
>>      free(ret, M_XENSTORE);
>> ..
>> And it prints error 22 a lot (EINVAL, if it matches errno.h) and sometimes
>> error 2 (which I suppose is the one I saw at the mentioned xs_gather call).
>> The trouble is that I cannot locate the source of the xs_talkv error
>> generation, plus when all the error messages are printed on serial it starts
>> to influence overall system behaviour, since the serial console cannot keep
>> up with the message generation rate, and as a result the sysctl leak does
>> not trigger as fast.
>>
>> Could I create a static variable in xenstore.c and set it to 1 from
>> blkback.c when I want logging and to 0 when I don't? Or is xenstore a global
>> component, so that when blkback.c sets its logging to 1, all other
>> concurrent messages will be logged as well?
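
(Adding a note in this reply: what I have in mind is roughly the sketch
below. The variable would have to be non-static, or exposed through a
setter, to be reachable from blkback.c, and since xenstore.c is shared by
everything, any concurrent caller's errors would be printed too while the
flag is set. Names are made up and this is untested.)

/* dev/xen/xenstore/xenstore.c -- hypothetical debug gate */
int xs_debug_trace = 0;
..
  if (msg.type == XS_ERROR) {
    error = xs_get_error(ret);
    if (xs_debug_trace)
      printf("xs_talkv: request type %d failed, error %d\n",
          request_type, error);
    free(ret, M_XENSTORE);
..

/* dev/xen/blkback/blkback.c -- enable the trace only around the call of
 * interest */
extern int xs_debug_trace;
..
  xs_debug_trace = 1;
  error = xs_gather(XST_NIL, xenbus_get_node(dev), "physical-device-path",
      NULL, &xbb->dev_name, NULL);
  xs_debug_trace = 0;
..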
>>
>> I would not expect xs_talkv to encounter as many error 22 cases as it does.
>> Could this be relevant? Or could it be that it is simply by design that some
>> functions test whether a key exists before creating a new one?
>>
>> I stared at the xs_talkv function for a while. It calls xs_write_store
>> twice, so I thought maybe there could be a case where a message is not
>> written "atomically", hence error 22. Yes, it does
>> sx_xlock(&xs.request_mutex), but the xs_write_store function does
>> mtx_lock(&xs.ring_lock); and mtx_unlock(&xs.ring_lock); in an interesting
>> manner. OTOH if this really were even a bit wrong, then the system would be
>> far more unstable, maybe even unusable, right?
>>
>> For now, it seems that the third process that was creating ZFS volumes is
>> actually not important. What matters is load on the system, not load on the
>> ZVOLs: I was running the VM1 and VM2 on/off stress without the zfs create/dd
>> script while building a kernel in parallel, and sysctl variables were still
>> leaked. The kernel was building on UFS on ada0, the VM disks are on ZVOLs on
>> ada1.
>>
>> For today, I guess this is it.
>>
>> Thanks for looking into this, and sorry I cannot produce more useful error
>> logs. The earlier dated logs have no error code prints. The sysctl variables
>> are taken when all VMs are down.
> 
> I think the problem is likely not with the xenstore implementation
> (i.e. xs_talkv) but rather a race in how the FreeBSD kernel detects
> and manages the addition and removal of devices that hang off xenbus.
> 
> I'm afraid there's too much data below for me to parse it.

Understood. Sounds trickier than I thought. What could I do to make the
data more useful?


> Let me know whether the proposed patch made any difference.

It did not make any difference; the system still panics and the sysctl
variables still leak.


Thanks.