From nobody Mon Mar 21 14:07:48 2022 X-Original-To: freebsd-xen@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 8D31F1A26952 for ; Mon, 21 Mar 2022 14:08:09 +0000 (UTC) (envelope-from zedupsys@gmail.com) Received: from mail-lf1-x130.google.com (mail-lf1-x130.google.com [IPv6:2a00:1450:4864:20::130]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1D4" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4KMc085llzz4g5S for ; Mon, 21 Mar 2022 14:08:08 +0000 (UTC) (envelope-from zedupsys@gmail.com) Received: by mail-lf1-x130.google.com with SMTP id 5so6604751lfp.1 for ; Mon, 21 Mar 2022 07:08:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=reply-to:subject:to:cc:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=oqnB22vBeUomZwxeahF3u+nOQ9GSkNIAwVYoL83pIbs=; b=Fo2rLJ5QfNSRRlz9hu8QcQ2qtYe0LqDV5kOADK9DM174LAdqLhF6JHPDRPsigYxt6X f9GhnLkyDPQH2qTMVzYFvqbbzyu6MIkvQnUaklOUVuBO5EqzTKFFeZg9PAwM4pruA9Xa YkX3KEtsEGK7p7nGSsqVYPg6l5QE8zxL2MJ96vDdvj+j0xx0xqRoJV1if2nqO6vvAJPy hsKi/tZD5F9Wm4KfolYxbegkbNe+K5g8cyCS00YJR5tI/UX7yGrJzWfO/hZgVaet1zt8 5OfnIfV+nkPS0Dx8V3GBBcFqUvnq4dbzNmUZM4oAANJXi1S3/42L77oJfmI7yoKDry+q D1kw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:subject:to:cc:references:from :message-id:date:user-agent:mime-version:in-reply-to :content-language:content-transfer-encoding; bh=oqnB22vBeUomZwxeahF3u+nOQ9GSkNIAwVYoL83pIbs=; b=1inpmiS67nWmqJlVEOGqPk26X3XkLByetm+aK4DTk4yX9mFIWR4jS8Bm3crcDpKr0z IYMdJz5rgyo3KK/ZODOjC+7EDF/50vPR8blE2ve6lR/X5yawIlf2od1LKX3io0U8jHbj cMnjlv8lIXC239uhXxT+fC6y3u4RWZQJJlMrSOCxpszwwMFB1Ne7NNhOCXgHSNvm93Wt 0o/VW2wLTFLXwEVy6sdKSc91IAMU2MhjAoNdGtNtdoI+yV/pXF3uedlC6ORm+W1wDHqH 768x1c1ygEmAWAflAtfbcp7Bdzlirb/GOGMRmcVZUTIvLUuOJ+WYuuC1uzwCfpERmnw0 3gtA== X-Gm-Message-State: AOAM531MK1U2dboSpRHTYmfOt0R1O1W2KRsEp2+N/xzTfM9VcPpTe19M 73mmjA9YfTN/xOsINkStz7XEgVaxJ08= X-Google-Smtp-Source: ABdhPJzn2bXTEYVM1v6N0wgu6zBIrb0Q4/iVZEm/kYtCLanz4kmU+Sn6SOa1qDhAISvS2zJ6cBONlA== X-Received: by 2002:a05:6512:b11:b0:44a:2ead:daf2 with SMTP id w17-20020a0565120b1100b0044a2eaddaf2mr2767724lfu.642.1647871687049; Mon, 21 Mar 2022 07:08:07 -0700 (PDT) Received: from [10.3.0.1] ([213.110.65.3]) by smtp.googlemail.com with ESMTPSA id s10-20020a19ad4a000000b0044826a25a2esm1811387lfd.292.2022.03.21.07.08.06 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 21 Mar 2022 07:08:06 -0700 (PDT) Reply-To: zedupsys@gmail.com Subject: Re: ZFS + FreeBSD XEN dom0 panic To: =?UTF-8?Q?Roger_Pau_Monn=c3=a9?= Cc: freebsd-xen@freebsd.org, buhrow@nfbcal.org References: <3d4691a7-c4b3-1c91-9eaa-7af071561bb6@gmail.com> <5dfdecd5-f94d-29b4-791e-0adde5405cf5@gmail.com> From: Ze Dupsys Message-ID: Date: Mon, 21 Mar 2022 16:07:48 +0200 User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.12.0 List-Id: Discussion List-Archive: https://lists.freebsd.org/archives/freebsd-xen List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-xen@freebsd.org X-BeenThere: freebsd-xen@freebsd.org MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: lv Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: 4KMc085llzz4g5S X-Spamd-Bar: -- Authentication-Results: mx1.freebsd.org; dkim=pass header.d=gmail.com header.s=20210112 header.b=Fo2rLJ5Q; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (mx1.freebsd.org: domain of zedupsys@gmail.com designates 2a00:1450:4864:20::130 as permitted sender) smtp.mailfrom=zedupsys@gmail.com X-Spamd-Result: default: False [-2.00 / 15.00]; HAS_REPLYTO(0.00)[zedupsys@gmail.com]; RCVD_VIA_SMTP_AUTH(0.00)[]; TO_DN_SOME(0.00)[]; R_SPF_ALLOW(-0.20)[+ip6:2a00:1450:4000::/36:c]; FREEMAIL_FROM(0.00)[gmail.com]; REPLYTO_ADDR_EQ_FROM(0.00)[]; RCVD_COUNT_THREE(0.00)[3]; DKIM_TRACE(0.00)[gmail.com:+]; DMARC_POLICY_ALLOW(-0.50)[gmail.com,none]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+]; FREEMAIL_ENVFROM(0.00)[gmail.com]; ASN(0.00)[asn:15169, ipnet:2a00:1450::/32, country:US]; MID_RHS_MATCH_FROM(0.00)[]; DWL_DNSWL_NONE(0.00)[gmail.com:dkim]; ARC_NA(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; R_DKIM_ALLOW(-0.20)[gmail.com:s=20210112]; FROM_HAS_DN(0.00)[]; RCPT_COUNT_THREE(0.00)[3]; NEURAL_SPAM_SHORT(1.00)[1.000]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[text/plain]; FREEMAIL_REPLYTO(0.00)[gmail.com]; PREVIOUSLY_DELIVERED(0.00)[freebsd-xen@freebsd.org]; TO_MATCH_ENVRCPT_SOME(0.00)[]; RCVD_IN_DNSWL_NONE(0.00)[2a00:1450:4864:20::130:from]; MLMMJ_DEST(0.00)[freebsd-xen]; RCVD_TLS_ALL(0.00)[] X-ThisMailContainsUnwantedMimeParts: N On 2022.03.21. 13:14, Roger Pau Monné wrote: > On Fri, Mar 18, 2022 at 08:48:08PM +0200, Ze Dupsys wrote: >> On 2022.03.18. 17:24, Roger Pau Monné wrote: >>> On Tue, Mar 15, 2022 at 08:51:57AM +0200, Ze Dupsys wrote: >>>> On 2022.03.14. 11:19, Roger Pau Monné wrote: >>>>> On Mon, Mar 14, 2022 at 10:06:58AM +0200, Ze Dupsys wrote: >>>>>> .. >>>>>> >>>>>> Why those lines starting "xnb(xnb_detach:1330):" do not have any message? >>>>>> Could it be that there is a bad pointer to message buffer that can not be >>>>>> printed? And then sometimes panic happens because access goes out of allowed >>>>>> memory region? >>>>> Some messages in netback are just "\n", likely leftovers from debug. >>>> Okay, found the lines, it is as you say. So this will not be an easy one. >>>> >>>> >>>>> Can you try to stress the system again but this time with guests not >>>>> having any network interfaces? (so that netback doesn't get used in >>>>> dom0). >>>> I'll try to come up with something. At the moment all commands to VMs are >>>> given through ssh. >>>> >>>> >>>>> Then if you could rebuild the FreeBSD dom0 kernel with the above patch >>>>> we might be able to get a bit more of info about blkback shutdown. >>>> I rebuilt 13.1 STABLE, with commenting out #undef and adding #define, thus >>>> line number will differ by single line. For this test i did not remove >>>> network interfaces, and did add DPRINTF messages to xnb_detach function as >>>> well, since i hoped to maybe catch something there, by printing pointers. I >>>> somewhat did not like that xnb_detach does not check for NULL return from >>>> device_get_softc, nor for device_t argument, so i though, maybe those >>>> crashes are something related to that. But i guess this will not be so easy, >>>> and maybe it is safe to assume that "device_t dev" is always valid in that >>>> context. >>>> >>>> So i ran stress test, system did not crash as it happens often when more >>>> debugging info is printed, characteristics change. But it did leak sysctl >>>> xbbd variables. I'll attach all collected log files. sysctl and xl list >>>> commands differ in timing a little bit. xl list _02 is when all VMs are >>>> turned off. Sysctl only has keys without values, not to trigger xnb tests >>>> while reading all values. >>> >>> So I've been staring at this for a while, and I'm not yet sure I >>> figured out exactly what's going on, but can you give a try to the >>> patch below? >>> >>> Thanks, Roger. >>> ---8<--- >>> diff --git a/sys/xen/xenbus/xenbusb.c b/sys/xen/xenbus/xenbusb.c >>> index e026f8203ea1..a8b75f46b9cc 100644 >>> --- a/sys/xen/xenbus/xenbusb.c >>> +++ b/sys/xen/xenbus/xenbusb.c >>> @@ -254,7 +254,7 @@ xenbusb_delete_child(device_t dev, device_t child) >>> static void >>> xenbusb_verify_device(device_t dev, device_t child) >>> { >>> - if (xs_exists(XST_NIL, xenbus_get_node(child), "") == 0) { >>> + if (xs_exists(XST_NIL, xenbus_get_node(child), "state") == 0) { >>> /* >>> * Device tree has been removed from Xenbus. >>> * Tear down the device. >>> >> >> Hi, Roger, >> >> Yes, i will run it, but results will likely be on Monday. >> >> I have done some more digging, but since i do not know how things work >> internally, i have a lot of unknowns and assumptions. I did a lot of DPRINTF >> stuff in various places, i will attach logs, maybe they give better picture. >> >> For easier log reading i filter single log pair with commands: >> sed -i '' -e 's/0xfffff80/0x/g' * >> cat *xen_serial_* | grep -E 'xbb|talkv|xs_read|xs_gather|xenbusb_add_device' >>> xbb.log >> cat *sysctl* | grep -E '^dev.xbb' | grep -Ev >> "\.%location:|\.%pnpinfo:|\.%parent" > xbb_sysctl.log >> >> Then i search for leaked sysctl *.xenstore_path and search that in xbb.log. >> As far as i can tell, cases when it is leaked xbb's have no >> xbb_close_backend call, for whatever reason. Often pointers for leaked xbb >> are a reused ones from previously closed xbb instnaces. >> >> >> My suspicion goes for blkback.c, xbb_attach_disk lines: >> .. >> error = xs_gather(XST_NIL, xenbus_get_node(dev), "physical-device-path", >> NULL, &xbb->dev_name, NULL); >> if (error != 0) >> return; >> .. >> >> The thing is that for those lost xbb's, this is where error = 2, which is >> ENOENT (if i found correct code). >> >> Which is weird, because i assume that xbb_attach_disk function is called >> after "hot-plug script" has finished execution. This is defined in >> xbb_attach(device_t dev): >> .. >> KASSERT(!xbb->hotplug_done, ("Hotplug scripts already executed")); >> watch_path = xs_join(xenbus_get_node(xbb->dev), "physical-device-path"); >> xbb->hotplug_watch.callback_data = (uintptr_t)dev; >> xbb->hotplug_watch.callback = xbb_attach_disk; >> .. >> >> So i added debugging messages at file: /usr/local/etc/xen/scripts/block >> after line "xenstore-write $path/physical-device-path $params", but did not >> see anything suspicious. >> >> One of my ideas is that, maybe for some reason callback is called too ealry, >> because i would expect that when the callback is called, the path >> $path/physical-device-path should exist (is created by script). But then how >> can xs_gather fail with ENOENT there? Something manages to remove >> $path/physical-device-path in the short window between xenstore-write and >> while callback activates? >> >> Another thing is that function xbb_attach_disk on error for "error = >> xs_gather" just returns if != 0. I would expect it to be able to handle >> error gracefully it should flag xbb with an error and then shut down or >> something, but since i do not know internals, it is hard to judge correct >> behaviour in that context. >> >> Another interesting behaviour is that xbb_shutdown is called twice for each >> instance. Once from some function and then from xbb_detach. > > This is expected. On proper domain shutdown the backend will be closed > when the frontend switches to Closing state. Hence the cal from > xbb_detach is for cases where the backend hasn't been closed properly > (ie: domain destroy or stuck frontend). > >> >> My problem of geting more useful info is that i do not understand the flow >> of VM creation, including disk allocation. >> >> For my understanding they just "magically" appear detected with xbb_probe. >> :) >> >> I tried to understand function xs_gather called by xbb_attach, implemented >> by dev/xen/xenstore/xenstore.c, but it just writes packet somewhere; i >> assumed that dev/xen/xenstore/xenstore_dev.c is the one that handles other >> end for that packet, so i looked at "case XS_READ:" in that file, but it >> calls xs_dev_request_and_reply which is implemented by >> dev/xen/xenstore/xenstore.c, so for me it feels like going in circles. So at >> the moment i did not find which component could be the one that sends >> ENOENT. >> >> Could you tell the big picture how VM block dev is created? I mean is it xen >> tools writing request in xenstore_dev, but how it then activates xbb probe? >> Which component is responsible for allocating memory xbb instance? > > PV devices are all handled in xenstore (a.k.a. xenbus) which is a kind > of key-value database that contains the information about devices. > > The FreeBSD kernel sets a watch on certain xenstore paths, so that > when new entries are added a specific function is called on order to > detect whether devices are being added or removed. > >> >> I added printf in dev/xen/xenstore/xenstore.c >> .. >> xs_talkv(struct xs_transaction t, enum xsd_sockmsg_type request_type, >> const struct iovec *iovec, u_int num_vecs, u_int *len, void **result) >> { >> .. >> if (msg.type == XS_ERROR) { >> error = xs_get_error(ret); >> // .. printf here .. >> free(ret, M_XENSTORE); >> .. >> And it prints error 22 a lot (EINVAL, if like in errno.h) and sometimes >> error 2 (this is the one i suppose i saw at mentioned xs_gather call). The >> trouble is that i can not locate the source for xs_talkv error generation + >> when all error messages are printed on serial, it starts to influence >> overall system behaviour, since serial can not keep up with message >> generation speed and as a result sysctl leak does not trigger as fast. >> >> Could i create static variable in xenstore.c and set it to 1 from blkback.c >> when i want logging and to 0 when i don't? Or is it a global component and >> when blkback.c would set's it's logging to 1, all other concurrent messages >> will be logged? >> >> I would not expect xs_talkv to encounter as many error 22 cases, as it does. >> Could this be relevant? Or could it be that it's just by design that some >> functions test if key exists before creating new one? >> >> I stared at xs_talkv function for a while. It calls xs_write_store twice, so >> i thought maybe there could be a case when message is not writen >> "atomically"? Thus error 22. Yes, it does sx_xlock(&xs.request_mutex), but >> function xs_write_store does mtx_lock(&xs.ring_lock); and >> mtx_unlock(&xs.ring_lock); in an interesting manner. OTOH if this really was >> even a bit wrong, then system would be far more unstable maybe even >> unusable, right? >> >> As for now, it seems that actually the third process that was creating zfs >> volumes is not important. What is important is load on system, not load on >> ZVOLs, because i was running VM1 and VM2 on/off stress without zfs create/dd >> script, but was building kernel on parallel, sysctl variables were leaked. >> Kernel was building on UFS ada0, VM disks on ZVOLs ada1. >> >> For today i guess this is it. >> >> Thanks for looking into this and sorry i can not produce more useful error >> logs. Earlier dated has no error code prints. Sysctl variables are taken >> when all VMs are down. > > I think the problem is not likely with the xenstore implementation > (ie: xs_talkv) but rather a race with how the FreeBSD kernel detects > and manages addition and removal of devices that hang off xenbus. > > I'm afraid there's too much data below for me to parse it. Understood. Sounds more tricky than i thought. What could i do to make data more useful? > Let me know whether the proposed patch made any difference. It did not make any difference, system still panics and sysctl leaks. Thanks.