14.0-RELEASE and 15-CURRENT have high interrupt count/latency in Xen domu (xcp-ng host)
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Sun, 10 Mar 2024 00:27:03 UTC
On an xcp-ng VM running 14.0-RELEASE (also tested on p5 and 15-CURRENT), top shows a lot of CPU time in interrupts (always a double-digit percentage, usually around 70% but it varies all over the place). Based on dtrace, the problem appears to be in ata_interrupt and I’d guess it’s mutex-related. What else do I want to do to investigate this issue? All of below is on an instance with 1GB RAM and 2 vCPUs; the host is xcp-ng 8.2 with all available patches. dtrace -n '::*interrupt*:entry { self->ts = timestamp; } ::*interrupt*:return /self->ts/ { @[probefunc] = quantize(timestamp - self->ts); self->ts = 0; }' run for one minute on an otherwise idle VM produces: zio_delay_interrupt value ------------- Distribution ------------- count 512 | 0 1024 |@@@@@@@ 2 2048 |@@@ 1 4096 | 0 8192 |@@@@@@@@@@@@@@@@@@@@ 6 16384 |@@@@@@@@@@ 3 32768 | 0 ata_interrupt value ------------- Distribution ------------- count 65536 | 0 131072 |@@@ 3 262144 |@@@@@@@@@@@@@@@@@ 19 524288 |@@@@@@@@ 9 1048576 |@@@@@@@@@@@@ 13 2097152 | 0 On CURRENT, that dtrace invocation returns nothing at all; dtrace -l shows an entry but not return for ata_interrupt so that appears to be an fbt issue. The disk does appear to be using the PV driver: xbd0: 32768MB <Virtual Block Device> at device/vbd/768 on xenbusb_front0 xbd0: attaching as ada0 xbd0: features: write_barrier xbd0: synchronize cache commands enabled. There are two mutex warnings that show up on CURRENT, but they seem unrelated: Invoking IPv6 network device address event may sleep with the following non-sleepable locks held: exclusive sleep mutex xnrx_1 (netfront receive lock) r = 0 (0xfffffe00496d2d48) locked @ /usr/src/sys/dev/xen/netfront/netfront.c:679 stack backtrace: #0 0xffffffff80bc9a35 at witness_debugger+0x65 #1 0xffffffff80bcab99 at witness_warn+0x3e9 #2 0xffffffff80d793c4 at in6_update_ifa+0xcb4 #3 0xffffffff80da4f80 at in6_ifadd+0x1e0 #4 0xffffffff80da16d9 at nd6_ra_input+0x1029 #5 0xffffffff80d73de0 at icmp6_input+0x870 #6 0xffffffff80d8c4ae at ip6_input+0xc8e #7 0xffffffff80cadbbd at netisr_dispatch_src+0xad #8 0xffffffff80c9065a at ether_demux+0x16a #9 0xffffffff80c91d73 at ether_nh_input+0x403 #10 0xffffffff80cadbbd at netisr_dispatch_src+0xad #11 0xffffffff80c90ab9 at ether_input+0xd9 #12 0xffffffff809c6ebc at xn_rxeof+0x5dc #13 0xffffffff809c8238 at xn_intr+0x48 #14 0xffffffff80b0d556 at ithread_loop+0x266 #15 0xffffffff80b09972 at fork_exit+0x82 #16 0xffffffff8102e06e at fork_trampoline+0xe lock order reversal: (sleepable after non-sleepable) 1st 0xfffffe00496d2d48 xnrx_1 (netfront receive lock, sleep mutex) @ /usr/src/sys/dev/xen/netfront/netfront.c:679 2nd 0xffffffff81abe830 in6_multi_sx (in6_multi_sx, sx) @ /usr/src/sys/netinet6/in6_mcast.c:1217 lock order netfront receive lock -> in6_multi_sx attempted at: #0 0xffffffff80bc95f3 at witness_checkorder+0xb63 #1 0xffffffff80b5f0fd at _sx_xlock+0x5d #2 0xffffffff80d81b11 at in6_joingroup+0x31 #3 0xffffffff80d7977d at in6_update_ifa+0x106d #4 0xffffffff80da4f80 at in6_ifadd+0x1e0 #5 0xffffffff80da16d9 at nd6_ra_input+0x1029 #6 0xffffffff80d73de0 at icmp6_input+0x870 #7 0xffffffff80d8c4ae at ip6_input+0xc8e #8 0xffffffff80cadbbd at netisr_dispatch_src+0xad #9 0xffffffff80c9065a at ether_demux+0x16a #10 0xffffffff80c91d73 at ether_nh_input+0x403 #11 0xffffffff80cadbbd at netisr_dispatch_src+0xad #12 0xffffffff80c90ab9 at ether_input+0xd9 #13 0xffffffff809c6ebc at xn_rxeof+0x5dc #14 0xffffffff809c8238 at xn_intr+0x48 #15 0xffffffff80b0d556 at ithread_loop+0x266 #16 0xffffffff80b09972 at fork_exit+0x82 #17 0xffffffff8102e06e at fork_trampoline+0xe (The "non-sleepable locks held” backtrace shows up a second time but without a lock reversal.) systat -vmstat on RELEASE: 3 users Load 0.13 0.19 0.22 Mar 10 00:21:55 Mem usage: 89%Phy 78%Kmem VN PAGER SWAP PAGER Mem: REAL VIRTUAL in out in out Tot Share Tot Share Free count Act 47944K 20684K 406M 24480K 107M pages All 50736K 23368K 446M 57372K ioflt Interrupts Proc: cow 69 total r p d s w Csw Trp Sys Int Sof Flt zfod atkbd0 1 31 91 2 67 36 1 ozfod ata1 15 %ozfod uhci0 23 0.0%Sys 78.9%Intr 0.0%User 0.0%Nice 21.1%Idle daefr 17 cpu0:xen | | | | | | | | | | | prcfr 17 cpu1:xen +++++++++++++++++++++++++++++++++++++++ totfr cpu0:r dtbuf react cpu0:itlb Namei Name-cache Dir-cache 62482 maxvn pdwak 4 cpu0:b Calls hits % hits % 17168 numvn 5 pdpgs cpu1:r 7 7 100 16002 frevn intrn cpu1:itlb 772M wire 6 cpu1:b Disks ada0 cd0 pass0 13M act 13 xen_et0:c0 KB/t 0.00 0.00 0.00 60M inact 11 xen_et0:c1 tps 0 0 0 0 laund xenstore0 MB/s 0.00 0.00 0.00 107M free xbd0 2113 %busy 0 0 0 0 buf 1 xn0 2114 xn0 2115 xn0 2116 xn0 2117 and on CURRENT: 4 users Load 0.15 0.12 0.16 Mar 10 00:23:11 Mem usage: 35%Phy 8%Kmem VN PAGER SWAP PAGER Mem: REAL VIRTUAL in out in out Tot Share Tot Share Free count Act 103M 14608K 454M 16848K 626M pages All 105M 16908K 486M 49620K ioflt Interrupts Proc: cow 274 total r p d s w Csw Trp Sys Int Sof Flt zfod atkbd0 1 29 272 1 49 137 1 ozfod uhci0 23 %ozfod 114 cpu0:xen 0.0%Sys 56.0%Intr 0.0%User 0.0%Nice 44.0%Idle daefr 23 cpu1:xen | | | | | | | | | | | prcfr 2 cpu0:r ++++++++++++++++++++++++++++ totfr cpu0:itlb 6 dtbuf react 12 cpu0:b Namei Name-cache Dir-cache 62833 maxvn pdwak 1 cpu1:r Calls hits % hits % 920 numvn 10 pdpgs cpu1:itlb 7 7 100 126 frevn intrn 17 cpu1:b 217M wire 100 xen_et0:c0 Disks ada0 31M act 4 xen_et0:c1 KB/t 0.00 84M inact xenstore0 tps 0 0 laund xbd0 2113 MB/s 0.00 626M free 1 xn0 2114 %busy 0 80M buf xn0 2115 xn0 2116 xn0 2117 — Brad Ackerman brad@facefault.org