Re: Some kind of race condition in adding and removing domu's causes vm zombies
- In reply to: Brian Buhrow : "Some kind of race condition in adding and removing domu's causes vm zombies"
Date: Tue, 28 Jun 2022 11:38:59 UTC
On Thu, Jun 23, 2022 at 06:30:56PM -0700, Brian Buhrow wrote:
> Hello. I don't have a lot more details on the issue, but under xen-4.15 and xen-4.16 with
> FreeBSD-12 and FreeBSD-13, it's pretty easy to end up with zombie domUs that are unkillable
> and unrestartable. Even worse, the block devices associated with these not-quite-gone domUs
> are unusable with other domUs without an entire system reboot.
>
> How to reproduce:
>
> 1. Shut down a VM that's currently running. I'm using NetBSD, but FreeBSD domUs will
> demonstrate this behavior as well.
>
> 2. If auto-restart is set in the domU's conf file, the domU will restart with a new domain ID.
>
> 3. Just as the newly restarted domU is coming up, issue:
> xl destroy <domid-of-newly-started-domain>
>
> You may see output like the following:
>
> root# xl destroy 20
> libxl: error: libxl_device.c:1111:device_backend_callback: Domain 20:unable to remove device with path /local/domain/0/backend/vbd/20/768
> libxl: error: libxl_device.c:1111:device_backend_callback: Domain 20:unable to remove device with path /local/domain/0/backend/vif/20/0
> libxl: error: libxl_domain.c:1530:devices_destroy_cb: Domain 20:libxl__devices_destroy failed
>
> Now, issue:
> # xl list
> (null) 20 0 1 --p--d 2083.7
>
> The workaround I've found for this issue is to shut down the domU with the -h flag, causing the
> system to wait for a final keypress on the console before rebooting. Then, while it's waiting,
> issue the xl destroy command on the old, waiting, domain ID.
>
> This workaround will prevent the issue, but it's my view that I shouldn't be able to wedge the
> destruction process in this way such that the entire machine needs to be restarted. Being able
> to do this makes the system rather fragile.

Hm, I don't seem to be able to reproduce this on HEAD.

Could you try a HEAD kernel and see whether you can reproduce it? (Keep the same userland; that should be fine.)

Thanks, Roger.
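
For anyone comparing results across kernels, the zombie state quoted above can be checked for mechanically. The following is only a minimal sketch: the function name `find_zombie_domains` is hypothetical. It assumes the six-column `xl list` layout shown in the report (name, ID, memory, VCPUs, state, time), and that a wedged domain shows up with the name `(null)` and the `d` (dying) flag set, as in the quoted output. The header and `Domain-0` rows in the sample are illustrative and are not taken from the report.

```python
def find_zombie_domains(xl_list_output):
    """Return (domid, state) pairs for domains that look wedged.

    Assumption: a zombie appears as in the quoted report, i.e. with the
    name "(null)" and a state string ending in "d" (dying).
    """
    zombies = []
    for line in xl_list_output.splitlines():
        # Split from the right so the five trailing columns are always
        # ID, Mem, VCPUs, State, Time(s), whatever the name contains.
        fields = line.rsplit(None, 5)
        if len(fields) != 6:
            continue
        name, domid, _mem, _vcpus, state, _time = fields
        if not domid.isdigit():
            continue  # skips the header row
        if name == "(null)" and state.endswith("d"):
            zombies.append((int(domid), state))
    return zombies


# Only the "(null)" row comes from the report; the others are illustrative.
sample = """Name ID Mem VCPUs State Time(s)
Domain-0 0 4096 4 r----- 1234.5
(null) 20 0 1 --p--d 2083.7
"""

print(find_zombie_domains(sample))
```

Feeding it the captured output of `xl list` after each reproduction attempt gives a quick yes/no answer on whether a given kernel still leaves a domain behind.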