Some kind of race condition in adding and removing domUs causes VM zombies
Date: Fri, 24 Jun 2022 01:30:56 UTC
Hello,

I don't have a lot more detail on the issue, but under Xen 4.15 and Xen 4.16 with FreeBSD 12 and FreeBSD 13, it's pretty easy to end up with zombie domUs that can be neither killed nor restarted. Even worse, the block devices associated with these not-quite-gone domUs can't be used by other domUs until the entire system is rebooted.

How to reproduce:

1. Shut down a VM that's currently running. I'm using NetBSD, but FreeBSD domUs will demonstrate this behavior as well.
2. If auto-restart is set in the domU's config file, the domU will restart with a new domain ID.
3. Just as the newly restarted domU is coming up, run:

   xl destroy <domid-of-newly-started-domain>

You may see output like the following:

root# xl destroy 20
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 20:unable to remove device with path /local/domain/0/backend/vbd/20/768
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 20:unable to remove device with path /local/domain/0/backend/vif/20/0
libxl: error: libxl_domain.c:1530:devices_destroy_cb: Domain 20:libxl__devices_destroy failed

Now run xl list. The domain is still there, with no name and with the paused (p) and dying (d) flags set:

# xl list
(null) 20 0 1 --p--d 2083.7

The workaround I've found is to shut down the domU with the -h flag, which makes it wait for a final keypress on the console before rebooting. While it's waiting, issue the xl destroy command against the old, waiting domain ID. This workaround prevents the issue, but in my view I shouldn't be able to wedge the destruction process in a way that requires restarting the entire machine. Being able to do so makes the system rather fragile.

-thanks
-Brian
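For anyone trying to reproduce this, step 3 is easier to hit reliably from a script than by hand. What follows is a hypothetical Python sketch, not part of xl or libxl; every function name in it is made up for illustration. It assumes that xl list prints Name, ID, Mem, VCPUs, State and Time(s) columns, that the restarted domU keeps its name, and it treats a "(null)" entry with the dying (d) flag as a zombie, matching the output in the report.

```python
"""Hypothetical helpers for hitting the restart/destroy race and spotting zombies.

Assumptions (not taken from xl/libxl): `xl list` prints Name, ID, Mem, VCPUs,
State and Time(s) columns, and a restarted domU keeps its name. Needs root on dom0.
"""
import subprocess
import time
from typing import List, NamedTuple, Optional


class Domain(NamedTuple):
    name: str
    domid: int
    state: str


def parse_xl_list(output: str) -> List[Domain]:
    """Parse `xl list` text into Domain records, skipping the header line."""
    domains = []
    for line in output.splitlines():
        # Split from the right so a name containing spaces stays in one field.
        fields = line.strip().rsplit(None, 5)
        # Data rows have six fields and a numeric ID; the header row does not.
        if len(fields) != 6 or not fields[1].isdigit():
            continue
        name, domid, _mem, _vcpus, state, _secs = fields
        domains.append(Domain(name, int(domid), state))
    return domains


def find_zombies(domains: List[Domain]) -> List[Domain]:
    """Heuristic: a nameless '(null)' entry whose sixth state flag is 'd' (dying)."""
    return [d for d in domains
            if d.name == "(null)" and len(d.state) == 6 and d.state[5] == "d"]


def xl_list() -> List[Domain]:
    """Run `xl list` and parse its output (raises if xl exits non-zero)."""
    out = subprocess.run(["xl", "list"], capture_output=True, text=True,
                         check=True).stdout
    return parse_xl_list(out)


def destroy_on_restart(name: str, old_domid: int,
                       timeout: float = 60.0, poll: float = 0.1) -> Optional[int]:
    """Wait for `name` to reappear under a new ID, then `xl destroy` it at once.

    Returns the destroyed domain ID, or None if nothing appeared before timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for d in xl_list():
            if d.name == name and d.domid != old_domid:
                # No check=True: the libxl errors are the interesting part here.
                subprocess.run(["xl", "destroy", str(d.domid)], check=False)
                return d.domid
        time.sleep(poll)
    return None
```

As a usage sketch, with a hypothetical domU named "netbsd" currently running as domain 19: start destroy_on_restart("netbsd", 19) from a root Python shell on the dom0, shut the domU down from its console, and once the call returns, check find_zombies(xl_list()) for a leftover "(null)" entry. Polling every 0.1 s is an arbitrary choice; a shorter interval lands the destroy earlier in the new domain's startup.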