kern/72041: Deadlock when disk is destroyed while user process closes
Brian Eng
brian at midstream.com
Thu Sep 23 11:30:27 PDT 2004
>Number: 72041
>Category: kern
>Synopsis: Deadlock when disk is destroyed while user process closes
>Confidential: no
>Severity: critical
>Priority: medium
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Thu Sep 23 18:30:27 GMT 2004
>Closed-Date:
>Last-Modified:
>Originator: Brian Eng
>Release: 5.2.1-RELEASE
>Organization:
MidStream
>Environment:
FreeBSD lexington.midstream.com 5.2.1-RELEASE FreeBSD 5.2.1-RELEASE #9: Thu Sep 2 14:23:04 PDT 2004 brian at lexington.midstream.com:/usr/src/sys/i386/compile/BRIAN i386
>Description:
The deadlock is between the geom code and the cam code. It occurred when a fibre channel cable was removed while a user process was still accessing a disk through it.
The system is set up to do a 'camcontrol rescan' upon indication from the HBA driver that the storage devices in the system may have changed. 'camcontrol rescan' triggers a succession of SCSI commands that are driven by the cambio/camisr() software interrupt. When the cable was unplugged, this led to cambio calling disk_destroy() on the disks that were now lost. disk_destroy() led to an attempt to acquire topology_lock() in the g_event thread.
Meanwhile, the user app (dd) received an I/O error and closed the device. This led to a call to g_dev_close(), which acquired topology_lock() and then went down to daclose(), which sent a SCSI SYNC_CACHE command and waited for the command to complete.
The SYNC_CACHE command completes, but cambio never notifies the syscall, because cambio is frozen waiting for the lock that the syscall is holding.
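The wait-for cycle described above can be modeled in a few lines of userland C. This is only an illustrative sketch, not kernel code: the enum labels, the array, and has_wait_cycle() are hypothetical names, and the edge from cambio to g_event assumes that disk_destroy() blocks until the g_event thread has finished, as the suggested fix implies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the three parties named in this report. */
enum actor { SYSCALL_CLOSE, CAMBIO_SWI, G_EVENT, NACTORS };

/*
 * waits_for[a] is the actor that 'a' is blocked on, or -1 if 'a' can run.
 * Follow the chain from 'start'; reaching an actor twice means a cycle.
 */
static int
has_wait_cycle(const int waits_for[], int n, int start)
{
	int seen[NACTORS] = { 0 };
	int cur = start;

	assert(n <= NACTORS);
	while (cur >= 0 && cur < n) {
		if (seen[cur])
			return (1);
		seen[cur] = 1;
		cur = waits_for[cur];
	}
	return (0);
}

/* The state described in this report (an assumed reading of it). */
static const int reported_state[NACTORS] = {
	/* daclose() waits for cambio to deliver the SYNC_CACHE completion */
	[SYSCALL_CLOSE] = CAMBIO_SWI,
	/* disk_destroy() in cambio waits for the g_event thread */
	[CAMBIO_SWI] = G_EVENT,
	/* g_event waits for the topology lock taken in g_dev_close() */
	[G_EVENT] = SYSCALL_CLOSE
};
```

Calling has_wait_cycle(reported_state, NACTORS, SYSCALL_CLOSE) follows syscall, cambio, g_event and arrives back at the syscall, so it reports a cycle. Removing any one edge, for example by making cambio not wait on g_event, breaks the cycle.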
>How-To-Repeat:
Run 'camcontrol rescan' either continuously or upon driver notification of changes. Set up several processes (I was using 'dd') to read a removable disk, then remove the disk while the processes are running.
A similar deadlock may also be possible with disk_create.
>Fix:
One perspective on this is that cambio inverted the layers; normally, geom code calls cam code, but in the 'camcontrol rescan' case, cam code calls geom code, resulting in locks being taken in opposite order. Perhaps disk_destroy could just queue to g_event and not wait for completion.
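As a rough illustration of the "queue and do not wait" idea, here is a single-threaded userland model in C. It is a sketch under stated assumptions, not a patch: struct model, post_event(), disk_destroy_async() and run_events() are all hypothetical names, and the topology lock is reduced to a flag so the ordering can be followed step by step.

```c
#include <assert.h>
#include <stddef.h>

#define MAXEVENTS 8

typedef void (*event_fn)(void *);

struct event_queue {
	event_fn fn[MAXEVENTS];
	void *arg[MAXEVENTS];
	size_t head;	/* next event to run */
	size_t tail;	/* next free slot */
};

struct model {
	int topology_locked;	/* 1 while the close path holds the lock */
	int disks_destroyed;
	struct event_queue q;
};

/* Post and return: the caller never waits for the event to run. */
static int
post_event(struct event_queue *q, event_fn fn, void *arg)
{
	if (q->tail == MAXEVENTS)
		return (-1);
	q->fn[q->tail] = fn;
	q->arg[q->tail] = arg;
	q->tail++;
	return (0);
}

/* Stand-in for the teardown work done in the event thread. */
static void
destroy_disk_event(void *arg)
{
	struct model *m = arg;

	m->disks_destroyed++;
}

/* Hypothetical non-blocking disk_destroy(): queue the teardown only. */
static int
disk_destroy_async(struct model *m)
{
	return (post_event(&m->q, destroy_disk_event, m));
}

/* Event-thread step: run queued events only when the lock is free. */
static size_t
run_events(struct model *m)
{
	size_t ran = 0;

	if (m->topology_locked)
		return (0);
	m->topology_locked = 1;	/* the event thread takes the lock */
	while (m->q.head < m->q.tail) {
		m->q.fn[m->q.head](m->q.arg[m->q.head]);
		m->q.head++;
		ran++;
	}
	m->topology_locked = 0;
	return (ran);
}
```

In this model, disk_destroy_async() returns immediately even while topology_locked is set, so the cambio side is free to deliver the SYNC_CACHE completion. The close path can then drop the lock, and a later run_events() performs the teardown. Whether the real disk_destroy() callers can tolerate the teardown finishing later is not addressed here.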
>Release-Note:
>Audit-Trail:
>Unformatted: