cvs commit: src/sys/dev/io iodev.c
Bruce Evans
brde at optusnet.com.au
Tue Aug 12 14:07:48 UTC 2008
On Tue, 12 Aug 2008, Bruce Evans wrote:
> On Sat, 9 Aug 2008, John Baldwin wrote:
>> You failed to note that not using D_TRACKCLOSE doesn't actually
>> reliably close devices.
>
> No, I noted that vfs doesn't count last closes right. Unreliability of
> last-close is a special case of that. But using D_TRACKCLOSE mostly makes
> things worse. To avoid the vfs bugs in tracked closes, it is necessary
> for individual drivers to understand vfs's bugs better than vfs does.
>
>> I set D_TRACKCLOSE on bpf because under load at work (lots of
>> concurrent nmap's) bpf devices would sometimes never get fully closed, so
>> you'd have an unopened bpf device that was no longer available.
>
> It is unclear how that helps. vfs's bugs must be really large for the
> last-close to go completely missing. kib fixed some of them a few months
> ago (before you changed bpf I think). bpf is especially simple since it
> is exclusive-access, so the vfs bugs should be smaller for it.
>
>> D_TRACKCLOSE gives much saner semantics for devices: each open() of an
>> fd calls the 'd_open' method, and the last close of that fd (which could
>> be in another process) calls 'd_close'.
>
> This is insane semantics for most devices, and D_TRACKCLOSE doesn't give it.
> Most devices don't care about the difference between file descriptors and
> files -- they want d_close to be called only when the last file descriptor
> on the device is closed. !D_TRACKCLOSE gives this, modulo the vfs bugs.
>
> D_TRACKCLOSE doesn't give these semantics: revoke() closes everything
> in 1 step and then calls d_close(), giving an N-1 correspondence between
> d_open()s and d_close()s (modulo the vfs bugs giving a fuzzier
> correspondence). (IIRC, at least before kib's fixes, it was possible
> for revoke() to cause any of 0, 1 or 2 calls to d_close(). I think 1
> call always gets as far as devfs_close() but this is only guaranteed
> to reach d_close() with D_TRACKCLOSE.) This is not a bug, but all
> drivers that depend on the correspondence being 1-1 are broken. Drivers
> that enforce "exclusive" access like bpf have N=1, so they probably
> work modulo the vfs bugs, and they have a good chance of not being
> affected by the vfs bug that gives an extra devfs_close() and thus an
> extra d_close().
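
To make that concrete: a driver that keeps its state in si_drv1 and
cannot assume one d_close() per d_open() needs its teardown to be
idempotent, roughly as in the sketch below (the foo_* names and M_FOO
are illustrative, not from any real driver):

static struct mtx foo_mtx;	/* Protects si_drv1 publication. */

static int
foo_close(struct cdev *dev, int flags, int fmt, struct thread *td)
{
	struct foo_softc *sc;

	mtx_lock(&foo_mtx);
	sc = dev->si_drv1;
	if (sc == NULL) {
		/* Extra close, e.g. revoke() raced with a last close. */
		mtx_unlock(&foo_mtx);
		return (0);
	}
	dev->si_drv1 = NULL;	/* Unpublish before tearing down. */
	mtx_unlock(&foo_mtx);
	foo_teardown(sc);	/* Hypothetical helper; must drain users. */
	free(sc, M_FOO);
	return (0);
}

Even this only handles the extra-close half; readers that loaded
si_drv1 before it was cleared still race with the teardown, which is
exactly the bug demonstrated below.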
I checked that bpf panics (even under UP) due to the obvious bugs in
its d_close():
# Generate lots of network activity using something like:
sysctl net.inet.icmp.icmplim=0; ping -fq localhost &
# Race to panic eventually:
while :; do tcpdump -i lo0 & sleep 0.001; revoke /dev/bpf0; done
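
(The `revoke' used above is not a stock utility; any trivial wrapper
around revoke(2) will do, e.g.:)

#include <err.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	int i;

	for (i = 1; i < argc; i++)
		if (revoke(argv[i]) != 0)
			warn("%s", argv[i]);
	return (0);
}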
Most or all device drivers have obvious bugs in their d_close(); bpf
is just easier to understand, and more likely to panic, than most,
since it is simple and frees resources. A panic is very likely once
si_drv1 is freed, because si_drv1 is only ever locked accidentally.
% static int
% bpfclose(struct cdev *dev, int flags, int fmt, struct thread *td)
% {
% struct bpf_d *d = dev->si_drv1;
%
% BPFD_LOCK(d);
% if (d->bd_state == BPF_WAITING)
% callout_stop(&d->bd_callout);
% d->bd_state = BPF_IDLE;
% BPFD_UNLOCK(d);
% funsetown(&d->bd_sigio);
% mtx_lock(&bpf_mtx);
% if (d->bd_bif)
% bpf_detachd(d);
% mtx_unlock(&bpf_mtx);
% selwakeuppri(&d->bd_sel, PRINET);
% #ifdef MAC
% mac_bpfdesc_destroy(d);
% #endif /* MAC */
% knlist_destroy(&d->bd_sel.si_note);
% bpf_freed(d);
% dev->si_drv1 = NULL;
% free(d, M_BPF);
Lots of possibly-in-use data structures are destroyed here.
%
% return (0);
% }
%
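
For contrast, a lookup that actually locked si_drv1 would have to look
something like the sketch below (assuming bpfclose() also cleared
si_drv1 under bpf_mtx; illustrative only, not committed code):

static struct bpf_d *
bpf_d_acquire(struct cdev *dev)
{
	struct bpf_d *d;

	/* Returns d locked (or NULL); caller must BPFD_UNLOCK(d). */
	mtx_lock(&bpf_mtx);
	d = dev->si_drv1;
	if (d != NULL)
		BPFD_LOCK(d);	/* Pin d before dropping bpf_mtx. */
	mtx_unlock(&bpf_mtx);
	return (d);
}

Even that only closes the initial dereference race: once the reader
drops BPFD_LOCK() (e.g., across the msleep() below), bpfclose() can
free d underneath it again, so a real fix needs a reference count or
forced-close support.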
% static int
% bpfread(struct cdev *dev, struct uio *uio, int ioflag)
% {
% struct bpf_d *d = dev->si_drv1;
There is no locking for si_drv1 here and elsewhere. There used to be
locking -- long ago, there was implicit locking for non-SMP non-preemptive
kernels. Then there was Giant locking.
d can be either NULL or garbage here. (It is NULL if revoke() just
cleared si_drv1 and there is an implicit or accidental memory barrier
that makes the change visible here. It is garbage if revoke() has
started destroying *d but hasn't yet set si_drv1 to NULL when we read
si_drv1, or if revoke() starts destroying *d after we read si_drv1.)
A race-free alternative that keeps the state on the open file instead
of in si_drv1 is sketched after the listing.
% int timed_out;
% int error;
%
% /*
% * Restrict application to use a buffer the same size as
% * as kernel buffers.
% */
% if (uio->uio_resid != d->bd_bufsize)
% return (EINVAL);
If we're lucky after losing the race, then d is null and we get a null
pointer panic here. I didn't observe this.
%
% BPFD_LOCK(d);
If we're unlucky after losing the race, then we start corrupting *d here.
% error = msleep(d, &d->bd_mtx, PRINET|PCATCH,
% "bpf", d->bd_rtout);
Since we block, we implemented panic(9) even with implicit or explicit
Giant locking. The sleep gives a very large race window...
% if (error == EINTR || error == ERESTART) {
% BPFD_UNLOCK(d);
% return (error);
% }
...however, we now have locking for d, and bpfclose() blocks on this,
so I think we don't implement panic(9) here. We have a broken revoke()
instead -- it should do a forced close without waiting, but instead it
waits for the unlock, either at the msleep() above or later.
% if (error == EWOULDBLOCK) {
% /*
% * On a timeout, return what's in the buffer,
% * which may be nothing. If there is something
% * in the store buffer, we can rotate the buffers.
% */
% if (d->bd_hbuf)
% /*
% * We filled up the buffer in between
% * getting the timeout and arriving
% * here, so we don't need to rotate.
% */
% break;
%
% if (d->bd_slen == 0) {
% BPFD_UNLOCK(d);
% return (0);
% }
% ROTATE_BUFFERS(d);
% break;
% }
% }
% /*
% * At this point, we know we have something in the hold slot.
% */
% BPFD_UNLOCK(d);
If bpfclose() was blocked on d's lock, then our d is especially likely
to become corrupt just after here.
%
% /*
% * Move data from hold buffer into user space.
% * We know the entire buffer is transferred since
% * we checked above that the read buffer is bpf_bufsize bytes.
% *
% * XXXRW: More synchronization needed here: what if a second thread
% * issues a read on the same fd at the same time? Don't want this
% * getting invalidated.
% */
% error = bpf_uiomove(d, d->bd_hbuf, d->bd_hlen, uio);
%
% BPFD_LOCK(d);
% d->bd_fbuf = d->bd_hbuf;
% d->bd_hbuf = NULL;
% d->bd_hlen = 0;
% bpf_buf_reclaimed(d);
% BPFD_UNLOCK(d);
%
% return (error);
% }
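
For what it's worth, the devfs per-open-file state KPI
(devfs_set_cdevpriv(9)) sidesteps the si_drv1 lifetime problem
entirely: the state hangs off the open file, and its destructor runs
exactly once, when the last reference to that file is dropped. A
sketch of how bpf could use it (illustrative, not committed code):

static void
bpf_dtor(void *data)
{
	struct bpf_d *d = data;

	/* ... the teardown currently done in bpfclose() ... */
	free(d, M_BPF);
}

static int
bpfopen_sketch(struct cdev *dev, int flags, int fmt, struct thread *td)
{
	struct bpf_d *d;
	int error;

	d = malloc(sizeof(*d), M_BPF, M_WAITOK | M_ZERO);
	/* ... per-descriptor initialization ... */
	error = devfs_set_cdevpriv(d, bpf_dtor);
	if (error != 0)
		bpf_dtor(d);
	return (error);
}

static int
bpfread_sketch(struct cdev *dev, struct uio *uio, int ioflag)
{
	struct bpf_d *d;
	int error;

	error = devfs_get_cdevpriv((void **)&d);
	if (error != 0)
		return (error);	/* File already gone; no stale pointer. */
	/* d cannot be freed out from under us while this file is live. */
	return (0);
}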
Bruce