close() of an flock'd file is not atomic
John Baldwin
jhb at freebsd.org
Fri Mar 16 18:06:34 UTC 2012
On Friday, March 09, 2012 10:59:29 am John Baldwin wrote:
> On Thursday, March 08, 2012 5:39:19 pm Konstantin Belousov wrote:
> > On Thu, Mar 08, 2012 at 03:39:07PM -0500, John Baldwin wrote:
> > > On Wednesday, March 07, 2012 1:18:07 pm John Baldwin wrote:
> > > > So I ran into this problem at work. Suppose you have a process that opens a
> > > > read-write file descriptor with O_EXLOCK (so it has an flock()). It then
> > > > writes out a binary into that file. Another process wants to execve() the
> > > > file when it is ready, so it opens the file with O_EXLOCK (or O_SHLOCK), and
> > > > will call execve() once it has locked the file. In theory, what should happen
> > > > is that the second process should wait until the first process has finished
> > > > and called close(). In practice what happens is that I occasionally see the
> > > > second process fail with ETXTBUSY.
> > > >
> > > > The bug is that the vn_closefile() does the VOP_ADVLOCK() to unlock the file
> > > > separately from the call to vn_close() which drops the writecount. Thus, the
> > > > second process can do an open() and flock() of the file and subsequently call
> > > > execve() after the first process has done the VOP_ADVLOCK(), but before it
> > > > calls into vn_close(). In fact, since vn_close() requires a write lock on the
> > > > vnode, this turns out to not be too hard to reproduce at all. Below is a
> > > > simple test program that reproduces this constantly. To use, copy /bin/test
> > > > to some other file (e.g. /tmp/foo) and make it writable (chmod a+w), then run
> > > > ./flock_close_race /tmp/foo.
> > > >
> > > > The "fix" I came up with is to defer calling VOP_ADVLOCK() to release the lock
> > > > until after vn_close() executes. However, even with that fix applied, my test
> > > > case still fails. Now it is because open() with a given lock flag is
> > > > non-atomic in that the open(O_RDWR) will call vn_open() and bump v_writecount
> > > > before it blocks on the lock due to O_EXLOCK, so even though the 'exec_child'
> > > > process has the fd locked, the writecount can still be bumped. One gross hack
> > > > would be to defer the bump of the writecount to the caller of vn_open() if the
> > > > caller passes in O_EXLOCK or O_SHLOCK, but that's a really gross kludge, plus
> > > > it doesn't actually work. I ended up moving acquiring the lock into
> > > > vn_open_cred(). The current patch I'm testing has both of these approaches,
> > > > but the first one is #if 0'd out, and the second is #if 1'd.
> > > >
> > > > http://www.freebsd.org/~jhb/patches/flock_open_close.patch
> > >
> > > Based on some feedback from Konstantin, I've fixed some issues in the failure
> > > path handling for VOP_ADVLOCK(). I've also removed the #if 0'd code mentioned
> > > above, so the patch is now the actual change that I'm testing. So far it
> > > handles both my workload at work and my test program without any issues.
> >
> > I think a comment is needed for a reason to call vn_writechk() second time.
>
> Fixed.
>
> > Could you, please, point me, where the FHASLOCK is set for O_EXLOCK | O_SHLOCK
> > case in the patched kernel ?
>
> It wasn't. :( I wonder how this was even working since close shouldn't have
> been unlocking. I'll need to do some more testing. BTW, I ran into fhopen()
> and found that I would need to put all this same logic into that, so I've split
> the common code from fhopen() and vn_open_cred() into a new vn_open_vnode().
> I think in general it improves both sets of code.
>
> I'll upate the patch once I've done some more testing.
So the current patch I'm testing is at
http://www.FreeBSD.org/~jhb/patches/flock_open_close2.patch
This includes the aforementioned vn_open_vnode(). There are a few #if 0
debugging nits that would be removed before committing. The one in fhopen()
is an open question if it should be non-restartable the way open() is. I've
also had to update my test case to work around a race between it's two
"threads" where the read/write open could complete before the other child had
finished exit() resulting in a correct ETXTBSY error. The first process just
ignores ETXTBSY for errors to workaround. It is included below.
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysctl.h>
#include <sys/wait.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
static void
usage(void)
{
fprintf(stderr, "Usage: flock_close_race <binary> [args]\n");
exit(1);
}
static void
child(const char *binary)
{
int fd;
/* Exit as soon as our parent exits. */
while (getppid() != 1) {
fd = open(binary, O_RDWR | O_EXLOCK);
if (fd < 0) {
/*
* This may get ETXTBSY since exit() will
* close its open fd's (thus releasing the
* lock), before it releases the vmspace (and
* mapping of the binary).
*/
if (errno == ETXTBSY)
continue;
err(1, "can't open %s", binary);
}
close(fd);
}
exit(0);
}
static void
exec_child(char **av)
{
int fd, value, save;
fd = open(av[0], O_RDONLY | O_SHLOCK);
execv(av[0], av);
save = errno;
value = 0;
sysctlbyname("debug.ktr.mask", NULL, NULL, &value, sizeof(value));
errno = save;
err(127, "execv");
}
int
main(int ac, char **av)
{
struct stat sb;
pid_t pid;
if (ac < 2)
usage();
if (stat(av[1], &sb) != 0)
err(1, "stat(%s)", av[1]);
if (!S_ISREG(sb.st_mode))
errx(1, "%s not an executable", av[1]);
pid = fork();
if (pid < 0)
err(1, "fork");
if (pid == 0)
child(av[1]);
for (;;) {
pid = fork();
if (pid < 0)
err(1, "vfork");
if (pid == 0)
exec_child(av + 1);
wait(NULL);
}
return (0);
}
--
John Baldwin
More information about the freebsd-fs
mailing list