Re: Possible bug in zfs send or pipe implementation?
- In reply to: Rick Macklem : "Re: Possible bug in zfs send or pipe implementation?"
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Mon, 15 Jul 2024 14:50:12 UTC
On Sun, Jul 14, 2024 at 03:14:44PM -0700, Rick Macklem wrote: > On Sun, Jul 14, 2024 at 10:32 AM Garrett Wollman <wollman@bimajority.org> > wrote: > > > <<On Sat, 13 Jul 2024 20:50:55 -0700, Rick Macklem <rick.macklem@gmail.com> > > said: > > > > > Just to clarify it, are you saying zfs is sleeping on "pipewr"? > > > (There is also a msleep() for "pipbww" in pipe_write().) > > > > It is sleeping on pipewr, yes. > > > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.pipekva > > kern.ipc.pipekva: 536576 > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.maxpipekva > > kern.ipc.maxpipekva: 2144993280 > > > > It's not out of KVA, it's just waiting for the `pv` process to wake up > > and read more data. `pv` is single-threaded and blocked on "select". > > > > It doesn't always get stuck in the same place, which is why I'm > > suspecting a lost wakeup somewhere. > > > This snippet from sys/kern/sys_pipe.c looks a little suspicious to me... > /* > * Direct copy, bypassing a kernel buffer. > */ > } else if ((size = rpipe->pipe_pages.cnt) != 0) { > if (size > uio->uio_resid) > size = (u_int) uio->uio_resid; > PIPE_UNLOCK(rpipe); > error = uiomove_fromphys(rpipe->pipe_pages.ms, > rpipe->pipe_pages.pos, size, uio); > PIPE_LOCK(rpipe); > if (error) > break; > nread += size; > rpipe->pipe_pages.pos += size; > rpipe->pipe_pages.cnt -= size; > if (rpipe->pipe_pages.cnt == 0) { > rpipe->pipe_state &= ~PIPE_WANTW; > wakeup(rpipe); > } > If it reads uio_resid bytes which is less than pipe_pages.cnt, no > wakeup() occurs. > I'd be tempted to try getting rid of the "if (rpipe->pipe_pages.cnt == 0)" > and do the wakeup() unconditionally, to see if it helps? I don't think that can help. pipe_write() will block if the "direct write" buffer is non-empty. See the comment in pipe_write(), "Pipe buffered writes cannot be coincidental with direct writes". select()/poll()/etc. should return an event if pipe_pages.cnt > 0 on the read side, so I suspect that the problem is elsewhere, or else I'm misunderstanding something. > Because if the application ("pv" in this case) doesn't do another read() on > the > pipe before calling select(), no wakeup() is going to occur, because here's > what pipe_write() does... > /* > * We have no more space and have something to offer, > * wake up select/poll. > */ > pipeselwakeup(wpipe); > > wpipe->pipe_state |= PIPE_WANTW; > pipeunlock(wpipe); > error = msleep(wpipe, PIPE_MTX(rpipe), > PRIBIO | PCATCH, "pipewr", 0); > pipelock(wpipe, 0); > if (error != 0) > break; > continue; > Note that, once in msleep(), no call to pipeselwakeup() will occur until > it gets woken up. > > I think the current code assumes that the reader ("pv" in this case) will > read all the data out of the pipe before calling select() again. > Does it do that? > > rick > ps: I've added markj@ as a cc, since he seems to have been the last guy > involved > in sys_pipe.c.