Re: Possible bug in zfs send or pipe implementation?

From: Mark Johnston <markj_at_freebsd.org>
Date: Mon, 15 Jul 2024 14:50:12 UTC
On Sun, Jul 14, 2024 at 03:14:44PM -0700, Rick Macklem wrote:
> On Sun, Jul 14, 2024 at 10:32 AM Garrett Wollman <wollman@bimajority.org>
> wrote:
> 
> > <<On Sat, 13 Jul 2024 20:50:55 -0700, Rick Macklem <rick.macklem@gmail.com>
> > said:
> >
> > > Just to clarify it, are you saying zfs is sleeping on "pipewr"?
> > > (There is also a msleep() for "pipbww" in pipe_write().)
> >
> > It is sleeping on pipewr, yes.
> >
> > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.pipekva
> > kern.ipc.pipekva: 536576
> > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.maxpipekva
> > kern.ipc.maxpipekva: 2144993280
> >
> > It's not out of KVA, it's just waiting for the `pv` process to wake up
> > and read more data.  `pv` is single-threaded and blocked on "select".
> >
> > It doesn't always get stuck in the same place, which is why I'm
> > suspecting a lost wakeup somewhere.
> >
> This snippet from sys/kern/sys_pipe.c looks a little suspicious to me...
>   /*
> * Direct copy, bypassing a kernel buffer.
> */
> } else if ((size = rpipe->pipe_pages.cnt) != 0) {
> if (size > uio->uio_resid)
> size = (u_int) uio->uio_resid;
> PIPE_UNLOCK(rpipe);
> error = uiomove_fromphys(rpipe->pipe_pages.ms,
>     rpipe->pipe_pages.pos, size, uio);
> PIPE_LOCK(rpipe);
> if (error)
> break;
> nread += size;
> rpipe->pipe_pages.pos += size;
> rpipe->pipe_pages.cnt -= size;
> if (rpipe->pipe_pages.cnt == 0) {
> rpipe->pipe_state &= ~PIPE_WANTW;
> wakeup(rpipe);
> }
> If it reads uio_resid bytes which is less than pipe_pages.cnt, no
> wakeup() occurs.
> I'd be tempted to try getting rid of the "if (rpipe->pipe_pages.cnt == 0)"
> and do the wakeup() unconditionally, to see if it helps?

I don't think that can help.  pipe_write() will block if the "direct
write" buffer is non-empty.  See the comment in pipe_write(), "Pipe
buffered writes cannot be coincidental with direct writes".

select()/poll()/etc. should return an event if pipe_pages.cnt > 0 on the
read side, so I suspect that the problem is elsewhere, or else I'm
misunderstanding something.

> Because if the application ("pv" in this case) doesn't do another read() on
> the
> pipe before calling select(), no wakeup() is going to occur, because here's
> what pipe_write() does...
> /*
> * We have no more space and have something to offer,
> * wake up select/poll.
> */
> pipeselwakeup(wpipe);
> 
> wpipe->pipe_state |= PIPE_WANTW;
> pipeunlock(wpipe);
> error = msleep(wpipe, PIPE_MTX(rpipe),
>     PRIBIO | PCATCH, "pipewr", 0);
> pipelock(wpipe, 0);
> if (error != 0)
> break;
> continue;
> Note that, once in msleep(), no call to pipeselwakeup() will occur until
> it gets woken up.
> 
> I think the current code assumes that the reader ("pv" in this case) will
> read all the data out of the pipe before calling select() again.
> Does it do that?
> 
> rick
> ps: I've added markj@ as a cc, since he seems to have been the last guy
> involved
>     in sys_pipe.c.