Re: Possible bug in zfs send or pipe implementation?
Date: Sun, 14 Jul 2024 22:14:44 UTC
On Sun, Jul 14, 2024 at 10:32 AM Garrett Wollman <wollman@bimajority.org> wrote: > <<On Sat, 13 Jul 2024 20:50:55 -0700, Rick Macklem <rick.macklem@gmail.com> > said: > > > Just to clarify it, are you saying zfs is sleeping on "pipewr"? > > (There is also a msleep() for "pipbww" in pipe_write().) > > It is sleeping on pipewr, yes. > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.pipekva > kern.ipc.pipekva: 536576 > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.maxpipekva > kern.ipc.maxpipekva: 2144993280 > > It's not out of KVA, it's just waiting for the `pv` process to wake up > and read more data. `pv` is single-threaded and blocked on "select". > > It doesn't always get stuck in the same place, which is why I'm > suspecting a lost wakeup somewhere. > This snippet from sys/kern/sys_pipe.c looks a little suspicious to me... /* * Direct copy, bypassing a kernel buffer. */ } else if ((size = rpipe->pipe_pages.cnt) != 0) { if (size > uio->uio_resid) size = (u_int) uio->uio_resid; PIPE_UNLOCK(rpipe); error = uiomove_fromphys(rpipe->pipe_pages.ms, rpipe->pipe_pages.pos, size, uio); PIPE_LOCK(rpipe); if (error) break; nread += size; rpipe->pipe_pages.pos += size; rpipe->pipe_pages.cnt -= size; if (rpipe->pipe_pages.cnt == 0) { rpipe->pipe_state &= ~PIPE_WANTW; wakeup(rpipe); } If it reads uio_resid bytes which is less than pipe_pages.cnt, no wakeup() occurs. I'd be tempted to try getting rid of the "if (rpipe->pipe_pages.cnt == 0)" and do the wakeup() unconditionally, to see if it helps? Because if the application ("pv" in this case) doesn't do another read() on the pipe before calling select(), no wakeup() is going to occur, because here's what pipe_write() does... /* * We have no more space and have something to offer, * wake up select/poll. */ pipeselwakeup(wpipe); wpipe->pipe_state |= PIPE_WANTW; pipeunlock(wpipe); error = msleep(wpipe, PIPE_MTX(rpipe), PRIBIO | PCATCH, "pipewr", 0); pipelock(wpipe, 0); if (error != 0) break; continue; Note that, once in msleep(), no call to pipeselwakeup() will occur until it gets woken up. I think the current code assumes that the reader ("pv" in this case) will read all the data out of the pipe before calling select() again. Does it do that? rick ps: I've added markj@ as a cc, since he seems to have been the last guy involved in sys_pipe.c. > -GAWollman > >