From nobody Sun Jul 14 22:14:44 2024 X-Original-To: freebsd-stable@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4WMflN6MG4z5RWsp for ; Sun, 14 Jul 2024 22:14:56 +0000 (UTC) (envelope-from rick.macklem@gmail.com) Received: from mail-io1-xd2c.google.com (mail-io1-xd2c.google.com [IPv6:2607:f8b0:4864:20::d2c]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "WR4" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4WMflN4Yfzz55dk; Sun, 14 Jul 2024 22:14:56 +0000 (UTC) (envelope-from rick.macklem@gmail.com) Authentication-Results: mx1.freebsd.org; none Received: by mail-io1-xd2c.google.com with SMTP id ca18e2360f4ac-7fedb09357aso198445639f.2; Sun, 14 Jul 2024 15:14:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1720995295; x=1721600095; darn=freebsd.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=LFIOPX4B/MPsQviTAnc5CYWLYQdbUC2pWO+49/o0Dc4=; b=Rsh2HFIt+LjuGEbx0MgZXdb0ZS1xMQdpAZM7sH1QH1htAPMLCa3HPx/SzQoXQvkPia 5Y9yz5y7xSnAKVhEfR5WaJuiIHIGKo3gDe22Uu6R5jOVxh5qK5fBjONIJIhVF6fM9wSg 3dcPEg5+5ZCmDPCjaIj/m1veMvmo9cP1I622DwRApljTqiKVXi36D8FNOs4NZ+C6gs8H f0roBrt+s8DhKCcbEIsZvJMDFixelObXthrEvW1yW4Bk47NNPt5TfFnnjW2RBNSVESFD m1nxxPYrXCZQmEhMdRyZk4tOsHpY2eCCvQUnS7bXqjTefueHNZk7uj/2SSViYMqjl/VB AHnw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1720995295; x=1721600095; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=LFIOPX4B/MPsQviTAnc5CYWLYQdbUC2pWO+49/o0Dc4=; b=a3ZXM2qq8JF6hr94zo4p+sSx5mxlr30MchNTYjyv7A6K5Vm13OMaV9b6dKTjadlmOU pZZlickiQ0c2uM68jobMFKjXQbbm6ns/HcBTG5kREb95CwpdAzKcp5L+jGSTtlWGVR5C N6MzRTOrj6zdSrmRWknq2gahn4o50fbwI6Psq/m2km1wM06kvp9f/xKpj9Nbd31dzSgY Fs1E0C/usu60m6swLs3UfXEP7T5U/TaoUPGQ/mS0C2Kj0TzNLl3RQSJJHgovdznNcEav 0md+WqrTqWPma+6buVMyDMr4xM/yaeGUoD+ZdvdXHIcLiodDSXKV7vlwjcGjIKO7Ilmo qF7g== X-Forwarded-Encrypted: i=1; AJvYcCWP01MofGAnvAeISbp8uoMtcvOPvAkqCKO5VjPi4E3ZrEVykM5fNJJcBtqUzaa7w18f6L57ufABk2I+OqrFWkP3+Q== X-Gm-Message-State: AOJu0Yyxaf6OQ/B2nzA3/wSbmrSZQ9mH2oxxWjAdEMBcoYAj9uGHF+Qk 3IuBzbAPceEGBSEHyqSezPeyZ2kjvWNkqzVSB0ecij8q3aTYP3sor2Tt4yErZ+yjcNHcsXviCZv a9qYZkMWoGucMsBSSdtnGD2yVwA== X-Google-Smtp-Source: AGHT+IEDZu71htSgNqq/4PqBDAcTuH50o3ybksfNyw8R8qIEBYX5wvnao8kLxz8AyNQk+h78wlZzBlY35V5C1WhUOk4= X-Received: by 2002:a6b:f20c:0:b0:804:657d:92fc with SMTP id ca18e2360f4ac-804657d94a6mr1515371839f.18.1720995295198; Sun, 14 Jul 2024 15:14:55 -0700 (PDT) List-Id: Production branch of FreeBSD source code List-Archive: https://lists.freebsd.org/archives/freebsd-stable List-Help: List-Post: List-Subscribe: List-Unsubscribe: X-BeenThere: freebsd-stable@freebsd.org Sender: owner-freebsd-stable@FreeBSD.org MIME-Version: 1.0 References: <26259.12713.114036.564205@hergotha.csail.mit.edu> <26259.17366.276955.824313@hergotha.csail.mit.edu> <26260.2984.961319.782123@hergotha.csail.mit.edu> In-Reply-To: <26260.2984.961319.782123@hergotha.csail.mit.edu> From: Rick Macklem Date: Sun, 14 Jul 2024 15:14:44 -0700 Message-ID: Subject: Re: Possible bug in zfs send or pipe implementation? To: Garrett Wollman Cc: freebsd-stable@freebsd.org, Mark Johnston Content-Type: multipart/alternative; boundary="0000000000003b04a7061d3c7185" X-Spamd-Bar: ---- X-Rspamd-Pre-Result: action=no action; module=replies; Message is reply to one we originated X-Spamd-Result: default: False [-4.00 / 15.00]; REPLY(-4.00)[]; TAGGED_FROM(0.00)[]; ASN(0.00)[asn:15169, ipnet:2607:f8b0::/32, country:US] X-Rspamd-Queue-Id: 4WMflN4Yfzz55dk --0000000000003b04a7061d3c7185 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Sun, Jul 14, 2024 at 10:32=E2=80=AFAM Garrett Wollman wrote: > < > said: > > > Just to clarify it, are you saying zfs is sleeping on "pipewr"? > > (There is also a msleep() for "pipbww" in pipe_write().) > > It is sleeping on pipewr, yes. > > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.pipekva > kern.ipc.pipekva: 536576 > [wollman@nfs-prod-11 ~]$ sysctl kern.ipc.maxpipekva > kern.ipc.maxpipekva: 2144993280 > > It's not out of KVA, it's just waiting for the `pv` process to wake up > and read more data. `pv` is single-threaded and blocked on "select". > > It doesn't always get stuck in the same place, which is why I'm > suspecting a lost wakeup somewhere. > This snippet from sys/kern/sys_pipe.c looks a little suspicious to me... /* * Direct copy, bypassing a kernel buffer. */ } else if ((size =3D rpipe->pipe_pages.cnt) !=3D 0) { if (size > uio->uio_resid) size =3D (u_int) uio->uio_resid; PIPE_UNLOCK(rpipe); error =3D uiomove_fromphys(rpipe->pipe_pages.ms, rpipe->pipe_pages.pos, size, uio); PIPE_LOCK(rpipe); if (error) break; nread +=3D size; rpipe->pipe_pages.pos +=3D size; rpipe->pipe_pages.cnt -=3D size; if (rpipe->pipe_pages.cnt =3D=3D 0) { rpipe->pipe_state &=3D ~PIPE_WANTW; wakeup(rpipe); } If it reads uio_resid bytes which is less than pipe_pages.cnt, no wakeup() occurs. I'd be tempted to try getting rid of the "if (rpipe->pipe_pages.cnt =3D=3D = 0)" and do the wakeup() unconditionally, to see if it helps? Because if the application ("pv" in this case) doesn't do another read() on the pipe before calling select(), no wakeup() is going to occur, because here's what pipe_write() does... /* * We have no more space and have something to offer, * wake up select/poll. */ pipeselwakeup(wpipe); wpipe->pipe_state |=3D PIPE_WANTW; pipeunlock(wpipe); error =3D msleep(wpipe, PIPE_MTX(rpipe), PRIBIO | PCATCH, "pipewr", 0); pipelock(wpipe, 0); if (error !=3D 0) break; continue; Note that, once in msleep(), no call to pipeselwakeup() will occur until it gets woken up. I think the current code assumes that the reader ("pv" in this case) will read all the data out of the pipe before calling select() again. Does it do that? rick ps: I've added markj@ as a cc, since he seems to have been the last guy involved in sys_pipe.c. > -GAWollman > > --0000000000003b04a7061d3c7185 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


<= div class=3D"gmail_quote">
On Sun, Jul= 14, 2024 at 10:32=E2=80=AFAM Garrett Wollman <wollman@bimajority.org> wrote:
<<On Sat, 13 Jul 2024 20:50:55= -0700, Rick Macklem <rick.macklem@gmail.com> said:

> Just to clarify it, are you saying zfs is sleeping on "pipewr&quo= t;?
> (There is also a msleep() for "pipbww" in pipe_write().)

It is sleeping on pipewr, yes.

[wollman@nfs-prod-11 ~]$ sysctl kern.ipc.pipekva
kern.ipc.pipekva: 536576
[wollman@nfs-prod-11 ~]$ sysctl kern.ipc.maxpipekva
kern.ipc.maxpipekva: 2144993280

It's not out of KVA, it's just waiting for the `pv` process to wake= up
and read more data.=C2=A0 `pv` is single-threaded and blocked on "sele= ct".

It doesn't always get stuck in the same place, which is why I'm
suspecting a lost wakeup somewhere.
This snippet from sys/kern/sys_= pipe.c looks a little suspicious to me...=C2=A0
=C2=A0 /*<= /span>
* Direct copy, bypassing a kernel buffer.
*/
} else if ((size =3D rpipe->pipe_pages.cnt) != =3D 0) {
if (size > uio->uio_resid)
<= /span>size =3D (u_int) uio->uio_resid;
PIPE_UNLOCK(rp= ipe);
error =3D uiomove_fromphys(rpipe->pipe_pages.ms,
=C2=A0 =C2=A0 rpi= pe->pipe_pages.pos, size, uio);
PIPE_LOCK(rpipe);
if (error)
break;
nr= ead +=3D size;
rpipe->pipe_pages.pos +=3D size;
rpipe->pipe_pages.cnt -=3D size;
if (rp= ipe->pipe_pages.cnt =3D=3D 0) {
rpipe->pipe_state= &=3D ~PIPE_WANTW;
= wakeup(rpipe);
= }
If it reads uio_resid bytes which is less = than pipe_pages.cnt, no
wakeup() occurs= .
I'd be tempted to try getting rid= of the "if (rpipe->pipe_pages.cnt =3D=3D 0)"
and do the wakeup() unconditionally, to see if it helps= ?

Because if the application ("pv" in this case) doesn't do= another read() on the
pipe before call= ing select(), no wakeup() is going to occur, because here's
what pipe_write() does...
/*
* We have no more space and have something to offer,
= * wake up select/poll.
*/
pipeselwakeup(wpipe);

wpipe->pipe_state |=3D PIPE_WANTW;=
pipeunlock(wpipe);
error =3D msleep(wpipe, PI= PE_MTX(rpipe),
=C2=A0 = =C2=A0 PRIBIO | PCATCH, "pipewr", 0);
pipelock(wpipe, 0);
if (error !=3D 0)
break;
continue;
Note that, once in msleep(), no call to pipeselwakeup= () will occur until
it gets woken up.

I = think the current code assumes that the reader ("pv" in this case= ) will
read all the data out of the pipe before calling select() = again.
Does it do that?

rick
p= s: I've added markj@ as a cc, since he seems to have been the last guy = involved
=C2=A0 =C2=A0 in sys_pipe.c.


-GAWollman

--0000000000003b04a7061d3c7185--