From nobody Tue Jul 16 23:41:33 2024 X-Original-To: freebsd-stable@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4WNwZf6KB3z5QV8x for ; Tue, 16 Jul 2024 23:41:46 +0000 (UTC) (envelope-from eborisch@gmail.com) Received: from mail-qv1-xf33.google.com (mail-qv1-xf33.google.com [IPv6:2607:f8b0:4864:20::f33]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "WR4" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4WNwZd5fLBz4WTF for ; Tue, 16 Jul 2024 23:41:45 +0000 (UTC) (envelope-from eborisch@gmail.com) Authentication-Results: mx1.freebsd.org; dkim=pass header.d=gmail.com header.s=20230601 header.b=Ad2GttrM; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (mx1.freebsd.org: domain of eborisch@gmail.com designates 2607:f8b0:4864:20::f33 as permitted sender) smtp.mailfrom=eborisch@gmail.com Received: by mail-qv1-xf33.google.com with SMTP id 6a1803df08f44-6b5dd7cd945so31358566d6.1 for ; Tue, 16 Jul 2024 16:41:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1721173304; x=1721778104; darn=freebsd.org; h=cc:to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=LibgtovcoT3o120H6yF9UuY19Ry4jvn9pqZ28BWvHZE=; b=Ad2GttrMQAzeDxMPs7wyrEnDYiF5WgLaii2WUcnRh+3JetbWRQ0qjZOhz+PhLVO4gq 9ETWFcRePiQU15ujJgkSvpIIOWvS1pwgcismQLsrFhUmoFUDA40sPA2SJTZ64P9AJzSY YGtXZ8YZzjNU9g5JvmlfH4TeHrrvc4RylzvtZMcMN0swdG87xA790wWYxuTXRyY7b9Ys UvgZ8ge6BPKHVPjj0mPqnYqZJ/JN5p5JhU/N0FnPAjn5rBCM+qi8dd2clUPk2u2D4aoU rRztkpvxxX7HewhJs/4RsIhbTqyYgPOB5xBWuiPUTzaHky5OU5pm1J23hnfytBURbKsw 2W9w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1721173304; x=1721778104; h=cc:to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=LibgtovcoT3o120H6yF9UuY19Ry4jvn9pqZ28BWvHZE=; b=JiXiUsWEsZQDxV/ROj3MjtvhuMHKlcD4XzYp1MbWiMD0MK5iMZVIXyhThtPl31IM5Q /FyJ3dL5E8Gt9YpgUD1U9ZxEolcy7zL2VOPIiOrRv7RIc0gpRbMcPslRls5bgn+KDh4A 0YAomcXbttqHYvJ7g37WRPaCM7L2x3Tw42wMa59yUq+KGSKdVx7l0v2FWMj6KIIZ2T9I Sxt9UraFZNb4ODSl8MeEmSzxF9RaPHkp4dVkv5yTkrUk2gN21YXVXI+OB1SferyoGTF+ XZ9rwMGZp6eLFsZ+tVjkw1Wqrkinb+ZuiyTzyWAyK0Y7Caal/NHXnzAUOIGYxqXIa5VQ cuQA== X-Forwarded-Encrypted: i=1; AJvYcCXsKiNLx1rVn31wTSKsWbYJBstN7MH23+3l6yFzZFHG0nI3TOGjOUr+9eEfYMB6AnfxvUurTia4nFoQlQtMJYnFPJGZVna4zbx4gw== X-Gm-Message-State: AOJu0Yxuv2vJFglIw41Z7lsK1d80AcGlP+sLoIo+PnDWlPYv9W1n9yKA lmq76g0igtmFSd8Z/4CMazZUpyfVVe/KItOkIy/3eBYikWfOeJjxfX2zyqQFw6VNqUaKRPL/VlO HRUM4vnki2vCd7Bsty4FOVVzNiEQ= X-Google-Smtp-Source: AGHT+IEbQKDXghQQMxgnLXf8LkF6KdyW9AyXPC6Liz/ocACaSM0/5cM1+G7OM6v/DYpUmHMSTjvKSVjDIbnVe2lg1MI= X-Received: by 2002:a05:6214:1cce:b0:6b5:23eb:3a49 with SMTP id 6a1803df08f44-6b78cab3f6dmr1688116d6.25.1721173304511; Tue, 16 Jul 2024 16:41:44 -0700 (PDT) List-Id: Production branch of FreeBSD source code List-Archive: https://lists.freebsd.org/archives/freebsd-stable List-Help: List-Post: List-Subscribe: List-Unsubscribe: X-BeenThere: freebsd-stable@freebsd.org Sender: owner-freebsd-stable@FreeBSD.org MIME-Version: 1.0 From: "Eric A. Borisch" Date: Tue, 16 Jul 2024 18:41:33 -0500 Message-ID: Subject: Re: Possible bug in zfs send or pipe implementation? To: jason@tubnor.net Cc: Garrett Wollman , rick.macklem@gmail.com, FreeBSD Stable Content-Type: multipart/alternative; boundary="0000000000006996ec061d65e301" X-Spamd-Bar: / X-Spamd-Result: default: False [-0.21 / 15.00]; SUSPICIOUS_RECIPS(1.50)[]; SUBJECT_ENDS_QUESTION(1.00)[]; FAKE_REPLY(1.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_SHORT(-0.71)[-0.710]; DMARC_POLICY_ALLOW(-0.50)[gmail.com,none]; R_DKIM_ALLOW(-0.20)[gmail.com:s=20230601]; R_SPF_ALLOW(-0.20)[+ip6:2607:f8b0:4000::/36]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; FREEMAIL_FROM(0.00)[gmail.com]; DWL_DNSWL_NONE(0.00)[gmail.com:dkim]; TAGGED_FROM(0.00)[FreeBSD]; RCVD_TLS_LAST(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; FREEMAIL_CC(0.00)[bimajority.org,gmail.com,freebsd.org]; TO_DN_SOME(0.00)[]; ARC_NA(0.00)[]; FROM_HAS_DN(0.00)[]; MISSING_XM_UA(0.00)[]; RCPT_COUNT_THREE(0.00)[4]; PREVIOUSLY_DELIVERED(0.00)[freebsd-stable@freebsd.org]; MLMMJ_DEST(0.00)[freebsd-stable@freebsd.org]; TO_MATCH_ENVRCPT_SOME(0.00)[]; FROM_EQ_ENVFROM(0.00)[]; DKIM_TRACE(0.00)[gmail.com:+]; MID_RHS_MATCH_FROMTLD(0.00)[]; ASN(0.00)[asn:15169, ipnet:2607:f8b0::/32, country:US]; TAGGED_RCPT(0.00)[]; FREEMAIL_ENVFROM(0.00)[gmail.com]; RCVD_COUNT_ONE(0.00)[1]; RCVD_IN_DNSWL_NONE(0.00)[2607:f8b0:4864:20::f33:from] X-Rspamd-Queue-Id: 4WNwZd5fLBz4WTF --0000000000006996ec061d65e301 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Tue, Jul 16, 2024 at 4:41=E2=80=AFPM Jason Tubnor wro= te: > Confirmed. Removing PV from the pipeline has solved my issue. I was able > to move 20TB around in the last 18 hours without issue. > > Sent from my iPhone > > On 15 Jul 2024, at 6:42=E2=80=AFPM, Jason Tubnor wrote= : > > =EF=BB=BF > > > On 15/07/2024 9:35 am, Garrett Wollman wrote: > > I'm currently running syncoid with `--quiet`, which removes `pv` from > the pipeline. It seems to be moving along, but of course I can't > easily tell what it's doing. > > Without `pv` in the way, the process reading the pipe is `lzop` > instead, which doesn't try to do any fancy select() stuff, it's just a > normal filter reading data from stdin and writing compressed data to > stdout, in sequence. > > I'm having the same issue here. I thought it was just my hardware and/or > syncoid that was having this issue on 3.5TB dataset replications. > > I'll try the quiet mode to see if removing PV solves the problem. I'm als= o > using compression=3Dnone so I won't see lzop. > > Cheers. > > So I thought I had things pinned down in the debugger, but to clarify: writ= e() is _not_ returning 0; there=E2=80=99s an alarm going off that is breaking o= ut of the write loop while total_written is 0. They (pv) are returning (within pv=E2=80=99s write loop) 0 =E2=80=94 and while that=E2=80=99s completely va= lid if they=E2=80=99ve set an alarm and it goes off, they are interpreting it as =E2=80=9Cwe can=E2=80=99= t write to the file anymore.=E2=80=9D I=E2=80=99m pretty sure the other users on here experiencing errors with pv= in the zfs send|recv path are hitting this. What is interesting is that select() is indicating that the stdout is writable, but when pv tries to write on it, it blocks and makes no progress. You can see this with this reproducer: $ pv /dev/zero | { while sleep 1.1; do dd bs=3D63k count=3D1 of=3D/dev/null= ; done ; } 1+0 records in1 [62.4KiB/s] [ <=3D> ] 1+0 records out 64512 bytes transferred in 0.001132 secs (57005867 bytes/sec) 0+1 records in2 [0.00 B/s] [ <=3D> ] 0+1 records out 1024 bytes transferred in 0.000218 secs (4689890 bytes/sec) 64.0KiB 0:00:11 [0.00 B/s] [ <=3D> Looking at the ktrace, and focusing on the selects and writes on stdout, we have: 9684 pv CALL select(0x2,0x820c27060,0x820c26fe0,0x820c26f60,0x820c26f50) 9684 pv RET select 1 #### This is "writable on stdout" based on what happens next 9684 pv CALL setitimer(ITIMER_REAL,0x820c27060,0) 9684 pv STRU itimerval { .interval =3D {1, 0}, .value =3D {1, 0} } 9684 pv STRU itimerval { .interval =3D {0, 0}, .value =3D {0, 0} } 9684 pv RET setitimer 0 9684 pv CALL write(0x1,0xe084aa3a000,0x20000) #### Tries to write 64k 9684 pv GIO fd 1 wrote 4096 bytes "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\= 0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0= \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\= 0\0\0\0\0\0\0\0\0\ .... 9684 pv RET write 65536/0x10000 #### Wrote 64k to stdout. This is all that ever makes it out [skip] (read; display update, etc.) 9684 pv CALL select(0x2,0x820c27060,0x820c26fe0,0x820c26f60,0x820c26f50) 9684 pv RET select 1 #### This is again "writable on stdout" based on what happens next 9684 pv CALL setitimer(ITIMER_REAL,0x820c27060,0) 9684 pv STRU itimerval { .interval =3D {1, 0}, .value =3D {1, 0} } 9684 pv STRU itimerval { .interval =3D {0, 0}, .value =3D {0, 0} } 9684 pv RET setitimer 0 9684 pv CALL write(0x1,0xe084aa3a000,0x20000) #### Tries to write 64k (since select() indicated it was writable) 9684 pv GIO fd 1 wrote 4096 bytes "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\= 0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0= \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\= 0\0\0\0\0\0\0\0\0\ .... 9684 pv RET write -1 errno 4 Interrupted system call #### Gets -1 / EINTR (pv sets up a 1s. timer), so no progress write()-ing at all after the successful select(). So it is a bit odd to return the FD as writable, but then block on a write to that FD / not make any progress. It appears Linux (if I compile pv without splice()) does a partial write to fill up the buffer, (total_written is then > 0) and then blocks on the subsequent call and ends up with -1/EINTR in the write loop, likewise on MacOS. Do we intend to return a FD as writable from select() only to have it block and not progress on the next attempted write()? I imagine there is a smaller write that would succeed at this point. Thanks, - Eric --0000000000006996ec061d65e301 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


On Tue, Jul 16, 20= 24 at 4:41=E2=80=AFPM Jason Tubnor <= jason@tubnor.net> wrote:
Confirmed. Removing PV from the pipeline has solved my issue. I was able t= o move 20TB around in the last 18 hours without issue.

Sent from = my iPhone

On 15 Jul 202= 4, at 6:42=E2=80=AFPM, Jason Tubnor <jason@tubnor.net> wrote:

=EF=BB=BF


On 15= /07/2024 9:35 am, Garrett Wollman wrote:
I'm currently running syncoid with `--quiet`, which removes `pv` =
from
the pipeline.  It seems to be moving along, but of course I can't
easily tell what it's doing.

Without `pv` in the way, the process reading the pipe is `lzop`
instead, which doesn't try to do any fancy select() stuff, it's jus=
t a
normal filter reading data from stdin and writing compressed data to
stdout, in sequence.

I'm having the same issue here. I thought it was = just my hardware and/or syncoid that was having this issue on 3.5TB dataset= replications.

I'll try the quiet mode to see if removing PV solv= es the problem. I'm also using compression=3Dnone so I won't see lz= op.

Cheers.


So I thought I had t= hings pinned down in the debugger, but to clarify:=C2=A0write() is _not_ returning 0; there=E2= =80=99s an alarm going off that is breaking out of the write loop while tot= al_written is 0. They (pv) are returning (within pv=E2=80=99s write loop) 0= =E2=80=94 and while that=E2=80=99s completely valid if they=E2=80=99ve set= an alarm and it goes off, they are interpreting it as =E2=80=9Cwe can=E2= =80=99t write to the file anymore.=E2=80=9D
<= span style=3D"font-family:-apple-system,helveticaneue">
I=E2= =80=99m pretty sure the other users on here experiencing errors with pv in = the zfs send|recv path are hitting this.

<= span style=3D"font-family:-apple-system,helveticaneue">What is interesting = is that select() is indicating that the stdout is writable, but when pv tri= es to write on it, it blocks and makes no progress. You can see this with t= his reproducer:

$ pv /dev/zero | { while sleep 1.1; do = dd bs=3D63k count=3D1 of=3D/dev/null; done ; }
1+0 records in1 [62.4KiB/s= ] [ =C2=A0 =C2=A0<=3D> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0]
= 1+0 records out
64512 bytes transferred in 0.001132 secs (57005867 bytes= /sec)
0+1 records in2 [0.00 =C2=A0B/s] [ =C2=A0 =C2=A0<=3D> =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ]
0+1 records out
1024 bytes tran= sferred in 0.000218 secs (4689890 bytes/sec)
64.0KiB 0:00:11 [0.00 =C2= =A0B/s] [ =C2=A0 =C2=A0<=3D>
<Never gets past 64KB&g= t;

Looking at the ktrace, and focusing on the sele= cts and writes on stdout, we have:

=C2=A0 9684 pv = =C2=A0 =C2=A0 =C2=A0 CALL =C2=A0select(0x2,0x820c27060,0x820c26fe0,0x820c26= f60,0x820c26f50)
=C2=A0 9684 pv =C2=A0 =C2=A0 =C2=A0 RET =C2=A0 select 1=

#### This is "writable on stdout" based= on what happens next

=C2=A0 9684 pv =C2=A0 =C2=A0= =C2=A0 CALL =C2=A0setitimer(ITIMER_REAL,0x820c27060,0)
=C2=A0 9684 pv = =C2=A0 =C2=A0 =C2=A0 STRU =C2=A0itimerval { .interval =3D {1, 0}, .value = =3D {1, 0} }
=C2=A0 9684 pv =C2=A0 =C2=A0 =C2=A0 STRU =C2=A0itimerval { = .interval =3D {0, 0}, .value =3D {0, 0} }
=C2=A0 9684 pv =C2=A0 =C2=A0 = =C2=A0 RET =C2=A0 setitimer 0
=C2=A0 9684 pv =C2=A0 =C2=A0 =C2=A0 CALL = =C2=A0write(0x1,0xe084aa3a000,0x20000)

#### T= ries to write 64k

=C2=A0 9684 pv =C2=A0 =C2=A0 =C2=A0 G= IO =C2=A0 fd 1 wrote 4096 bytes
=C2=A0 =C2=A0 =C2=A0 =C2=A0"\0\0\0\= 0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0= \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\= 0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0= \0\0\0\0\
....
=C2=A0 9684 pv =C2=A0 =C2=A0 =C2=A0 = RET =C2=A0 write 65536/0x10000

#### Wrote= 64k to stdout. This is all that ever makes it out

[skip] (read; display update, etc.)

=C2=A0 9684 pv =C2=A0 =C2=A0 =C2=A0 CALL =C2=A0sele= ct(0x2,0x820c27060,0x820c26fe0,0x820c26f60,0x820c26f50)
=C2=A0 9684 pv = =C2=A0 =C2=A0 =C2=A0 RET =C2=A0 select 1

####= This is again "writable on stdout" based on what happens next

=C2=A0 9684 pv =C2=A0 =C2=A0 =C2=A0 CALL =C2= =A0setitimer(ITIMER_REAL,0x820c27060,0)
=C2=A0 9684 pv =C2=A0 =C2=A0 =C2= =A0 STRU =C2=A0itimerval { .interval =3D {1, 0}, .value =3D {1, 0} }
=C2= =A0 9684 pv =C2=A0 =C2=A0 =C2=A0 STRU =C2=A0itimerval { .interval =3D {0, 0= }, .value =3D {0, 0} }
=C2=A0 9684 pv =C2=A0 =C2=A0 =C2=A0 RET =C2=A0 se= titimer 0
=C2=A0 9684 pv =C2=A0 =C2=A0 =C2=A0 CALL =C2=A0write(0x1,0xe08= 4aa3a000,0x20000)

#### Tries to write 64k (si= nce select() indicated it was writable)

=C2=A0 9684 pv = =C2=A0 =C2=A0 =C2=A0 GIO =C2=A0 fd 1 wrote 4096 bytes
=C2=A0 =C2=A0 =C2= =A0 =C2=A0"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\= 0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0= \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\= 0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\
....
=C2=A0 9684 p= v =C2=A0 =C2=A0 =C2=A0 RET =C2=A0 write -1 errno 4 Interrupted system call<= br>

#### Gets -1 / EINTR (pv sets up a 1s. timer),= so no progress write()-ing at all after the successful select().

So it is a bit odd to return the FD as writable, but then b= lock on a write to that FD / not make any progress. It appears Linux (if I = compile pv without splice()) does a partial write to fill up the buffer, (t= otal_written is then > 0) and then blocks on the subsequent call and end= s up with -1/EINTR in the write loop,=C2=A0likewise on MacOS.
Do we intend to return a FD as writable from select() only to h= ave it block and not progress on the next attempted write()? I imagine ther= e is a smaller write that would succeed at this point.

=
Thanks,
=C2=A0 =C2=A0- Eric

--0000000000006996ec061d65e301--