Re: Possible bug in zfs send or pipe implementation?

From: Eric A. Borisch <eborisch+FreeBSD_at_gmail.com>
Date: Tue, 16 Jul 2024 23:41:33 UTC
On Tue, Jul 16, 2024 at 4:41 PM Jason Tubnor <jason@tubnor.net> wrote:

> Confirmed. Removing PV from the pipeline has solved my issue. I was able
> to move 20TB around in the last 18 hours without issue.
>
> Sent from my iPhone
>
> On 15 Jul 2024, at 6:42 PM, Jason Tubnor <jason@tubnor.net> wrote:
>
> 
>
>
> On 15/07/2024 9:35 am, Garrett Wollman wrote:
>
> I'm currently running syncoid with `--quiet`, which removes `pv` from
> the pipeline.  It seems to be moving along, but of course I can't
> easily tell what it's doing.
>
> Without `pv` in the way, the process reading the pipe is `lzop`
> instead, which doesn't try to do any fancy select() stuff, it's just a
> normal filter reading data from stdin and writing compressed data to
> stdout, in sequence.
>
> I'm having the same issue here. I thought it was just my hardware and/or
> syncoid that was having this issue on 3.5TB dataset replications.
>
> I'll try the quiet mode to see if removing PV solves the problem. I'm also
> using compression=none so I won't see lzop.
>
> Cheers.
>
>
So I thought I had things pinned down in the debugger, but to clarify: write()
is _not_ returning 0; there’s an alarm going off that is breaking out of
the write loop while total_written is 0. They (pv) are returning (within
pv’s write loop) 0 — and while that’s completely valid if they’ve set an
alarm and it goes off, they are interpreting it as “we can’t write to the
file anymore.”

I’m pretty sure the other users on here experiencing errors with pv in the
zfs send|recv path are hitting this.

What is interesting is that select() is indicating that the stdout is
writable, but when pv tries to write on it, it blocks and makes no
progress. You can see this with this reproducer:

$ pv /dev/zero | { while sleep 1.1; do dd bs=63k count=1 of=/dev/null; done
; }
1+0 records in1 [62.4KiB/s] [    <=>            ]
1+0 records out
64512 bytes transferred in 0.001132 secs (57005867 bytes/sec)
0+1 records in2 [0.00  B/s] [    <=>             ]
0+1 records out
1024 bytes transferred in 0.000218 secs (4689890 bytes/sec)
64.0KiB 0:00:11 [0.00  B/s] [    <=>
<Never gets past 64KB>

Looking at the ktrace, and focusing on the selects and writes on stdout, we
have:

  9684 pv       CALL
 select(0x2,0x820c27060,0x820c26fe0,0x820c26f60,0x820c26f50)
  9684 pv       RET   select 1

#### This is "writable on stdout" based on what happens next

  9684 pv       CALL  setitimer(ITIMER_REAL,0x820c27060,0)
  9684 pv       STRU  itimerval { .interval = {1, 0}, .value = {1, 0} }
  9684 pv       STRU  itimerval { .interval = {0, 0}, .value = {0, 0} }
  9684 pv       RET   setitimer 0
  9684 pv       CALL  write(0x1,0xe084aa3a000,0x20000)

#### Tries to write 64k

  9684 pv       GIO   fd 1 wrote 4096 bytes

 "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\
....
  9684 pv       RET   write 65536/0x10000

#### Wrote 64k to stdout. This is all that ever makes it out

[skip] (read; display update, etc.)

  9684 pv       CALL
 select(0x2,0x820c27060,0x820c26fe0,0x820c26f60,0x820c26f50)
  9684 pv       RET   select 1

#### This is again "writable on stdout" based on what happens next

  9684 pv       CALL  setitimer(ITIMER_REAL,0x820c27060,0)
  9684 pv       STRU  itimerval { .interval = {1, 0}, .value = {1, 0} }
  9684 pv       STRU  itimerval { .interval = {0, 0}, .value = {0, 0} }
  9684 pv       RET   setitimer 0
  9684 pv       CALL  write(0x1,0xe084aa3a000,0x20000)

#### Tries to write 64k (since select() indicated it was writable)

  9684 pv       GIO   fd 1 wrote 4096 bytes

 "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\
....
  9684 pv       RET   write -1 errno 4 Interrupted system call

#### Gets -1 / EINTR (pv sets up a 1s. timer), so no progress write()-ing
at all after the successful select().

So it is a bit odd to return the FD as writable, but then block on a write
to that FD / not make any progress. It appears Linux (if I compile pv
without splice()) does a partial write to fill up the buffer,
(total_written is then > 0) and then blocks on the subsequent call and ends
up with -1/EINTR in the write loop, likewise on MacOS.

Do we intend to return a FD as writable from select() only to have it block
and not progress on the next attempted write()? I imagine there is a
smaller write that would succeed at this point.

Thanks,
   - Eric