RFC: copy_file_range(3)

Rick Macklem rmacklem at uoguelph.ca
Fri Sep 25 16:26:47 UTC 2020


[the indentation seems to be a bit messed up, so I'll skip to near the end...]
On Wed, Sep 23, 2020 at 9:08 AM Rick Macklem <rmacklem at uoguelph.ca> wrote:
Rick Macklem wrote:
>Alan Somers wrote:
>[lots of stuff snipped]
>>1) In order to quickly respond to a signal, a program must use a modest len with copy_file_range
>For the programs you have mentioned, I think the only signal handling would
>be termination (<ctrl>C or SIGTERM if you prefer).
>I'm not sure what a reasonable response time for this is.
>I'd like to hear comments from others:
>- 1sec, less than 1sec, a few seconds, ...
>
>> 2) If a hole is larger than len, that will cause vn_generic_copy_file_range to
>> truncate the output file to the middle of the hole.  Then, in the next invocation,
>> truncate it again to a larger size.
>> 3) The result is a file that is not as sparse as the original.
>Yes. So, the trick is to use the largest "len" you can live with, given how long you
>are willing to wait for signal processing.
>
>> For example, on UFS:
>> $ truncate -s 1g sparsefile
>Not a very interesting sparse file. I wrote a little program to create one.
>> $ cp sparsefile sparsefile2
>> $ du -sh sparsefile*
>>  96K sparsefile
>>  32M sparsefile2
Btw, this happens because, at least for UFS (not sure about other file
systems), if you grow a file's size via VOP_SETATTR() of size, it allocates a
block at the new EOF, even though no data has been written there.
--> This results in one block being allocated at the end of the range used
    for a copy_file_range() call, if that file offset is within a hole.
    --> The larger the "len" argument, the less frequently it will occur.
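
If anyone wants to check this on their own file system, something like
the following should show it (untested sketch; the file name and sizes
are arbitrary):

/*
 * Grow an empty file with ftruncate(2), which goes through VOP_SETATTR()
 * of size, and look at the block count.  On UFS I'd expect it to be
 * nonzero even though no data was ever written.
 */
#include <sys/stat.h>

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	struct stat sb;
	int fd;

	fd = open("growdemo.tmp", O_CREAT | O_RDWR | O_TRUNC, 0644);
	if (fd < 0)
		return (1);
	ftruncate(fd, 1024 * 1024);	/* grow to 1Mbyte, no data written */
	fstat(fd, &sb);
	printf("size=%jd blocks=%jd\n", (intmax_t)sb.st_size,
	    (intmax_t)sb.st_blocks);
	close(fd);
	return (0);
}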

>>
>> My idea for a userland wrapper would solve this problem by using
>> SEEK_HOLE/SEEK_DATA to copy holes in their entirety, and use copy_file_range for
>> everything else with a modest len.  Alternatively, we could eliminate the need for
>> the wrapper by enabling copy_file_range for every file system, and making
>> vn_generic_copy_file_range interruptible, so copy_file_range can be called with
>> large len without penalizing signal handling performance.
>
>Well, I ran some quick benchmarks using the attached programs, plus "cp" both
>before and with your copy_file_range() patch.
>copya - Does what I think your plan is above, with a limit of 2Mbytes for "len".
>copyb - Just uses copy_file_range() with 128Mbytes for "len".
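For anyone who doesn't want to dig the attachments out of the archives,
the heart of copya is roughly the loop below (just the gist, error
handling trimmed; the attached copya.c is the authoritative version):

/*
 * Copy the data segments with copy_file_range(2) in 2Mbyte chunks and
 * skip the holes found via SEEK_DATA/SEEK_HOLE.  Assumes outfd refers
 * to a newly created (empty) file, so the skipped ranges stay holes.
 */
#include <sys/types.h>

#include <unistd.h>

#define	CHUNK	(2 * 1024 * 1024)

static void
sparse_copy(int infd, int outfd, off_t insize)
{
	off_t data, hole, inoff, outoff, off;
	ssize_t ret;
	size_t len;

	off = 0;
	while (off < insize) {
		data = lseek(infd, off, SEEK_DATA);
		if (data == -1)
			break;			/* only a hole remains */
		hole = lseek(infd, data, SEEK_HOLE);
		off = data;			/* the hole is simply skipped */
		while (off < hole) {
			len = (size_t)(hole - off);
			if (len > CHUNK)
				len = CHUNK;
			inoff = outoff = off;
			ret = copy_file_range(infd, &inoff, outfd, &outoff,
			    len, 0);
			if (ret <= 0)
				return;		/* error handling trimmed */
			off = inoff;
		}
	}
	ftruncate(outfd, insize);		/* cover a trailing hole */
}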
>
>I first created the sparse file with createsparse.c. It is admittedly a worst case,
>creating alternating holes and data blocks of the minimum size supported by
>the file system. (I ran it on a UFS file system created with defaults, so the minimum
>hole size is 32Kbytes.)
>The file is 1Gbyte in size with an Allocation size of 524576 ("ls -ls").
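The gist of createsparse.c, for anyone who doesn't grab the attachment
(again, the attached program is the authoritative version), is:

/*
 * Alternate 32Kbyte holes and 32Kbyte data blocks out to 1Gbyte.
 * 32Kbytes is the default UFS block size, so each gap really becomes
 * a hole.
 */
#include <sys/types.h>

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define	BLK	(32 * 1024)			/* UFS default block size */
#define	FSIZE	((off_t)1024 * 1024 * 1024)	/* 1Gbyte */

int
main(void)
{
	char buf[BLK];
	off_t off;
	int fd;

	fd = open("sparsefile", O_CREAT | O_RDWR | O_TRUNC, 0644);
	if (fd < 0)
		return (1);
	memset(buf, 'a', sizeof(buf));
	/* Skip 32Kbytes (a hole), write 32Kbytes of data, repeat. */
	for (off = BLK; off < FSIZE; off += 2 * BLK)
		pwrite(fd, buf, sizeof(buf), off);
	close(fd);
	return (0);
}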
>
>I then ran copya, copyb, old-cp and new-cp. For NFS, I redid the mount before
>each copy to avoid data caching in the client.
>Here's what I got:
>                 Elapsed time   #RPCs                      Allocation size ("ls -ls" on server)
>NFSv4.2
>copya            39.7sec        16384 copy + 32768 seek    524576
>copyb            10.2sec        104 copy                   524576
When I ran the tests I had vfs.nfs.maxcopyrange set to 128Mbytes on the
server. However, it was still the default of 10Mbytes on the client,
so this test run used 10Mbytes per Copy. (I had wondered why it did 104 Copies.)
With both set to 128Mbytes I got:
copyb             10.0sec        8 copy                     524576
>old-cp           21.9sec        16384 read + 16384 write   1048864
>new-cp           10.5sec        1024 copy                  524576
>
>NFSv4.1
>copya            21.8sec        16384 read + 16384 write   1048864
>copyb            21.0sec        16384 read + 16384 write   1048864
>old-cp           21.8sec        16384 read + 16384 write   1048864
>new-cp           21.4sec        16384 read + 16384 write   1048864
>
>Local on the UFS file system
>copya            9.2sec         n/a                        524576
This turns out to be just variability in the test. I get 7.9sec->9.2sec
for runs of all three of copya, copyb and new-cp for UFS.
I think it is caching related, since I wasn't unmounting/remounting the
UFS file system between test runs.
>copyb            8.0sec         n/a                        524576
>old-cp           15.9sec        n/a                        1048864
>new-cp           7.9sec         n/a                        524576
>
>So, for an NFSv4.2 mount, using SEEK_DATA/SEEK_HOLE is definitely
>a performance hit, due to all the RPC RTTs.
>Your patched "cp" does fine, although a larger "len" reduces the
>RPC count against the server.
>All variants using copy_file_range() retain the holes.
>
>For NFSv4.1, it (not surprisingly) doesn't matter, since only NFSv4.2
>supports SEEK_DATA/SEEK_HOLE and VOP_COPY_FILE_RANGE().
>
>For UFS, everything using copy_file_range() works pretty well and
>retains the holes.

>Although "copya" is guaranteed to retain the holes, it does run noticeably
>slower than the others. Not sure why. Do the extra SEEK_DATA/SEEK_HOLE
>syscalls cost that much?
Ignore this. It was just variability in the test runs.

>The limitation of not using SEEK_DATA/SEEK_HOLE is that you will not
>retain holes that straddle the byte range copied by two subsequent
>copy_file_range(2) calls.
This statement is misleading. These holes are partially retained, but there
will be a block allocated (at least for UFS) at the boundary, due to the property of
growing a file via VOP_SETATTR(size) as noted above.

>--> This can be minimized by using a large "len", but that large "len"
>      results in slower response to signal handling.
I'm going to play with "len" today and come up with some numbers
w.r.t. signal handling response time vs the copy_file_range() "len" argument.

>I've attached the little programs, so you can play with them.
>(Maybe try different sparse schemes/sizes? It might be fun to
> make the holes/blocks some random multiple of hole size up
> to a limit?)
>
>rick
>ps: In case he isn't reading hackers these days, I've added kib@
>      as a cc. He might know why UFS is 15% slower when SEEK_HOLE/
>      SEEK_DATA is used.
Alan Somers wrote:
> So it sounds like your main point is that for file systems with special support, 
> copy_file_range(2) is more efficient for many sparse files than 
> SEEK_HOLE/SEEK_DATA.
Well, for NFSv4.2 this is true. Who knows w.r.t. others in the future.

>  Sure, I buy that.  And secondarily, you don't see any reason not to increase the
> len argument in commands like cp up to somewhere around 128 MB, delaying 
> signal handling for about 1 second on a typical desktop (maybe set it lower on 
> embedded arches).
When I did some testing on my hardware (laptops with slow spinning disks),
I got up to about 2sec delay for 128Mbytes and up to about 1sec delay for
64Mbytes. I got a post that suggested that 1sec should be the target and
haven't heard differently from anyone else.
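(Those numbers are basically just "len" divided by what the drive can
sustain; assuming roughly 64Mbytes/sec of write bandwidth:
     128Mbytes / 64Mbytes/sec --> about 2sec
      64Mbytes / 64Mbytes/sec --> about 1sec
so faster storage could tolerate a proportionally larger "len".)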

Currently, there is a sysctl for NFS that clips the size of a copy_file_range() request,
so that the RPC response time is reasonable (1sec or less).
Maybe that sysctl should be replaced with a generic one for copy_file_range()
with a default of 64->128Mbytes. (I might make NFS use 1/2 of the sysctl
value, since the RPC response time shouldn't exceed 1sec.)
Does this sound reasonable?
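
One reason I think a generic clamp like that is safe: copy_file_range(2)
is already allowed to copy fewer bytes than asked for, so callers have to
loop on the return value anyway. Something along these lines (sketch only,
error handling trimmed):

#include <sys/types.h>

#include <unistd.h>

/*
 * Whatever the kernel clips "len" to, copy_file_range(2) just returns
 * the number of bytes it actually copied and the caller's loop (which
 * it needs anyway) goes around again, so signals get noticed between
 * the individual calls.
 */
static int
do_copy(int infd, int outfd, off_t todo)
{
	off_t inoff, outoff;
	ssize_t ret;

	inoff = outoff = 0;
	while (todo > 0) {
		ret = copy_file_range(infd, &inoff, outfd, &outoff,
		    (size_t)todo, 0);
		if (ret < 0)
			return (-1);		/* check errno as needed */
		if (ret == 0)
			break;			/* hit EOF on the input */
		todo -= ret;
	}
	return (0);
}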

>  And you think it's fine to allow copy_file_range on devfs, as long as the len 
> argument is clipped at some finite value.  If we make all of those changes, are
>  there any other reasons why the write/read fallback path would be needed?
I'm on the fence w.r.t. this one. I understand why you would prefer a call that
worked for special files, but I also like the idea that it is "Linux compatible".

I'd like to hear feedback from others on this.
Maybe I'll try asking this question separately on freebsd-current@ and
see if I can get others to respond.

rick

-Alan

