RFC: copy_file_range(3)

Rick Macklem rmacklem at uoguelph.ca
Sun Sep 20 23:14:57 UTC 2020


Alan Somers wrote:
>On Sun, Sep 20, 2020 at 9:58 AM Rick Macklem <rmacklem at uoguelph.ca> wrote:
>>Alan Somers wrote:
>>>copy_file_range(2) is nifty, but it has a few sharp edges:
>>>1) Certain file systems don't support it, necessitating a write/read based
>>>fallback
>>>2) It doesn't handle sparse files as well as SEEK_HOLE/SEEK_DATA
>>>3) It's slightly tricky to both efficiently deal with holes and also
>>>promptly respond to signals
>>>
>>>These problems aren't terribly hard, but it seems to me like most
>>>applications that use copy_file_range would share the exact same
>>>solutions.  In particular, I'm thinking about cp(1), dd(1), and
>>>install(8).  Those three could benefit from sharing a userland wrapper that
>>>handles the above problems.
>>>
>>>Should we add such a wrapper to libc?  If so, what should it be called, and
>>>should it be public or just private to /usr/src ?
>>There has been a discussion on src-committers which I suggested should
>>be taken to a public mailing list.
>>
>>The basic question is...
>>Whether or not the copy_file_range(2) syscall should be compatible with
>>the Linux one.
>>When I did the syscall, I tried to make it Linux-compatible, arguing that
>>Linux is now a de-facto standard.
>>The Linux syscall only works on regular files, which is why Alan's patch for
>>cp required a "fallback to the old way" for VCHR files like /dev/null.
>>
>>He is considering a wrapper in libc to provide FreeBSD specific semantics,
>>which I have no problem with, so long as the naming and man page make
>>it clear that it is not compatible with the Linux syscall.
>>(Personally, I'd prefer a wrapper in libc to making the actual syscall non-Linux
>> compatible, but that is just mho.)
>>
>>Hopefully this helps clarify what Alan is asking, rick
>>
>I don't think the two questions are equivalent.  I think that
>copy_file_range(2) ought to work on character devices.  Separately, even if
>it does, I think a userland wrapper would still be useful.  It would still
>be able to handle sparse files more efficiently than the kernel-based
>vn_generic_copy_file_range.
I saw this also stated in your #2 above, but wonder why you think a wrapper
would handle holes more efficiently.
vn_generic_copy_file_range() does look for holes via SEEK_DATA/SEEK_HOLE
just like a wrapper would and retains them as far as possible. It also looks
for blocks of all zero bytes for file systems that do not support SEEK_DATA/
SEEK_HOLE (like NFS versions prior to 4.2) and creates holes for these in
the output file.
--> The only cases that I am aware of where the holes are not retained are:
     - When the min holesize for the output file is larger than that of the
       input file.
     - When the hole straddles the byte range specified for the syscall.
       (Or when the hole straddles two copy_file_range(2) syscalls, if you
        prefer.)
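
For illustration, the SEEK_DATA/SEEK_HOLE walk described above can be
sketched from userland. This is a minimal sketch in C, not the code in
vn_generic_copy_file_range(); the helper name count_data_bytes() is
hypothetical, and it assumes a file system where lseek(2) supports
SEEK_DATA/SEEK_HOLE.

```c
#define _GNU_SOURCE   /* glibc needs this for SEEK_DATA/SEEK_HOLE */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the number of bytes of `path` that lie in
 * data regions (that is, not in holes), or -1 on error.
 */
static off_t
count_data_bytes(const char *path)
{
	struct stat sb;
	off_t pos = 0, total = 0;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd == -1)
		return (-1);
	if (fstat(fd, &sb) == -1) {
		close(fd);
		return (-1);
	}
	while (pos < sb.st_size) {
		off_t data, hole;

		/* Find the start of the next data region at or after pos. */
		data = lseek(fd, pos, SEEK_DATA);
		if (data == -1) {
			if (errno == ENXIO)	/* only a hole remains before EOF */
				break;
			close(fd);
			return (-1);
		}
		/* Find where that data region ends (EOF counts as a hole). */
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole == -1) {
			close(fd);
			return (-1);
		}
		total += hole - data;
		pos = hole;
	}
	close(fd);
	return (total);
}
```

A copy loop would copy each [data, hole) region and skip the gaps between
them, which leaves holes in the output wherever the output file system's
minimum hole size allows it.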

If you are copying the entire file and do not care how long the syscall
takes (which also implies how long it will take for a termination signal
like <ctrl>C to be handled), the most efficient usage is to specify
a "len" argument equal to UINT64_MAX.
--> This will usually copy the whole file in one gulp, although it is not
       guaranteed to copy everything, given the Linux semantics definition
       of it (an NFSv4.2 server can simply choose to copy less, for example).
       --> This allows the kernel to use whatever block size works efficiently
             and does not require an allocation of a large userspace buffer for
             the data, nor that the data be copied to/from userspace.

The problems with doing the whole file in one gulp are:
- A large file can take quite a while and any signal won't be processed until
  the gulp is done.
  --> If you wrote a program that allocated a 100Gbyte buffer and then
        copied a file using read(2)/write(2) with a size of 100Gbytes in a loop,
        you'd end up with the same result.
- As kib@ noted, if the input file never reports EOF (as /dev/zero does),
  then the "one gulp" wouldn't end until storage is exhausted on the
  output file's device and <ctrl>C wouldn't stop it (since it is one big
  syscall).
  --> As such, I suggested that, if the syscall is extended to allow VCHR,
        the "len" argument be clipped at "K Mbytes" for that case, to
        avoid filling the storage device before being able to <ctrl>C out
        of it.
I suppose the answer for #3 is...
- smaller "len" allows for quicker response to signals
but
- smaller "len" results in less efficient use of the syscall.

Your patch for "cp" seemed fine, but used a small "len" and, as such,
made the use of copy_file_range(2) less efficient.

All I see the wrapper doing is handling the VCHR case (if the syscall remains
as it is now and returns EINVAL to be compatible with Linux) and making
some rather arbitrary choice w.r.t. how big "len" should be.
--> Choosing an appropriate "len" might better be left to the specific use
      case, I think?
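
To illustrate the trade-off, here is a minimal C sketch of what such a
wrapper might do, under stated assumptions: the names copy_fd_chunked(),
rw_fallback() and stop_requested are hypothetical, the caller is assumed
to install a signal handler that sets stop_requested, and falling back on
ENOSYS and EXDEV (in addition to the EINVAL mentioned above) is an
assumption. It is not taken from the cp patch.

```c
#define _GNU_SOURCE   /* glibc needs this to declare copy_file_range() */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Set from a signal handler installed by the caller (not shown). */
static volatile sig_atomic_t stop_requested;

/* Copy up to len bytes with read(2)/write(2); returns bytes copied or -1. */
static ssize_t
rw_fallback(int infd, int outfd, size_t len)
{
	char buf[65536];
	ssize_t n, w, off;

	if (len > sizeof(buf))
		len = sizeof(buf);
	n = read(infd, buf, len);
	if (n <= 0)
		return (n);
	for (off = 0; off < n; off += w) {
		w = write(outfd, buf + off, (size_t)(n - off));
		if (w <= 0)
			return (-1);
	}
	return (n);
}

/*
 * Copy from infd to outfd until EOF, asking for at most `chunk` bytes per
 * call. Returns the total copied, or -1 on error (errno is EINTR when the
 * stop flag was seen between calls).
 */
static off_t
copy_fd_chunked(int infd, int outfd, size_t chunk)
{
	off_t total = 0;
	int use_cfr = 1;
	ssize_t n;

	assert(chunk > 0);
	for (;;) {
		if (stop_requested) {
			errno = EINTR;
			return (-1);
		}
		if (use_cfr) {
			n = copy_file_range(infd, NULL, outfd, NULL, chunk, 0);
			if (n == -1 && (errno == EINVAL || errno == ENOSYS ||
			    errno == EXDEV)) {
				use_cfr = 0;	/* e.g. VCHR input: use the old way */
				continue;
			}
		} else {
			n = rw_fallback(infd, outfd, chunk);
		}
		if (n == -1)
			return (-1);
		if (n == 0)		/* EOF on the input */
			return (total);
		total += n;
	}
}
```

A small chunk (say, 1 Mbyte) checks the flag often but makes more
syscalls; a very large chunk such as SSIZE_MAX approximates the "one gulp"
case, and the loop is still needed because a call may copy less than was
asked for.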

In summary, the question is mostly whether VCHR gets handled by the syscall
or by a wrapper.

rick

-Alan


More information about the freebsd-hackers mailing list