Much improved sendfile(2) kernel implementation
Robert Watson
rwatson at FreeBSD.org
Thu Sep 21 06:59:11 PDT 2006
On Thu, 21 Sep 2006, Andre Oppermann wrote:
>> There should be unconditional M_NOWAIT. Oops, the M_DONTWAIT in the current
>> code is incorrect. It is present since rev. 1.171. If the m_uiotombuf()
>> fails the current code returns from syscall without error! Before rev.
>> 1.171, there wasn't m_uiotombuf(), the mbuf header was allocated below,
>> with correct wait argument.
>>
>> The wait argument for m_uiotombuf() should be changed to M_WAITOK, but in a
>> separate commit.
<snip>
>> This one should be M_WAITOK always. It is M_TRYWAIT (equal to M_WAITOK) in
>> the current code.
>
> The reason why I changed the mbuf allocations with SS_NBIO is the rationale
> of sendfile() and the performance evaluation that was done by alc@ students.
> sendfile() has two flags which control its blocking behavior. Non blocking
> socket (SS_NBIO) and SF_NODISKIO. The latter is necessary because file
> reads or writes are normally not considered to be blocking. The most
> optimal sendfile() usage is with a single process doing accept(), parsing
> and then sendfile that should never ever block on anything. This way the
> main process then can use kqueue for all the socket stuff and it can
> transfer all sends that require disk I/O to a child process or thread to
> provide a context for the read. Meanwhile the main process is free to
> accept further connections and to continue serving existing connections.
> Having sendfile() block in mbuf allocation for the header, on sfbufs or
> anything else is not desirable and must be avoided. I know I'm extending
> the traditional definition of SS_NBIO a bit but it's fully in line with the
> semantics and desired operational behavior of sendfile(). The paper by
> alc@'s students clearly identifies this as the main property of a sendfile
> implementation besides its zero copy nature.
The semantics with regard to waiting are a bit confusing, but the existing
model has a fairly specific meaning that has some benefits. Normally we have
three dispositions for a network I/O operation:
(1) Fully blocking -- the default disposition. The operation may block for
    several reasons, most commonly insufficient buffer space/data in the
    socket buffer, insufficient memory for the kernel to perform the
    operation (usually mbufs), or a user space page fault in reading or
    writing the data.

(2) Non-blocking -- SS_NBIO, MSG_NBIO, etc. The operation will not block if
    there is insufficient data/buffer space. Typically, this is aligned with
    select()/poll()/kqueue()'s notion of data or space.

(3) Non-waiting -- MSG_DONTWAIT. The operation will not sleep in the kernel
    for any reason, either as part of I/O blocking or for memory allocation.
    It may still sleep if a page fault occurs, but as kernel senders send
    using pinned kernel memory, this isn't an issue.
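
As a rough userland illustration of how dispositions (2) and (3) are
requested, the sketch below sets O_NONBLOCK on a descriptor (which is what
sets SS_NBIO on the socket) or passes MSG_DONTWAIT per call. The helper
names set_nonblocking() and fill_until_error() are made up for this example.
Note that from user space both variants just report EAGAIN once the socket
buffer is full; the difference in memory-allocation sleeping described above
happens inside the kernel and is not observable here.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-blocking mode. Returns 0 on success. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper: keep sending 4 KB chunks with the given send() flags
 * until send() fails. Returns the errno of that failure, or 0 if the hard
 * cap on iterations is reached first.
 */
static int fill_until_error(int fd, int send_flags)
{
    char chunk[4096];

    assert(fd >= 0);
    memset(chunk, 'x', sizeof(chunk));
    for (int i = 0; i < 65536; i++) {
        if (send(fd, chunk, sizeof(chunk), send_flags) == -1)
            return errno;
    }
    return 0;
}
```

With a connected socket pair whose peer never reads, either
fill_until_error(fd, MSG_DONTWAIT) on a blocking socket or
fill_until_error(fd, 0) after set_nonblocking(fd) should stop with EAGAIN
rather than hanging.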
There are a few known bugs -- for example, in zero-copy mode, we may block
waiting for an sf_buf with MSG_DONTWAIT set (this used to be the case; I haven't
checked lately). However, for applications, you typically run in (1) or (2)
of the above, where the notion of blocking is aligned with a notion of buffer
space or data, not with a notion of kernel sleeping. In particular, it has to
do with the definition used by select()/kqueue()/poll(). If you make SS_NBIO
sockets return immediately if there is no memory free for sendfile(), this
will be inconsistent with the normal behavior in which select() returning
writable means that you will be able to write -- so an application that sees
the socket as writable via select() might sit there spinning, retrying the
I/O operation while it repeatedly returns an error saying it wasn't ready.
My feeling is that we should restrict strictly non-sleeping behavior to the
MSG_DONTWAIT case -- if desired, we could add SF_DONTWAIT to control whether
any sleeping happens at all. SS_NBIO should not return an error in a limited
memory case; it should sleep waiting for memory, as sleeping (mutexes, memory
allocation, ...) is not considered blocking. Blocking should continue to
refer to the socket buffer-related behavior, and specifically sbwait().
However, we should fix any bugs in MSG_DONTWAIT for sosend/soreceive (and
hence sendmsg, recvmsg) that cause it to sleep improperly -- I'm not sure if
the zero-copy case still does it wrong, but that's potentially a problem if we
ever support zero-copy send from within the kernel, as sosend/soreceive can be
called while a mutex is held or in network interrupt context, which is why the
flag is needed.
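
The policy argued for above can be summarized as a small decision table.
The following is a self-contained model, not kernel code: the MODEL_*
constants and struct wait_policy are invented stand-ins for SS_NBIO,
MSG_DONTWAIT, and the choice between sbwait() and M_WAITOK/M_NOWAIT
allocation, and they do not use the kernel's real values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; NOT the kernel's definitions. */
enum {
    MODEL_SS_NBIO      = 0x1,   /* socket marked non-blocking */
    MODEL_MSG_DONTWAIT = 0x2    /* caller must never sleep in the kernel */
};

struct wait_policy {
    bool may_block_on_sockbuf;  /* may wait for buffer space/data (sbwait) */
    bool may_sleep_for_memory;  /* allocations may sleep (M_WAITOK-like) */
};

/* Map the model flags onto the three dispositions described in the text. */
static struct wait_policy wait_policy_for(int flags)
{
    struct wait_policy p = { true, true };      /* (1) fully blocking */

    assert((flags & ~(MODEL_SS_NBIO | MODEL_MSG_DONTWAIT)) == 0);
    if (flags & MODEL_SS_NBIO)
        p.may_block_on_sockbuf = false;         /* (2) non-blocking */
    if (flags & MODEL_MSG_DONTWAIT) {
        p.may_block_on_sockbuf = false;         /* (3) non-waiting */
        p.may_sleep_for_memory = false;
    }
    return p;
}
```

The point of the model is the middle row: a non-blocking socket still allows
sleeping for memory, and only the non-waiting flag turns that off.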
Robert N M Watson
Computer Laboratory
University of Cambridge