Latency issues with buf_ring

Barney Cordoba barney_cordoba at yahoo.com
Sun Nov 18 17:26:40 UTC 2012



--- On Thu, 1/19/12, John Baldwin <jhb at freebsd.org> wrote:

> From: John Baldwin <jhb at freebsd.org>
> Subject: Latency issues with buf_ring
> To: net at freebsd.org
> Cc: "Ed Maste" <emaste at freebsd.org>, "Navdeep Parhar" <np at freebsd.org>
> Date: Thursday, January 19, 2012, 11:41 AM
> The current buf_ring usage in various NIC drivers has a race that can
> result in really high packet latencies in some cases.  Specifically,
> the common pattern in an if_transmit routine is to use a try-lock on
> the queue and if that fails enqueue the packet in the buf_ring and
> return.  The race, of course, is that the thread holding the lock
> might have just finished checking the buf_ring and found it empty and
> be in the process of releasing the lock when the original thread fails
> the try lock.  If this happens, then the packet queued by the first
> thread will be stalled until another thread tries to transmit packets
> for that queue.  Some drivers attempt to handle this race (igb(4)
> schedules a task to kick the transmit queue if the try lock fails) and
> others don't (cxgb(4) doesn't handle it at all).  At work this race
> was triggered very often after upgrading from 7 to 8 with bursty
> traffic and caused numerous problems, so it is not a rare occurrence
> and needs to be addressed.
> 
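[For illustration, roughly the if_transmit pattern being described.  The
struct foo_txq, foo_txq_drain() and the field names are made up for this
sketch; they are not code from any driver or patch mentioned here.]

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/mbuf.h>
    #include <net/if.h>
    #include <net/if_var.h>         /* drbr_* wrappers around buf_ring */

    /* Hypothetical per-queue transmit state. */
    struct foo_txq {
        struct mtx       ftq_mtx;   /* queue lock */
        struct buf_ring *ftq_br;    /* software transmit ring */
    };

    /* Driver's dequeue-and-transmit loop (defined elsewhere). */
    static void foo_txq_drain(struct ifnet *ifp, struct foo_txq *txq);

    static int
    foo_transmit(struct ifnet *ifp, struct foo_txq *txq, struct mbuf *m)
    {
        int error;

        if (mtx_trylock(&txq->ftq_mtx)) {
            /* We own the queue: enqueue and drain it ourselves. */
            error = drbr_enqueue(ifp, txq->ftq_br, m);
            if (error == 0)
                foo_txq_drain(ifp, txq);
            mtx_unlock(&txq->ftq_mtx);
            return (error);
        }

        /*
         * Try-lock failed: just enqueue and return.  The race is that
         * the lock holder may already have seen the ring empty and be
         * about to unlock, so this packet can sit here until the next
         * unrelated call to if_transmit for this queue.
         */
        return (drbr_enqueue(ifp, txq->ftq_br, m));
    }
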
> (Note, all patches mentioned are against 8)
> 
> The first hack I tried to use was to simply always lock the queue
> after the drbr enqueue if the try lock failed and then drain the
> queue if needed (www.freebsd.org/~jhb/patches/try_fail.patch).  While
> this fixed my latency problems, it would seem that this breaks other
> workloads that the drbr design is trying to optimize.
> 
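[Roughly what that first hack amounts to, continuing the sketch above:
the try-lock-failed path simply takes the blocking lock after the
enqueue.]

        /* Try-lock failed: enqueue, then take the blocking lock anyway. */
        error = drbr_enqueue(ifp, txq->ftq_br, m);
        mtx_lock(&txq->ftq_mtx);
        if (!drbr_empty(ifp, txq->ftq_br))
            foo_txq_drain(ifp, txq);
        mtx_unlock(&txq->ftq_mtx);
        return (error);

[Every sender that loses the try-lock now blocks on the queue lock,
which is presumably the contention the drbr design was trying to
avoid.]
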
> After further hacking what I came up with was a variant of
> drbr_enqueue() that would atomically set a 'pending' flag during the
> enqueue operation.  The first thread to fail the try lock sets this
> flag (it is told that it set the flag by a new return value
> (EINPROGRESS) from the enqueue call).  The pending thread then
> explicitly clears the flag once it acquires the queue lock.  This
> should prevent multiple threads from stacking up on the queue lock,
> so that if multiple threads are dumping packets into the ring
> concurrently, all but two (the one currently draining the queue and
> the one waiting for the lock) can enqueue their packets and continue
> on without blocking.  One downside of this approach though is that
> each driver has to be changed to make an explicit call to clear the
> pending flag after grabbing the queue lock if the try lock fails.
> This is what I am currently running in production
> (www.freebsd.org/~jhb/patches/try_fail3.patch).
> 
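[A sketch of how an if_transmit routine might consume that variant,
again continuing the foo_txq example.  drbr_enqueue_pending() and
drbr_clear_pending() are invented stand-in names, not the actual
functions in try_fail3.patch.]

        if (mtx_trylock(&txq->ftq_mtx)) {
            error = drbr_enqueue(ifp, txq->ftq_br, m);
            if (error == 0)
                foo_txq_drain(ifp, txq);
            mtx_unlock(&txq->ftq_mtx);
            return (error);
        }

        /* Hypothetical enqueue variant that atomically sets 'pending'. */
        error = drbr_enqueue_pending(ifp, txq->ftq_br, m);
        if (error == EINPROGRESS) {
            /*
             * This thread set the flag, so it alone falls back to the
             * blocking lock; later senders see the flag already set,
             * enqueue, and return without touching the lock.
             */
            mtx_lock(&txq->ftq_mtx);
            drbr_clear_pending(txq->ftq_br);    /* also hypothetical */
            foo_txq_drain(ifp, txq);
            mtx_unlock(&txq->ftq_mtx);
            error = 0;
        }
        return (error);
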
> However, this still results in a lot of duplicated code in each
> driver that wants to support multiq.  Several folks have expressed a
> desire to move in a direction where the stack has explicit knowledge
> of transmit queues, allowing us to hoist some of this duplicated code
> out of the drivers and up into the calling layer.  After discussing
> this a bit with Navdeep (np@), the approach I am looking at is to
> alter the buf_ring code flow a bit to more closely model the older
> code flow with IFQ and if_start methods.  That is, have the
> if_transmit methods always enqueue each arriving packet to the
> buf_ring and then call an if_start-like method that drains a specific
> transmit queue.  This approach simplifies a fair bit of driver code
> and means we can potentially move the enqueue, etc. bits up into the
> calling layer and instead have drivers provide the per-transmit-queue
> start routine as the direct function pointer to the upper layers, a
> la if_start.
> 
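[The proposed shape, very roughly: the enqueue step and the call into a
per-queue start routine get hoisted into generic code, and the driver
supplies only the start routine, much as it used to supply if_start.
The names here are illustrative only, not from drbr.patch.]

        /* Driver-supplied per-queue start hook. */
        typedef void (*txq_start_fn)(struct ifnet *, struct foo_txq *);

        static int
        generic_mq_transmit(struct ifnet *ifp, struct foo_txq *txq,
            struct mbuf *m, txq_start_fn start)
        {
            int error;

            error = drbr_enqueue(ifp, txq->ftq_br, m);
            if (error != 0)
                return (error);

            /* Always kick the queue's start routine, a la if_start. */
            start(ifp, txq);
            return (0);
        }
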
> However, we would still need a way to close the latency race.  I've
> attempted to do that by inverting my previous 'thread pending' flag.
> Instead, I make the buf_ring store a 'busy' flag.  This flag is
> managed by the single-consumer buf_ring dequeue method (that
> drbr_dequeue() uses).  It is set to true when a packet is removed
> from the queue while there are more packets pending.  Conversely, if
> there are no other pending packets then it is set to false.  The
> assumption is that once a thread starts draining the queue, it will
> not stop until the queue is empty (or, if it has to stop for some
> other reason such as the transmit ring being full, the driver will
> restart draining of the queue until it is empty, e.g. after it
> receives a transmit completion interrupt).  Now when the if_transmit
> routine enqueues the packet, it will get either a real error, 0 if
> the packet was enqueued and the queue was not busy, or EINPROGRESS if
> the packet was enqueued and the queue was busy.  For the EINPROGRESS
> case the if_transmit routine just returns success.  For the 0 case it
> does a blocking lock on the queue lock and calls the queue's start
> routine (note that this means the busy flag is similar to the old
> OACTIVE interface flag).  This does mean that in some cases you may
> have one thread holding the lock while it sends what was the last
> packet in the buf_ring when another thread blocks; the first thread
> will see the new packet when it loops back around, so the second
> thread is wasting its time spinning, but in the common case I believe
> it will give the same parallelism as the current code.  OTOH, there
> is nothing to prevent multiple threads from "stacking up" in the new
> approach.  At least the try_fail3 patch ensured only one thread at a
> time would ever potentially block on the queue lock.
> 
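[Putting those return values together, the calling side might look
roughly like this, continuing the earlier sketch.  Here EINPROGRESS is
the new "enqueued, a drainer is already active" return described above
for the dequeue-managed busy flag, not stock drbr behaviour.]

        error = drbr_enqueue(ifp, txq->ftq_br, m);
        if (error == EINPROGRESS)
            return (0);         /* an active drainer will pick it up */
        if (error != 0)
            return (error);     /* real error, e.g. ring full */

        /*
         * Enqueued and the queue was idle: take the blocking lock and
         * drain it, much like clearing OACTIVE and calling if_start.
         */
        mtx_lock(&txq->ftq_mtx);
        foo_txq_drain(ifp, txq);
        mtx_unlock(&txq->ftq_mtx);
        return (0);
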
> Another approach might be to replace the 'busy' flag with the 'thread
> pending' flag from try_fail3.patch, but to clear the 'thread pending'
> flag anytime the dequeue method is called rather than using an
> explicit 'clear pending' method.  (Hadn't thought of that until
> writing this e-mail.)  That would perhaps prevent multiple threads
> from waiting on the queue lock.
> 
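[The dequeue-side tweak suggested here could be as small as something
like the following; 'br_pending' is a hypothetical flag field, not part
of the stock struct buf_ring.]

        static __inline struct mbuf *
        buf_ring_dequeue_clearing(struct buf_ring *br)
        {
            struct mbuf *m;

            m = buf_ring_dequeue_sc(br);
            /*
             * Clear 'pending' on every dequeue rather than requiring an
             * explicit clear call from the driver after it takes the lock.
             */
            atomic_store_rel_int(&br->br_pending, 0);
            return (m);
        }
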
> Note that the 'busy' approach (or the modification I mentioned above)
> does rely on the assumption I stated above, i.e. once a driver starts
> draining a queue, it will drain it until empty unless it hits an
> "error" condition (link went down, transmit ring full, etc.).  If it
> hits an "error" condition, the driver is responsible for restarting
> transmit when the condition clears.  I believe our drivers already
> work this way now.
> 
> The 'busy' patch is at http://www.freebsd.org/~jhb/patches/drbr.patch
> 
> -- 
> John Baldwin

Q1: Has this been corrected?

Q2: Are there any case studies or benchmarks for buf_ring, or is it just
blindly being used because someone claimed it was better and offered it
for free? One of the points of locking is to avoid race conditions, so
the fact that you have races in a supposed lock-less scheme seems more
than just ironic.


BC

