TCP stack lock contention with short-lived connections
Julien Charbon
jcharbon at verisign.com
Mon May 26 13:37:01 UTC 2014
Hi Navdeep
On 23/05/14 23:37, Navdeep Parhar wrote:
> On 05/23/14 13:52, Julien Charbon wrote:
>> On 23/05/14 14:06, Julien Charbon wrote:
>>> On 27/02/14 11:32, Julien Charbon wrote:
>>>> On 07/11/13 14:55, Julien Charbon wrote:
>>>>> On Mon, 04 Nov 2013 22:21:04 +0100, Julien Charbon
>>>>> <jcharbon at verisign.com> wrote:
>>>>>> I have put technical and how-to-repeat details in the PR
>>>>>> below:
>>>>>>
>>>>>> kern/183659: TCP stack lock contention with short-lived
>>>>>> connections
>>>>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=183659
>>>>>>
>>>>>> We are currently working on this performance improvement
>>>>>> effort; it will impact only the TCP locking strategy, not
>>>>>> the TCP stack logic itself. We will share the patches we made
>>>>>> on freebsd-net for review and improvement proposals; in any
>>>>>> case, this change will also require enough eyeballs to avoid
>>>>>> introducing tricky race conditions into the TCP stack.
>>
>> Attached are the two cumulative patches (tcp-scale-inp-list-v1.patch
>> and tcp-scale-pcbinfo-rlock-v1.patch) that we discussed the most at
>> BSDCan 2014.
>>
>> [...]
>>
>> _However_ it would be a miracle if this change did not introduce
>> new race condition(s) (hence the 'alpha' tag in the commit message),
>> as most of the TCP stack locking strategy now rests on the inpcb
>> lock's shoulders. That said, from our testing point of view, this
>> change is completely stable: no kernel/lock assertions, no
>> unexpected TCP behavior, and stable performance results. Moreover,
>> before tagging this change as 'beta' we need to test the following
>> features more thoroughly:
>>
>> - VNET
>> - PCBGROUP/RSS/TCP timer per CPU
>> - TCP Offloading (we need a NIC with good TCP offloading support)
>
> I can assess the impact (and fix any fallout) on the parts of the
> kernel that deal with TCP_OFFLOAD, and the TOE driver in
> dev/cxgbe/tom. But I was hoping to do that only after there was
> general agreement on net@ that these locking changes are sound and
> should be taken into HEAD. Lack of reviews seems to be holding this
> back, correct?
Correct, these changes have only been reviewed and tested internally
so far. Since we aren't finding any further issues ourselves, we are
sharing them for wider testing, comments and reviews.
A piece of advice for reviewers: when reading the code, don't be
fooled by the remaining ipi_lock read locks (INP_INFO_RLOCK); you
should consider ipi_lock as completely removed from the TCP stack.
All TCP code that used to be under the ipi_lock umbrella is now
protected only by the inpcb lock (see the sketch just below). Just
keep that in mind.
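
To make that concrete, here is a minimal kernel-context sketch of the
convention. It is not patch content: the function name is hypothetical,
only struct inpcb, V_tcbinfo and the locking macros are the real ones.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <netinet/in.h>
#include <netinet/in_pcb.h>
#include <netinet/tcp_var.h>

/* Hypothetical example of how reviewers should read the patched code. */
static void
tcp_example_per_connection_work(struct inpcb *inp)
{
    INP_INFO_RLOCK(&V_tcbinfo); /* still in the code; consider it gone */
    INP_WLOCK(inp);             /* the lock that actually protects the work */

    /* ... per-connection TCP state is read and modified here ... */

    INP_WUNLOCK(inp);
    INP_INFO_RUNLOCK(&V_tcbinfo);
}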
Below, just for your information, is more detail on the context of
these changes:
o The rough consensus at BSDCan was that there is shared interest in
improving the scalability of TCP workloads with a potentially high
rate of connection establishment and tear-down.
o Our requirements for this task were:
- Achieve more than 100k TCP connections per second without dropping
a single packet on the receive side
- Use a strategy that does not require changing all the network
stacks at once (TCP, UDP, RAW, etc.)
- Be able to introduce better performance progressively, leveraging
the mutex strategy already in place
- Keep the TCP stack stable (obviously)
o Solutions we tried to implement that failed:
#1 Decompose ipi_lock into finer-grained locks on ipi_hashbase
buckets (i.e. add a mutex in struct inpcbhead; see the first sketch
after this list). It did not work: the induced change was quite big,
and keeping the network stacks' shared code (in in_pcb.{h,c}) clean
was much more difficult than expected.
#2 Reverse the lock ordering (see the second sketch below) from:
ipi_lock > inpcb lock > ipi_hash_lock, pcbgroup locks
to:
inpcb lock > ipi_lock, ipi_hash_lock, pcbgroup locks
The change was just an all-or-nothing giant patch; it did not handle
the full inp list traversal functions, and having a different lock
ordering between the TCP stack and the other network stacks was quite
a challenge to maintain.
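
First sketch, for #1: a rough illustration of the abandoned per-bucket
idea. The bucket structure and lookup helper below are hypothetical;
only the inpcbhead/inpcb types, the queue(3) macros and the mutex(9)
calls are real.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>
#include <netinet/in_pcb.h>

/* Hypothetical per-bucket structure: the mutex lives next to the list. */
struct inp_hash_bucket {
    struct inpcbhead hb_head;   /* existing bucket list of inpcbs */
    struct mtx       hb_lock;   /* new per-bucket mutex, mtx_init'ed at setup */
};

/* Hypothetical lookup: lock only the bucket instead of the global ipi_lock. */
static struct inpcb *
in_pcblookup_bucket(struct inp_hash_bucket *hb /* , 4-tuple arguments */)
{
    struct inpcb *inp, *match = NULL;

    mtx_lock(&hb->hb_lock);
    LIST_FOREACH(inp, &hb->hb_head, inp_hash) {
        /* the 4-tuple comparison would go here */
    }
    mtx_unlock(&hb->hb_lock);
    return (match);
}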
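
Second sketch, for #2: an illustration of the attempted lock-order
reversal. The two functions are made up for the example; the locking
macros are the existing ones.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <netinet/in_pcb.h>
#include <netinet/tcp_var.h>

/* Historical order: ipi_lock first, then the inpcb lock. */
static void
lock_order_historical(struct inpcb *inp)
{
    INP_INFO_WLOCK(&V_tcbinfo);
    INP_WLOCK(inp);
    /* ... */
    INP_WUNLOCK(inp);
    INP_INFO_WUNLOCK(&V_tcbinfo);
}

/* Attempted order: inpcb lock first, ipi_lock only where still needed. */
static void
lock_order_attempted(struct inpcb *inp)
{
    INP_WLOCK(inp);
    INP_INFO_WLOCK(&V_tcbinfo);
    /* ... */
    INP_INFO_WUNLOCK(&V_tcbinfo);
    INP_WUNLOCK(inp);
}

The full inp list traversals and the other network stacks kept the
historical order, which is what made the mixed ordering so hard to
maintain.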
My 2 cents.
--
Julien