TCP stack lock contention with short-lived connections

Julien Charbon jcharbon at verisign.com
Fri Mar 7 12:58:40 UTC 2014


  Hi John,

On 06/03/14 22:57, John-Mark Gurney wrote:
> Julien Charbon wrote this message on Thu, Feb 27, 2014 at 11:32 +0100:
>> [...]
>>   Any thoughts on this particular behavior?
>
> One thing that I noticed is that you now lock/unlock the tw and inp lock a
> lot...  Have you thought about grabbing the TW lock once, grabbing some/all
> of the ones necessary to process and then process them in a second step?
> If the bulk processing is still an issue, then you could process them in
> blocks of 50 or so, that way the number of lock/unlock cycles is reduced...

  First of all, thanks; feedback is highly valuable to us.  I did
indeed first try a kind of bulk processing, with something like:

  tcp_tw_2msl_scan() {

         struct tcptw *tw;
         int i, end = 0, count = 100;

         for (;;) {
                 INP_INFO_WLOCK(&V_tcbinfo);
                 for (i = 0; i < count; ++i) {
                         tw = TAILQ_FIRST(&V_twq_2msl);
                         /* Stop when the queue is empty or the head
                            entry has not expired yet. */
                         if (tw == NULL || (tw->tw_time - ticks) > 0) {
                                 end = 1;
                                 break;
                         }
                         INP_WLOCK(tw->tw_inpcb);
                         tcp_twclose(tw, 0);
                 }
                 /* Drop the lock between batches so that tcp_input()
                    gets a chance to take it. */
                 INP_INFO_WUNLOCK(&V_tcbinfo);
                 if (end)
                         break;
         }
         return (NULL);
 }

  And I got the best results with 'count' set to 1, which led us to
the currently proposed patch.  Thus the main goal here is to
prioritize the NIC interrupt handler calls over the tcp_tw_2msl_scan()
calls in the battle for INP_INFO.  As you proposed, we can add a
sysctl (or a #define) to configure the maximum number of tw objects
cleaned up under a single INP_INFO_WLOCK() hold, to get finer control
over how the tw objects are purged.
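
  For example, a minimal sketch of such a knob (the name
tw_scan_limit is purely a placeholder, not part of the patch) could
look like:

 static int tw_scan_limit = 100;
 SYSCTL_INT(_net_inet_tcp, OID_AUTO, tw_scan_limit, CTLFLAG_RW,
     &tw_scan_limit, 0,
     "Max tw entries closed per INP_INFO_WLOCK() hold");

  tcp_tw_2msl_scan() would then use tw_scan_limit in place of the
hard-coded 'count' above.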

  That said, our high-level approach is to treat the NIC interrupt
code path (i.e. tcp_input()) as critical and to remove as many
contention points from it as possible, which is our long-term goal.
This change is just one step on that (long and not straightforward)
path.

>> +/*
>> + * Drop a refcount on an tw elevated using tw_pcbref().  If it is
>> + * valid, we return with the tw lock held.
>> + */
>
> I assume you mean that you return with the tw lock unlocked?  at least
> that's what the code reads to me...
>
>> +static int
>> +tw_pcbrele(struct tcptw *tw)
>> +{
>> +	TW_WLOCK_ASSERT(V_tw_lock);
>> +	KASSERT(tw->tw_refcount > 0, ("%s: refcount 0", __func__));
>> +
>> +	if (!refcount_release(&tw->tw_refcount)) {
>> +		TW_WUNLOCK(V_tw_lock);
>> +		return (0);
>> +	}
>> +
>> +	uma_zfree(V_tcptw_zone, tw);
>> +	TW_WUNLOCK(V_tw_lock);
>> +	return (1);
>> +}

  You are completely right, my mistake.  I will update the comment.
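
  Just to sketch the intended wording, the updated comment would read
something like:

 /*
  * Drop a refcount on a tw elevated using tw_pcbref().  In both cases
  * the tw lock is released before returning; the return value tells
  * whether the tw was freed (1) or not (0).
  */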

> Otherwise looks like a good patch...

  Thanks again for your time.

--
Julien



