CACHE_LINE_SIZE on x86

Jim Harris jim.harris at gmail.com
Thu Nov 1 18:36:15 UTC 2012


On Thu, Nov 1, 2012 at 7:44 AM, Andre Oppermann <andre at freebsd.org> wrote:

> On 01.11.2012 01:50, Jim Harris wrote:
>
>>
>>
>> On Thu, Oct 25, 2012 at 2:40 PM, Jim Harris <jim.harris at gmail.com> wrote:
>>
>>
>>     On Thu, Oct 25, 2012 at 2:32 PM, John Baldwin <jhb at freebsd.org> wrote:
>>      >
>>      > It would be good to know though if there are performance benefits
>>      > from avoiding sharing across paired lines in this manner.  Even if
>>      > it has its own MOESI state, there might still be negative effects
>>      > from sharing the pair.
>>
>>     On 2S, I do see further benefits by using 128 byte padding instead of
>>     64.  On 1S, I see no difference.  I've been meaning to turn off
>>     prefetching on my system to see if it has any effect in the 2S case -
>>     I can give that a shot tomorrow.
>>
>>
>> So tomorrow turned into next week, but I have some data finally.
>>
>> I've updated to HEAD from today, including all of the mtx_padalign
>> changes.  I tested 64 vs. 128 byte alignment on 2S amd64 (SNB Xeon).
>> My BIOS also has a knob to disable the adjacent line prefetching (MLC
>> spatial prefetcher), so I ran both 64b and 128b with this specific
>> prefetcher both enabled and disabled.
>>
>> MLC prefetcher enabled: 3-6% performance improvement, 1-5% decrease in
>> CPU utilization by using 128b padding instead of 64b.
>>
>
> Just to be sure.  The numbers you show are just for the one location you've
> converted to the new padded mutex and a particular test case?
>

There are two locations actually - the struct tdq lock in the ULE
scheduler, and the callout_cpu lock in kern_timeout.c.
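
For anyone not following the mtx_padalign work: the basic trick is just to
align and pad the lock's storage out to CACHE_LINE_SIZE so that nothing hot
belonging to another CPU can share its cache line.  A rough sketch of the
idea (hypothetical names, not the actual mtx_padalign definition):

#include <sys/param.h>          /* CACHE_LINE_SIZE */
#include <sys/lock.h>
#include <sys/mutex.h>

/*
 * Aligning the whole struct to CACHE_LINE_SIZE puts it on a line
 * boundary and also rounds sizeof() up to a multiple of the line size,
 * so neighboring data can never land in the same line as the lock.
 */
struct padded_lock {
        struct mtx      pl_mtx;
} __aligned(CACHE_LINE_SIZE);

With CACHE_LINE_SIZE at 128 on x86, that padding also keeps the lock clear
of the adjacent 64-byte line pulled in by the MLC spatial prefetcher, which
is what the 64b vs. 128b comparison above is measuring.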

And yes, I've only been running a custom benchmark I developed here to help
uncover some of these areas of spinlock contention.  It was
originally used for NVMe driver performance testing, but has been helpful
in uncovering some other issues outside of the NVMe driver itself (such as
these contended spinlocks).  It spawns a large number of kernel threads,
each of which submits an I/O and then sleeps until it is woken by the
interrupt thread when the I/O completes.  It stresses the scheduler and
also callout since I start and stop a timer for each I/O.
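
The benchmark itself isn't public, but the per-thread pattern looks roughly
like the sketch below.  io_submit(), io_timeout() and struct worker_ctx are
made-up names for illustration; only the callout and sleep/wakeup calls are
real KPIs, and the mutex/callout are assumed to be initialized elsewhere
(mtx_init()/callout_init_mtx()).

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/callout.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/priority.h>

struct worker_ctx {
        struct mtx      mtx;            /* protects 'done' */
        struct callout  timer;          /* per-I/O timeout */
        int             done;           /* set by the interrupt thread */
};

static void io_submit(struct worker_ctx *);     /* hypothetical driver call */
static void io_timeout(void *);                 /* hypothetical timeout handler */

static void
worker(void *arg)
{
        struct worker_ctx *ctx = arg;

        for (;;) {
                /* Arm a per-I/O timeout; this is what hits the
                   callout_cpu lock. */
                callout_reset(&ctx->timer, hz, io_timeout, ctx);

                io_submit(ctx);

                /* Sleep until the interrupt thread sets done and calls
                   wakeup(ctx); the wakeup path is what hits the
                   scheduler's tdq lock. */
                mtx_lock(&ctx->mtx);
                while (ctx->done == 0)
                        mtx_sleep(ctx, &ctx->mtx, PRIBIO, "iowait", 0);
                ctx->done = 0;
                mtx_unlock(&ctx->mtx);

                /* The I/O completed before the timeout fired, so cancel it. */
                callout_stop(&ctx->timer);
        }
}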

I think the only thing this proves is that there is still benefit to having
x86 CACHE_LINE_SIZE set to 128.
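
For reference, the amd64 <machine/param.h> definition I'm referring to is
(as of this writing) simply:

#define CACHE_LINE_SHIFT        7
#define CACHE_LINE_SIZE         (1 << CACHE_LINE_SHIFT)        /* 128 bytes */

i.e. two 64-byte lines, which covers the adjacent-line (MLC spatial)
prefetcher pairing discussed above.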

Thanks,

-Jim

