svn commit: r238755 - head/sys/x86/x86

Jim Harris jim.harris at gmail.com
Wed Jul 25 18:00:43 UTC 2012


On Wed, Jul 25, 2012 at 10:32 AM, Konstantin Belousov
<kostikbel at gmail.com> wrote:
> On Thu, Jul 26, 2012 at 12:15:54AM +1000, Bruce Evans wrote:
>> On Wed, 25 Jul 2012, Konstantin Belousov wrote:
>>
>> >On Wed, Jul 25, 2012 at 10:20:02AM +0300, Andriy Gapon wrote:
>> >>on 25/07/2012 01:10 Jim Harris said the following:
>> >>>Author: jimharris
>> >>>Date: Tue Jul 24 22:10:11 2012
>> >>>New Revision: 238755
>> >>>URL: http://svn.freebsd.org/changeset/base/238755
>> >>>
>> >>>Log:
>> >>>  Add rmb() to tsc_read_##x to enforce serialization of rdtsc captures.
>> >>>
>> >>>  Intel Architecture Manual specifies that rdtsc instruction is not
>> >>>  serialized,
>> >>>  so without this change, TSC synchronization test would periodically
>> >>>  fail,
>> >>>  resulting in use of HPET timecounter instead of TSC-low.  This caused
>> >>>  severe performance degradation (40-50%) when running high IO/s
>> >>>  workloads due to
>> >>>  HPET MMIO reads and GEOM stat collection.
>> >>>
>> >>>  Tests on Xeon E5-2600 (Sandy Bridge) 8C systems were seeing TSC
>> >>>  synchronization
>> >>>  fail approximately 20% of the time.
>> >>
>> >>Should rather the synchronization test be fixed if it's the culprit?
>> >Synchronization test for what ?
>> >
>> >>Or is this change universally good for the real uses of TSC?
>>
>> It's too slow for real uses.  But synchronization code, and some uses
>> that requires serialization may need it for, er, synchronization and
>> serialization.
>>
>> It's hard to think of many uses that need serialization.  I often use
>> it for timing instructions.  For timng a large number of instructions,
>> serialization doesn't matter since errors of a few tens in a few billion
>> done matter.  For timing a small number of instructions, I don't want
>> serialization, since the serialization invalidates the timing.
>>
>> Most uses in FreeBSD are for timecounters.  Timecounters deliver the
>> current time.  This is unrelated to whatever instructions haven't
>> completed when the TSC is read.  Except possibly when the time needs
>> to be synchronized across CPUs, and when the uncompleted instruction
>> is a TSC read.
>>
>> >For tsc test, this means that after the change RDTSC executions are not
>> >reordered on the single core among themself. As I understand, CPU has
>> >no dependency noted between two reads of tsc by RDTSC, which allows
>> >later read to give lower value of counter.
>>
>> Gak.  Even when they are in the same instruction sequence?  Even though
>> the TSC reads fixed registers and some other instructions in the sequence
>> between the TSC use these registers?  The CPU would have to do significant
>> register renaming to break this.
> As I could only speculate, I believe that any modern CPU executes RDTSC
> as at least two separate steps, one is read from internal counter, and
> second is the registers update. It seems that the first kind of action
> is not serialized. I have no other explanation for the Jim findings.
>
> I also asked Jim to test whether the cause the TSC sync test failure
> is the lack of synchronization between gathering data and tasting it,
> but ut appeared that the reason is genuine timecounter value going
> backward.

I wonder if instead of timecounter going backward, that TSC test
fails because CPU speculatively performs rdtsc instruction in relation
to waiter checks in smp_rendezvous_action.  Or maybe we are saying
the same thing.

>
> Sp the bug seems real, and I cannot imagine we will live with the known
> defect in timecounters which can step back.
>>
>> >This is fixed by Intel by
>> >introduction of RDTSCP instruction, which is defined to be serialization
>> >point, and use of which (instead of LFENCE; RDTSC sequence) also fixes
>> >test, as confirmed by Jim.
>>
>> This is not a fix if it is full serialization.  It just gives slowness
>> using a single instruction instead of a couple.
>>
>> >In fact, I now think that we should also apply the following patch.
>> >Otherwise, consequtive calls to e.g. binuptime(9) could return decreased
>> >time stamps. Note that libc __vdso_gettc.c already has LFENCE nearby the
>> >tsc reads, which was done not for this reason, but apparently needed for
>> >the reason too.
>> >
>> >diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
>> >index 085c339..229b351 100644
>> >--- a/sys/x86/x86/tsc.c
>> >+++ b/sys/x86/x86/tsc.c
>> >@@ -594,6 +594,7 @@ static u_int
>> >tsc_get_timecount(struct timecounter *tc __unused)
>> >{
>> >
>> >+    rmb();
>> >     return (rdtsc32());
>> >}
>>
>> Please don't pessimize this further.  The time for rdtsc went from 6.5
>> cycles on AthlonXP to 65 cycles on core2 (mainly for for
>> P-state-invariance hardware synchronization I think).  Pretty soon it
>> will be as slow as an HPET and heading towards an i8254.  Adding rmb()
>> only makes it 12 cycles slower on core2, but 16 cycles (almost 3 times)
>> slower on AthlonXP.
> AthlonXP does not look as interesting target for optimizations. Fom what I
> can find this is PIII-era CPU.
>
>>
>> >@@ -602,8 +603,9 @@ tsc_get_timecount_low(struct timecounter *tc)
>> >{
>> >     uint32_t rv;
>> >
>> >+    rmb();
>> >     __asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
>> >-    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>> >+        : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>> >     return (rv);
>> >}
>>
>> The previous TSC-low/shrd pessimization adds only 2 cycles on AthlonXP
>> and core2.  I think it only "works" by backing the TSC's resolution
>> so low that it usually can't see its own, or at least other TSC's lack of
>> serialness.  The shift count is usually 7 or 8, so the resolution is
>> reduced from 1 cycle to 128 or 256.  Out of order times that fall in
>> the same block of 128 or 256 cycles would appear to be the same, but
>> out of order times like 129 and 127 would apear to be even more out
>> of order after a shift of 7 turns them into 128 and 0.
>>
>> Bruce


More information about the svn-src-head mailing list