svn commit: r238755 - head/sys/x86/x86
Jim Harris
jim.harris at gmail.com
Wed Jul 25 15:29:59 UTC 2012
On Wed, Jul 25, 2012 at 6:37 AM, Konstantin Belousov
<kostikbel at gmail.com> wrote:
> On Wed, Jul 25, 2012 at 03:29:34PM +0300, Andriy Gapon wrote:
>> on 25/07/2012 13:21 Konstantin Belousov said the following:
>> > On Wed, Jul 25, 2012 at 10:20:02AM +0300, Andriy Gapon wrote:
>> >> on 25/07/2012 01:10 Jim Harris said the following:
>> >>> Author: jimharris
>> >>> Date: Tue Jul 24 22:10:11 2012
>> >>> New Revision: 238755
>> >>> URL: http://svn.freebsd.org/changeset/base/238755
>> >>>
>> >>> Log:
>> >>> Add rmb() to tsc_read_##x to enforce serialization of rdtsc captures.
>> >>>
>> >>> Intel Architecture Manual specifies that rdtsc instruction is not serialized,
>> >>> so without this change, TSC synchronization test would periodically fail,
>> >>> resulting in use of HPET timecounter instead of TSC-low. This caused
>> >>> severe performance degradation (40-50%) when running high IO/s workloads due to
>> >>> HPET MMIO reads and GEOM stat collection.
>> >>>
>> >>> Tests on Xeon E5-2600 (Sandy Bridge) 8C systems were seeing TSC synchronization
>> >>> fail approximately 20% of the time.
>> >>
>> >> Shouldn't the synchronization test itself be fixed, if it's the culprit?
>> > Synchronization test for what ?
>>
>> The synchronization test mentioned above.
>> So, oops, very sorry - I missed the fact that the change was precisely in the
>> test. I confused it with another place where the tsc is used. Thank you for
>> pointing this out.
>>
>> >> Or is this change universally good for the real uses of TSC?
>> >
>> > What I understood from the Intel SDM, and also from additional experiments
>> > which Jim kindly made despite me being annoying as usual, is that the 'read
>> > memory barrier' (i.e., LFENCE) is used there for its secondary implementation
>> > effects, not as a load/load barrier as you might assume.
>> >
>> > According to the SDM, LFENCE fully drains the execution pipeline (but, unlike
>> > MFENCE, does not drain the write buffers). The result is that RDTSC does not
>> > start before previous instructions have finished.
>>
>> Yes, I am fully aware of this.
>>
>> > For the tsc test, this means that after the change, RDTSC executions are not
>> > reordered on a single core among themselves. As I understand it, the CPU notes
>> > no dependency between two reads of the tsc by RDTSC, which allows the
>> > later read to return a lower counter value. Intel fixed this by
>> > introducing the RDTSCP instruction, which is defined to be a serialization
>> > point, and using it (instead of the LFENCE; RDTSC sequence) also fixes the
>> > test, as confirmed by Jim.
>>
>> Yes. I think Intel previously recommended preceding rdtsc with cpuid for
>> all the same reasons. I am not sure if there is any performance difference
>> compared to lfence.
>> Unfortunately, rdtscp is not available on all CPUs, so using it would require
>> extra work.
>>
>> > In fact, I now think that we should also apply the following patch.
>> > Otherwise, consecutive calls to e.g. binuptime(9) could return decreasing
>> > time stamps. Note that libc's __vdso_gettc.c already has an LFENCE near the
>> > tsc reads; it was added for a different reason, but is apparently needed
>> > for this reason too.
>> >
>> > diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
>> > index 085c339..229b351 100644
>> > --- a/sys/x86/x86/tsc.c
>> > +++ b/sys/x86/x86/tsc.c
>> > @@ -594,6 +594,7 @@ static u_int
>> > tsc_get_timecount(struct timecounter *tc __unused)
>> > {
>> >
>> > + rmb();
>> > return (rdtsc32());
>> > }
>> >
>>
>> This makes sense to me. We probably want correctness over performance here.
>> [BTW, I originally thought that the change was here; brain malfunction]
>>
>> > @@ -602,8 +603,9 @@ tsc_get_timecount_low(struct timecounter *tc)
>> > {
>> > uint32_t rv;
>> >
>> > + rmb();
>> > __asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
>> > - : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>> > + : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>> > return (rv);
>> > }
>> >
>>
>> It would be correct here too, but I am not sure it would make any difference,
>> given that some lower bits are discarded anyway. It probably depends on the
>> exact CPU.
>>
>>
>> And, oh hmm, I read the AMD Software Optimization Guide for AMD Family 10h
>> Processors, and it suggests using cpuid (with a note that it may be
>> intercepted in virtualized environments) or _mfence_ in the discussed
>> role (Appendix F of the document). Googling for 'rdtsc mfence lfence'
>> yields some interesting results.
> Yes, MFENCE for AMD.
>
> Since I was infected with these Google results anyway, I looked at the
> Linux code. Apparently, they use MFENCE on AMD and LFENCE on Intel.
> They also seem to use LFENCE on VIA. Intel documentation claims that
> MFENCE does not serialize instruction execution, which is contrary to
> the LFENCE behaviour relied on here.
>
> So we definitely want to add some barrier right before rdtsc. And we do
> want LFENCE for Intels. The patch below results in the following code:
>
> Dump of assembler code for function tsc_get_timecount_lfence:
> 0xffffffff805563a0 <+0>: push %rbp
> 0xffffffff805563a1 <+1>: mov %rsp,%rbp
> 0xffffffff805563a4 <+4>: lfence
> 0xffffffff805563a7 <+7>: rdtsc
> 0xffffffff805563a9 <+9>: leaveq
> 0xffffffff805563aa <+10>: retq
> Dump of assembler code for function tsc_get_timecount_low_lfence:
> 0xffffffff805556a0 <+0>: push %rbp
> 0xffffffff805556a1 <+1>: mov %rsp,%rbp
> 0xffffffff805556a4 <+4>: lfence
> 0xffffffff805556a7 <+7>: mov 0x30(%rdi),%rcx
> 0xffffffff805556ab <+11>: rdtsc
> 0xffffffff805556ad <+13>: shrd %cl,%edx,%eax
> 0xffffffff805556b0 <+16>: leaveq
> 0xffffffff805556b1 <+17>: retq
> which I would call a perfect outcome.
>
> The patch below still leaves libc suitable only for Intel and VIA, not for
> AMD, but I will take care of that later.
>
> diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
> index c253a96..06cfe10 100644
> --- a/sys/x86/x86/tsc.c
> +++ b/sys/x86/x86/tsc.c
> @@ -83,6 +83,10 @@ static void tsc_freq_changing(void *arg, const struct cf_level *level,
> int *status);
> static unsigned tsc_get_timecount(struct timecounter *tc);
> static unsigned tsc_get_timecount_low(struct timecounter *tc);
> +static unsigned tsc_get_timecount_lfence(struct timecounter *tc);
> +static unsigned tsc_get_timecount_low_lfence(struct timecounter *tc);
> +static unsigned tsc_get_timecount_mfence(struct timecounter *tc);
> +static unsigned tsc_get_timecount_low_mfence(struct timecounter *tc);
> static void tsc_levels_changed(void *arg, int unit);
>
> static struct timecounter tsc_timecounter = {
> @@ -262,6 +266,10 @@ probe_tsc_freq(void)
> (vm_guest == VM_GUEST_NO &&
> CPUID_TO_FAMILY(cpu_id) >= 0x10))
> tsc_is_invariant = 1;
> + if (cpu_feature & CPUID_SSE2) {
> + tsc_timecounter.tc_get_timecount =
> + tsc_get_timecount_mfence;
> + }
> break;
> case CPU_VENDOR_INTEL:
> if ((amd_pminfo & AMDPM_TSC_INVARIANT) != 0 ||
> @@ -271,6 +279,10 @@ probe_tsc_freq(void)
> (CPUID_TO_FAMILY(cpu_id) == 0xf &&
> CPUID_TO_MODEL(cpu_id) >= 0x3))))
> tsc_is_invariant = 1;
> + if (cpu_feature & CPUID_SSE2) {
> + tsc_timecounter.tc_get_timecount =
> + tsc_get_timecount_lfence;
> + }
> break;
> case CPU_VENDOR_CENTAUR:
> if (vm_guest == VM_GUEST_NO &&
> @@ -278,6 +290,10 @@ probe_tsc_freq(void)
> CPUID_TO_MODEL(cpu_id) >= 0xf &&
> (rdmsr(0x1203) & 0x100000000ULL) == 0)
> tsc_is_invariant = 1;
> + if (cpu_feature & CPUID_SSE2) {
> + tsc_timecounter.tc_get_timecount =
> + tsc_get_timecount_lfence;
> + }
> break;
> }
>
> @@ -328,7 +344,15 @@ init_TSC(void)
>
> #ifdef SMP
>
> -/* rmb is required here because rdtsc is not a serializing instruction. */
> +/*
> + * RDTSC is not a serializing instruction, so we need to drain
> + * instruction stream before executing it. It could be fixed by use of
> + * RDTSCP, except the instruction is not available everywhere.
> + *
> + * Insert both MFENCE for AMD CPUs, and LFENCE for others (Intel and
> + * VIA), and assume that SMP test is only performed on CPUs that have
> + * SSE2 anyway.
> + */
> #define TSC_READ(x) \
> static void \
> tsc_read_##x(void *arg) \
> @@ -337,6 +361,7 @@ tsc_read_##x(void *arg) \
> u_int cpu = PCPU_GET(cpuid); \
> \
> rmb(); \
> + mb(); \
> tsc[cpu * 3 + x] = rdtsc32(); \
I've seen bde@'s comments, so perhaps this patch will not move
forward, but I'm wondering if it would make sense here to just call
the new tsc_get_timecount_mfence() function rather than explicitly
calling mb() and then rdtsc32().
> }
> TSC_READ(0)
> @@ -487,7 +512,16 @@ init:
> for (shift = 0; shift < 31 && (tsc_freq >> shift) > max_freq; shift++)
> ;
> if (shift > 0) {
> - tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
> + if (cpu_feature & CPUID_SSE2) {
> + if (cpu_vendor_id == CPU_VENDOR_AMD) {
> + tsc_timecounter.tc_get_timecount =
> + tsc_get_timecount_low_mfence;
> + } else {
> + tsc_timecounter.tc_get_timecount =
> + tsc_get_timecount_low_lfence;
> + }
> + } else
> + tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
> tsc_timecounter.tc_name = "TSC-low";
> if (bootverbose)
> printf("TSC timecounter discards lower %d bit(s)\n",
> @@ -592,23 +626,55 @@ sysctl_machdep_tsc_freq(SYSCTL_HANDLER_ARGS)
> SYSCTL_PROC(_machdep, OID_AUTO, tsc_freq, CTLTYPE_U64 | CTLFLAG_RW,
> 0, 0, sysctl_machdep_tsc_freq, "QU", "Time Stamp Counter frequency");
>
> -static u_int
> +static inline u_int
> tsc_get_timecount(struct timecounter *tc __unused)
> {
>
> return (rdtsc32());
> }
>
> -static u_int
> +static inline u_int
> tsc_get_timecount_low(struct timecounter *tc)
> {
> uint32_t rv;
>
> __asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
> - : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> + : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> return (rv);
> }
>
> +static inline u_int
> +tsc_get_timecount_lfence(struct timecounter *tc __unused)
> +{
> +
> + rmb();
> + return (rdtsc32());
> +}
> +
> +static inline u_int
> +tsc_get_timecount_low_lfence(struct timecounter *tc)
> +{
> +
> + rmb();
> + return (tsc_get_timecount_low(tc));
> +}
> +
> +static inline u_int
> +tsc_get_timecount_mfence(struct timecounter *tc __unused)
> +{
> +
> + mb();
> + return (rdtsc32());
> +}
> +
> +static inline u_int
> +tsc_get_timecount_low_mfence(struct timecounter *tc)
> +{
> +
> + mb();
> + return (tsc_get_timecount_low(tc));
> +}
> +
> uint32_t
> cpu_fill_vdso_timehands(struct vdso_timehands *vdso_th)
> {