svn commit: r238755 - head/sys/x86/x86
Bruce Evans
brde at optusnet.com.au
Thu Jul 26 08:02:38 UTC 2012
On Wed, 25 Jul 2012, Konstantin Belousov wrote:
> On Thu, Jul 26, 2012 at 12:15:54AM +1000, Bruce Evans wrote:
>> On Wed, 25 Jul 2012, Konstantin Belousov wrote:
>> ...
>> Most uses in FreeBSD are for timecounters. Timecounters deliver the
>> current time. This is unrelated to whatever instructions haven't
>> completed when the TSC is read. Except possibly when the time needs
>> to be synchronized across CPUs, and when the uncompleted instruction
>> is a TSC read.
>>
>>> For the tsc test, this means that after the change RDTSC executions
>>> are not reordered among themselves on a single core. As I understand
>>> it, the CPU notes no dependency between two reads of the tsc by RDTSC,
>>> which allows a later read to return a lower counter value.
>>
>> Gak. Even when they are in the same instruction sequence? Even though
>> the TSC reads fixed registers and some other instructions in the sequence
>> between the TSC reads use these registers? The CPU would have to do
>> significant register renaming to break this.
> I can only speculate, but I believe that any modern CPU executes RDTSC
> as at least two separate steps: one is the read from the internal
> counter, and the second is the register update. It seems that the first
> kind of action is not serialized. I have no other explanation for Jim's
> findings.
In a reply to your later mail (made earlier), I quoted the Athlon64
manual documenting this problem (everything except exactly where the
serialization is applied). The delay is similar to what happens in
software if the thread is preempted between reading the hardware time
and using the result. It doesn't help to serialize the read and
the use without serializing everything between, which costs more.
Most uses don't care about the delay (else they need more than
serialization to limit it). But if we care then we might have to
use a slow new instruction like rdtscp to tell the hardware to care,
or add slow locking to uses of the result in software. That needs more
than critical_enter(), since critical_enter() doesn't stop fast
interrupt handlers. BTW, binuptime() is supposed to work in fast
interrupt handlers. This is fragile but useful.
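To make the read-ordering question concrete, here is a minimal userland
sketch (not the kernel code from the commit). It compares a plain RDTSC
read with one preceded by LFENCE, which is an assumption about what rmb()
expands to on amd64, and counts how often a later read on the same thread
returns a lower value. The names tsc_read(), tsc_read_fenced() and
tsc_backward_steps() are hypothetical. Whether LFENCE is the right fence
depends on the CPU vendor and model, and the non-x86 branch is only a
placeholder so that the sketch compiles elsewhere.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
/* Plain RDTSC: nothing stops the CPU from executing it out of order. */
static inline uint64_t
tsc_read(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return (((uint64_t)hi << 32) | lo);
}

/* LFENCE before RDTSC (assumed equivalent of rmb() on amd64). */
static inline uint64_t
tsc_read_fenced(void)
{
    __asm__ __volatile__("lfence" : : : "memory");
    return (tsc_read());
}
#else
#include <time.h>
/* Placeholder for non-x86: a monotonic clock in nanoseconds. */
static inline uint64_t
tsc_read(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ((uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec);
}

static inline uint64_t
tsc_read_fenced(void)
{
    return (tsc_read());
}
#endif

/* Count how many of n successive reads went backwards. */
static unsigned
tsc_backward_steps(uint64_t (*readfn)(void), unsigned n)
{
    uint64_t prev, cur;
    unsigned i, steps;

    steps = 0;
    prev = readfn();
    for (i = 0; i < n; i++) {
        cur = readfn();
        if (cur < prev)
            steps++;
        prev = cur;
    }
    return (steps);
}
```

Calling tsc_backward_steps(tsc_read, n) and
tsc_backward_steps(tsc_read_fenced, n) from a thread pinned to one CPU
gives a rough version of the single-core test discussed above. It says
nothing about the delay between the read and the use of the result.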
>>> {
>>>
>>> + rmb();
>>> return (rdtsc32());
>>> }
>>
>> Please don't pessimize this further. The time for rdtsc went from 6.5
>> cycles on AthlonXP to 65 cycles on core2 (mainly for
>> P-state-invariance hardware synchronization I think). Pretty soon it
>> will be as slow as an HPET and heading towards an i8254. Adding rmb()
>> only makes it 12 cycles slower on core2, but 16 cycles (almost 3 times)
>> slower on AthlonXP.
> AthlonXP does not look like an interesting target for optimizations.
> From what I can find, this is a PIII-era CPU.
Since CPUs hit the frequency wall just after AthlonXP, it is almost
as fast as a single modern CPU. Much faster than a modern CPU for rdtsc,
and already optimized. Probably much faster than a PIII for systemy
things like rdtsc.
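Per-read costs like the ones quoted above can be estimated with a simple
loop. The following is a rough userland sketch, not the benchmark used for
those figures. ticks_now(), read_fence() and avg_ticks_per_read() are
hypothetical names, LFENCE stands in for rmb() as an assumption, the
result is in TSC ticks (which need not equal core cycles), and loop
overhead is included in the average.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
static inline uint64_t
ticks_now(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return (((uint64_t)hi << 32) | lo);
}

static inline void
read_fence(void)
{
    __asm__ __volatile__("lfence" : : : "memory");
}
#else
#include <time.h>
/* Placeholder for non-x86: nanoseconds from a monotonic clock. */
static inline uint64_t
ticks_now(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ((uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec);
}

static inline void
read_fence(void)
{
    __asm__ __volatile__("" : : : "memory");    /* compiler barrier only */
}
#endif

/* Average ticks per counter read over n back-to-back reads. */
static uint64_t
avg_ticks_per_read(int fenced, uint32_t n)
{
    volatile uint64_t sink;     /* keeps the reads from being optimized out */
    uint64_t start, end;
    uint32_t i;

    if (n == 0)
        return (0);
    sink = 0;
    start = ticks_now();
    for (i = 0; i < n; i++) {
        if (fenced)
            read_fence();
        sink = ticks_now();
    }
    end = ticks_now();
    (void)sink;
    return ((end - start) / n);
}
```

Comparing avg_ticks_per_read(0, n) with avg_ticks_per_read(1, n) for a
large n gives an estimate of what the fence adds on a given CPU. The
numbers are noisy unless the thread is pinned and the machine is idle.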
Bruce